Archive for the ‘Data Lakes’ Category

Definition: “Extremely scalable analytics – analyzing petabytes of structured and unstructured data at high velocity.”

Definition: “Big data is data that exceeds the processing capacity of conventional database systems.”

Big Data has three characteristics:

Variety – Structured and unstructured data

Velocity – Time sensitive data that should be used simultaneously with its enterprise data counterparts, in order to maximize value

Volume – Size of data exceeds the nominal storage capacity of the enterprise.


– In 2011, the global output of data was estimated to be 1.8 zettabytes (10^21 bytes)

– 90% of the world’s data has been created in the last 2 years.

– We create 2.5 quintillion (10^18) bytes of data per day (from sensors, social media posts, digital pictures, etc.)

– The digital world will increase in capacity 44-fold between 2009 and 2020.

– Only 5% of data is being created in structured forms; 95% is largely unstructured.

– 80% of the effort involved in dealing with unstructured data is reconditioning ill-formed data to well-formed data (cleaning it up).
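As an illustration of that reconditioning effort, here is a small Python sketch (the record layout and formats are invented for illustration) that normalizes ill-formed log lines into well-formed tuples:

```python
import re

# Hypothetical raw records: inconsistent dates, stray whitespace, mixed units.
raw_records = [
    "  2011-07-04 , temp=72F ",
    "07/05/2011,temp = 22C",
    "2011-07-06,temp=71F",
]

def recondition(record):
    """Normalize one ill-formed record into (ISO date, temperature in Celsius)."""
    date_part, temp_part = [field.strip() for field in record.strip().split(",")]
    # Normalize MM/DD/YYYY dates to ISO YYYY-MM-DD.
    m = re.match(r"(\d{2})/(\d{2})/(\d{4})", date_part)
    if m:
        date_part = f"{m.group(3)}-{m.group(1)}-{m.group(2)}"
    # Extract the numeric temperature and its unit; convert Fahrenheit to Celsius.
    value, unit = re.search(r"temp\s*=\s*(\d+)\s*([FC])", temp_part).groups()
    celsius = (int(value) - 32) * 5 / 9 if unit == "F" else int(value)
    return date_part, round(celsius, 1)

clean = [recondition(r) for r in raw_records]
```

Trivial on three records, but this kind of parsing, normalizing, and unit reconciliation is exactly where that 80% of effort goes at scale.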

Performance Statistics (I will start tracking more closely):

– Traditional data storage costs approximately $5/GB, but storing the same data using Hadoop only costs $0.25/GB – yes, 25 cents/GB. Hmm!

– Facebook stores more than 20 petabytes of data across 23,000 cores, with 50 terabytes of raw data being generated per day.

– eBay uses over 2,600 clustered Hadoop servers.



Heads Up – This is a stream of consciousness! Please be patient with me while I incrementally refine it over time. Critical feedback is welcome!

There are several different ways to define when data becomes big data. The two traditional approaches are based on some variant of:

— Big is the sample size of data after which the asymptotic properties of the exploratory data analysis (EDA) methods kick in, yielding valid results

— Big is the gross size of the data under investigation (e.g., the size of a database, data mart, data warehouse, etc.).

While both of these measures tend to provide an adequate means through which one can discuss the sizing issue, they are both correlative, not causal, in nature. But before getting into a more precise definition of big, let’s look at some characteristics of data.

Regardless of what you are told, all data touched or influenced by natural forces (e.g., the hand of man, nature, etc.) has structure (even man-made randomly generated data). This structure can be either real (providing meaningful insights into the behaviors of interest) or spurious (trivial and/or uncorrelated insights). The bigger the data, the more likely the structure can be found.

Data, at its core, can be described in terms of three important characteristics: condition, location, and population. Condition is the state of the data’s readiness for analysis. If one can use it as is, it is “well conditioned.” If the data needs to be preconditioned/transformed prior to analysis, then it is “ill conditioned.” Location is where the data resides, both physically (databases, logs, etc.) and in time (events). Data populations describe how data is grouped around specific qualities and/or characteristics.
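As a sketch, these three characteristics can be captured in a small data structure (the names here are my own, not a standard API):

```python
from dataclasses import dataclass

@dataclass
class DataProfile:
    """Illustrative model of the three characteristics of data described above."""
    condition: str   # "well" if usable as-is, "ill" if it needs preconditioning
    location: str    # physical and temporal residence, e.g. "web logs, 2011-Q3"
    population: str  # the grouping quality, e.g. "active customers"

    def ready_for_analysis(self) -> bool:
        # Only well-conditioned data can be analyzed without transformation.
        return self.condition == "well"

# An ill-conditioned clickstream source must be transformed before analysis.
clickstream = DataProfile("ill", "web logs, 2011-Q3", "visitors")
```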

Small data represents a random sample of a known population that is not expected to encounter changes in its composition (condition, location, and population) over the targeted time frame. It tends to address specific and well-defined problems through straightforward applications of problem-specific methods. In essence, small data is limited to answering questions about what we know we don’t know (the second level of knowledge).


Big data, on the other hand, represents multiple, non-random samples of unknown populations, shifting in composition (condition, location, and population) within the target interval. Analyzing big data often requires complex analyses that deal with post-hoc problem assessments, where straightforward solutions cannot be obtained. This is the realm where one discovers and answers questions in areas where we don’t know what we don’t know (the third level of knowledge).
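The contrast between these two kinds of sampling can be made concrete with a toy sketch (the populations, drift, and selection bias here are invented for illustration):

```python
import random

random.seed(42)

# Small data: one random sample from a known, stable population.
known_population = [random.gauss(100, 15) for _ in range(10_000)]
small_sample = random.sample(known_population, 100)

# Big data: multiple non-random samples whose underlying population shifts
# in composition over the collection interval (the mean drifts each "day").
def drifting_stream(days, per_day):
    for day in range(days):
        mean = 100 + 5 * day  # composition shifts over time
        # Non-random: we only observe values above a threshold (selection bias).
        yield [x for x in (random.gauss(mean, 15) for _ in range(per_day))
               if x > mean]

big_samples = list(drifting_stream(days=5, per_day=1000))
```

Any estimate built from `big_samples` as if it were a single random draw from one fixed population will be biased, which is why post-hoc assessment is needed.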

With this as a basis, we can now identify more precise quantitative measures of data size and, more importantly, of the subjects/independent variables needed to lift meaningful observations and learnings from its samples. Data describing simple problems (aka historical debris) are governed by the interaction of small numbers of independent variables or subjects. For example, the distance a car travels can be understood by analyzing two variables over time – initial starting velocity and acceleration. Informative, but not very interesting. The historical debris for complex problems is governed by the interaction of large numbers of independent variables, whose solutions often fall into the realm of non-deterministic polynomials (i.e., an analytical closed-form solution cannot be found). Consider, for example, the unbounded number of factors that influence the behavior of love.
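The two-variable car example is simple enough to write down directly (a toy sketch using the standard kinematics formula):

```python
def distance(v0, a, t):
    """Distance traveled given initial velocity v0 (m/s) and constant
    acceleration a (m/s^2) over time t (s): d = v0*t + (1/2)*a*t**2."""
    return v0 * t + 0.5 * a * t ** 2

# A car starting at 10 m/s and accelerating at 2 m/s^2 for 5 s travels 75 m.
d = distance(10, 2, 5)
```

Two independent variables, one closed-form answer: this is the deterministic end of the spectrum that complex problems leave behind.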

A measure of the amount of knowledge contained in data can therefore be defined through understanding the total possible state space of the system, which is proportional to all the possible ways (combinations and/or permutations) the variables/factors or subjects can interact. The relative knowledge contained within two variables/subjects (A and B), for example, can be assessed by looking at A alone, then B alone, and then A and B together, for a total of 3 combinatorial spaces. Three variables/subjects (A, B, and C) give us a knowledge state space of 7. Four subjects result in 15. And so on.

An interesting point is that there is a closed-form solution, based on summing up all the possible combinations where the order of knowledge is NOT important:

K(n) = C(n,1) + C(n,2) + … + C(n,n) = 2^n - 1

and where the order of knowledge is important:

K(n) = P(n,1) + P(n,2) + … + P(n,n), where P(n,k) = n!/(n-k)!
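This counting can be checked directly in Python (a small sketch of my own, using the standard library’s combinatorics helpers):

```python
from math import comb, perm

def knowledge_space(n, ordered=False):
    """Number of non-empty ways n variables/subjects can interact:
    combinations (order not important) or permutations (order important)."""
    count = perm if ordered else comb
    return sum(count(n, k) for k in range(1, n + 1))

# Matches the counts in the text: 2 subjects -> 3, 3 -> 7, 4 -> 15,
# i.e. 2**n - 1 when order is not important.
sizes = [knowledge_space(n) for n in (2, 3, 4)]
```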
A plot of the knowledge space (where order is not important) against the number of variables/subjects shows exponential growth.


What this tells us is that as we explore the integration of large variable sets (subjects), our ability to truly define/understand complex issues (behaviors) increases exponentially. Note – Where the order of knowledge is important, the asymptotic nature (shape) is the same.

More importantly, it gives a direct measure of the number of independent subjects that are needed to completely define a knowledge set. Specifically,

Theorem: The independent interaction of 77 variable/subject areas asymptotically defines all knowledge contained within that space.
In other words, as we identify, integrate, and analyze subjects across 77 independent data sources, we exponentially increase our likelihood of completely defining the characteristics (behaviors) of the systems contained therein.

Big data, therefore, is defined as:

Definition: “Big Data” represents the historical debris (observable data) resulting from the interaction of between 70 and 77 independent variables/subjects, from which non-random samples of unknown populations, shifting in composition within a targeted time frame, can be taken.

Definition: “Knowledge Singularity” is the maximum theoretical number of independent variables/subjects that, if combined and/or permuted, would represent a complete body of knowledge.

It is in the aggregation of the possible 70–77 independent subject areas (patients, doctors, donors, activists, buyers, good guys, bad guys, shipping, receiving, etc.) from internal and external data sources (logs, tweets, Facebook, LinkedIn, blogs, databases, data marts, data warehouses, etc.) that the initial challenge resides, for this is the realm of Data Lakes. And that is yet another story.

Lots of stuff, some of it interesting I hope, and more to come later. But this is enough as a field note for now.


You want to find new sources of revenue? Then stop over-investing in those enterprise applications, because the real value of a business can be found in your under-invested data assets! The single most important capability that will impact the growth of top-line revenue and/or bottom-line margin over the next few years will be Data Monetization. Data science, the principal means through which data is/will be monetized, is a multidisciplinary capability designed to extract insights from relatively unrelated and often disparate data sources. It is estimated that data science can generate $2 to $3 of new long-term revenue for every $1 of product/services revenue currently derived by a company that does not use data science. Achieving these multiples, however, will require a fundamental change in not only how we think of our enterprise, but who is key in making it happen.

Data is a byproduct of people, applications, and time. It is the historical digital debris that documents the behavioral existence of our customers, clients, partners, and employees. But most data, in and of itself, does not tell us anything useful. It can reaffirm what we know (e.g., how many students we have) and often will help us answer questions about what we know we don’t know (e.g., how many students are failing). But no real new value is captured through these understandings. The real exponential growth in actionable knowledge comes from discovering the secrets in what we don’t know we don’t know (e.g., why students are failing and how to help them succeed), which is locked in the disparate structured and unstructured data found in and out of the enterprise.

Discovering the deep secrets in these vast data sets will require a different way of thinking, behaving, and investing. While most successful organizations spend their time and money on enterprise applications (EA), tomorrow’s business success will come through developing new capabilities in data science. This change will impact/be impacted by:

>> Resources:: Highly skilled resources are needed to convert disparate unstructured and structured data directly into revenue (reselling the insights themselves) or indirectly into secondary products/services (insights that lead to new product/service innovations). These highly skilled resources often come from non-engineering fields such as mathematics, statistics, and physics.

>> Data Lakes (AKA Big Data):: This is the data that does not fit in the box. It is structured and unstructured data from all sources, not just those relevant to the apparent business domain of interest. New distributed/federated means of data aggregation are needed, since no single repository can hold the vastness of the data needed for effective data science (by definition). Data lakes go beyond traditional data schemas, enterprise data architectures, data marts (which contain data subjects), and data warehouses (which aggregate data subjects and can vary over time). Think of Hadoop for distributed processing of very large data sets.

>> Distributed Analytics and Intelligence (DAI):: Insights are only as relevant as one’s ability not only to answer the questions we know we don’t know (the second level of knowledge), but also to identify and address the questions that we don’t know we don’t know (the third level of knowledge). Getting knowledge out of very large data volumes that are structured and unstructured, as well as rapidly changing (high velocity), requires new analytical tools, systems, and enterprise architectures. Think of Pentaho and Pneuron for distributed analytics operating against very large data sets (data lakes).
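The map/reduce pattern that Hadoop and similar platforms apply at scale can be sketched in a few lines of pure Python (an illustrative stand-in with toy data, not the Hadoop API):

```python
from collections import Counter
from functools import reduce

# Toy "shards" of a data lake; in Hadoop each would live on a different node.
documents = [
    "big data big insights",
    "data lakes hold big data",
    "insights from data",
]

def map_phase(doc):
    # Each node independently counts terms in its own shard.
    return Counter(doc.split())

def reduce_phase(a, b):
    # Partial counts from the nodes are merged into one global view.
    return a + b

word_counts = reduce(reduce_phase, map(map_phase, documents))
```

The point is the shape of the computation: independent local work, then a merge, which is what lets the same analysis scale from three strings to petabytes.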


Breaking out of the business box you are in cannot be done by just defining the box. To be successful, you need the insights that can only come from those areas found in the things you don’t know you don’t know, which is the world of big data and data science.





Every company should have a plan for dealing with the exponential growth in their data. That is, they need a Big Data Strategy. While the term Big Data has been thrown around a lot in both business and technical publications, very few stop to define it in a way that makes it a useful and actionable business concept.

Big data is characterized by the dramatic growth in the volume of data (internally generated and from external sources) available to businesses. It is a characteristic of a company’s IP-generating capability and presents new opportunities for companies to grow revenues through better customer (AKA data) insights. Data growth has been around since the beginning of time, but has become a challenge given the recent improvements in integration, adoption of the cloud, and leveraging of social networks.

There are some very powerful and somewhat overwhelming statistics driving the big data discussion that should make business stop, listen, and think:

– $600 to buy a disk drive that can store all of the world’s music

– 5 billion mobile phones in use in 2010

– 30 billion pieces of content shared on Facebook every month

– 40% projected growth in global data generated

– 235 terabytes data collected by the US Library of Congress by April 2011

– 15 out of 17 sectors in the United States have more data stored per company than the US Library of Congress

– $300 billion potential annual value to US health care—more than double the total annual health care spending in Spain

– €250 billion potential annual value to Europe’s public sector administration—more than the GDP of Greece

– $600 billion potential annual consumer surplus from using personal location data globally

– 60% potential increase in retailers’ operating margins possible with big data

– By 2018, the United States alone could face a shortage of 140,000–190,000 deep analytical talent positions and 1.5 million data-savvy managers. This alone is huge!

McKinsey Global Institute understands this issue quite well. In their report “Big data: The next frontier for innovation, competition, and productivity,” released May 2011, they identified five ways that Big Data could create value:

1. Big data can unlock significant value by making information transparent and usable at much higher frequency.

2. As organizations create and store more transactional data in digital form, they can collect more accurate and detailed performance information on everything from product inventories to sick days, and therefore expose variability and boost performance. Leading companies are using data collection and analysis to conduct controlled experiments to make better management decisions; others are using data for everything from low-frequency forecasting to high-frequency nowcasting to adjust their business levers just in time.

3. Big data allows ever-narrower segmentation of customers and therefore much more precisely tailored products or services.

4. Sophisticated analytics can substantially improve decision-making.

5. Big data can be used to improve the development of the next generation of products and services. For instance, manufacturers are using data obtained from sensors embedded in products to create innovative after-sales service offerings such as proactive maintenance (preventive measures that take place before a failure occurs or is even noticed).
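The ever-narrower segmentation in point 3 can be sketched in a few lines (the customers, spend figures, and thresholds are invented for illustration):

```python
# Toy customer base: name -> monthly spend in dollars.
customers = {
    "alice": 120.0, "bob": 45.0, "carol": 310.0,
    "dave": 15.0, "erin": 95.0, "frank": 220.0,
}

def segment(spend):
    """Assign a customer to a spend band so offers can be tailored per band."""
    if spend >= 200:
        return "premium"
    if spend >= 75:
        return "core"
    return "occasional"

segments = {name: segment(spend) for name, spend in customers.items()}
```

With big data the same idea runs over millions of customers and hundreds of behavioral variables instead of one spend column, which is what makes the segments "ever narrower."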

So, if you are into Big Data, then spend some time mining the MGI report. It is time well $pent.
