A common data exploration question came up while talking with a British colleague in the advertising industry on Friday: how many independent subject areas should be investigated (1, 10, 100, …, N) in order to have a statistically significant chance of making a discovery with the least amount of effort? An answer can be found in “The Power of Three (3),” an application of the Knowledge Singularity when N is very small, which defines meaningful returns on knowledge discovery costs.
As I discussed in the field note “FIELD NOTE: What Makes Big Data Big – Some Mathematics Behind Its Quantification,” perfect insight can be gained asymptotically as one systematically approaches the Knowledge Singularity (77 independent subject areas out of an N-dimensional universe where N >> 77). While this convergence on infinite knowledge (insight) is theoretically interesting, it is preceded by a more practical application when N is three (3); that is, when one explores the combinatorial space of only three subjects.
Let insight 1 (I_1) represent the insights implicit in data set 1 (Ds_1) and insight 2 (I_2) represent the insights implicit in data set 2 (Ds_2), where the intersection of data sets 1 and 2 is empty (Ds_1 ∩ Ds_2 = {}). Further, let insight N (I_N) represent the insights implicit in data set N (Ds_N), where the intersection of data set N and all previous data sets is empty (Ds_1 ∩ Ds_2 ∩ … ∩ Ds_N = {}). The total insight implicit in all data sets, 1 through N, is therefore proportional to the insights gained by exploring the total combinations of all data sets (from the previous field note). That is,
In order to compute a big data ROI, we need to quantify the cost of knowledge discovery. Using current knowledge exploration techniques, the cost of discovering insights in any data set is proportional to the size of the data:
Discovery Cost (I_N) = Knowledge Exploration[Size of Data Set N]
Therefore, a big data ROI could be measured by:
Big Data ROI = Total Insights [Ds_1 … Ds_N] / Total Discovery Cost [Ds_1 … Ds_N]
If we assume the explored data sets to be equal in size (which is generally not the case, but does not matter for this analysis), then:
Discovery Cost (I_1) = Discovery Cost (I_2) = Discovery Cost (I_N)
or
Total Discovery Cost [Ds_1 U Ds_2 U … U Ds_N] = N x Discovery Cost [Ds] = O(N), that is, proportional to N, where Ds is any one of the equally sized data sets
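For readers who prefer symbols, the equal-size assumption and the ROI definition can be combined in a short derivation. The names c (the cost of exploring one data set), C (discovery cost) and I_total (total insight) are shorthand introduced here for illustration only; the form of I_total(N) is deliberately left open, since it comes from the earlier field note.

```latex
\begin{align*}
C(Ds_1) &= C(Ds_2) = \dots = C(Ds_N) = c \\
C_{\text{total}}(N) &= \sum_{i=1}^{N} C(Ds_i) = N\,c \\
\text{ROI}(N) &= \frac{I_{\text{total}}(N)}{C_{\text{total}}(N)} = \frac{I_{\text{total}}(N)}{N\,c}
\end{align*}
```

Under these assumptions the cost side grows only linearly, so how ROI behaves depends entirely on how quickly total insight grows with N.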
Thus,
and
We can now plot Big Data ROI as a function of N, for small values of N,
That was fun, but so what? The single biggest ROI in knowledge discovery comes when insights are looked for in and across the very first two combined independent data sets. However, while the total knowledge gained increases exponentially with each additional independent data set, the return on investment asymptotically approaches a finite limit as N approaches infinity. One can therefore reasonably argue that, given a limited discovery investment (budget), a minimum of two subjects is needed, while three ensures some level of sufficiency.
Take the advertising market (McCann, Tag, Goodby, etc.), for example. Significant insight can be gained by exploring the necessary combination of enterprise data (campaign-specific data) and social data (how the market reacts), two independent subject areas. However, to gain some level of assurance, or sufficiency, the addition of one more data set such as IT data (click-throughs, induce hits, etc.) increases the overall ROI without materially increasing the costs.
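To make the combinatorial space of the advertising example concrete, here is a minimal Python sketch. It rests on two assumptions that are not taken from the earlier field note: every non-empty combination of subject areas counts as one place to look for insight, and each data set costs one unit to explore. The function names and the labels "enterprise", "social" and "IT" are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def nonempty_combinations(subjects):
    """Return every non-empty combination of the given subject areas."""
    return [
        combo
        for size in range(1, len(subjects) + 1)
        for combo in combinations(subjects, size)
    ]


def combinations_per_unit_cost(subjects, unit_cost=1.0):
    """Assumed ROI proxy: combinations explored per unit of discovery cost.

    Cost follows the equal-size assumption: N x unit_cost.
    """
    total_cost = len(subjects) * unit_cost
    return len(nonempty_combinations(subjects)) / total_cost


# Hypothetical labels for the three data sets in the advertising example.
subjects = ["enterprise", "social", "IT"]

for n in range(1, len(subjects) + 1):
    chosen = subjects[:n]
    combos = nonempty_combinations(chosen)
    print(n, len(combos), round(combinations_per_unit_cost(chosen), 2))
```

Under these assumptions, one data set offers a single place to look, two offer three (each alone plus the pair), and three offer seven, while the cost only rises from one unit to three. This proxy counts opportunities to look, not insights actually found.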
This combination of at least three independent data sets to ensure insightful sufficiency is what is being called “The Power of Three.” While a bit of a mathematical and statistical journey, this should make intuitive sense. Think about the benefits that come from combining subjects like Psychology, Marketing, and Computer Science. While any one or two is great, all three provide the basis for a compelling ability to cause consumer behavior, not just to report on it (computer science) or correlate around it (computer science and marketing).