
Jeffrey Leek, Assistant Professor of Biostatistics at the Johns Hopkins Bloomberg School of Public Health, has identified six archetypal analyses. As presented, they range from the least to the most complex in terms of knowledge, cost, and time. In summary:

  • Descriptive
  • Exploratory
  • Inferential
  • Predictive
  • Causal
  • Mechanistic

1. Descriptive (least amount of effort):  The discipline of quantitatively describing the main features of a collection of data. In essence, it describes a set of data.

– Typically the first kind of data analysis performed on a data set

– Commonly applied to large volumes of data, such as census data

– The description and interpretation processes are different steps

– Univariate and bivariate are two types of descriptive statistical analyses.

– Type of data set applied to: Census Data Set – a whole population

Example: Census Data
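
To make the descriptive step concrete, here is a minimal Python sketch (pandas, with a small made-up table standing in for census-style data) that produces univariate and bivariate summaries:

    import pandas as pd

    # Hypothetical stand-in for a census-style extract; any tabular data set works.
    census = pd.DataFrame({
        "age": [23, 35, 41, 29, 57, 62, 33, 48],
        "household_income": [41000, 58000, 62000, 45000, 71000, 68000, 52000, 60000],
        "region": ["NE", "SW", "NE", "MW", "SW", "NE", "MW", "SW"],
    })

    # Univariate description: central tendency, spread, and counts for each variable.
    print(census.describe(include="all"))

    # Bivariate description: a grouped summary and a simple correlation.
    print(census.groupby("region")["household_income"].mean())
    print(census["age"].corr(census["household_income"]))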

2. Exploratory: An approach to analyzing data sets to find previously unknown relationships.

– Exploratory models are good for discovering new connections

– They are also useful for defining future studies/questions

– Exploratory analyses are usually not the definitive answer to the question at hand, but only the start

– Exploratory analyses alone should not be used for generalizing and/or predicting

– Remember: correlation does not imply causation

– Type of data set applied to: Census and Convenience Sample Data Set (typically non-uniform) – a random sample with many variables measured

Example: Microarray Data Analysis
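
As a hedged illustration (a synthetic data set rather than a microarray), the sketch below runs a simple exploratory pass, scanning the correlation matrix for previously unknown pairwise relationships that could seed future questions:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)

    # Synthetic data with one hidden relationship (y depends on x1) among unrelated columns.
    n = 200
    x1 = rng.normal(size=n)
    data = pd.DataFrame({
        "x1": x1,
        "x2": rng.normal(size=n),
        "x3": rng.normal(size=n),
        "y": 2.0 * x1 + rng.normal(scale=0.5, size=n),
    })

    # Scan all variable pairs for strong correlations (a starting point, not a conclusion).
    corr = data.corr()
    strong = (corr.abs() > 0.5) & (corr.abs() < 1.0)
    print(corr.where(strong).stack().dropna().drop_duplicates())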

3. Inferential: Aims to test theories about the nature of the world in general (or some part of it) based on samples of “subjects” taken from the world (or some part of it). That is, use a relatively small sample of data to say something about a bigger population.

– Inference is commonly the goal of statistical models

– Inference involves estimating both the quantity you care about and your uncertainty about your estimate

– Inference depends heavily on both the population and the sampling scheme

– Type of data set applied to: Observational, Cross Sectional Time Study, and Retrospective Data Set – the right, randomly sampled population

Example: Inferential Analysis
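
A minimal sketch of the inferential step, assuming only a small random sample and SciPy: estimate the population mean and quantify the uncertainty of that estimate with a 95% confidence interval.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # A small random sample standing in for measurements drawn from a much larger population.
    sample = rng.normal(loc=170.0, scale=10.0, size=50)

    # Estimate the population mean and quantify the uncertainty of that estimate.
    mean = sample.mean()
    sem = stats.sem(sample)
    ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
    print(f"estimate = {mean:.1f}, 95% CI = ({ci_low:.1f}, {ci_high:.1f})")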

4. Predictive: The various types of methods that analyze current and historical facts to make predictions about future events. In essence, to use the data on some objects to predict values for another object.

– The model predicts, but that does not mean the independent variables cause the outcome

– Accurate prediction depends heavily on measuring the right variables

– Although there are better and worse prediction models, more data and a simple model often work really well

– Prediction is very hard, especially about the future

– Type of data set applied to: Prediction Study Data Set – a training and test data set from the same population

Example: Predictive Analysis

Another Example of Predictive Analysis
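
To illustrate the training/test idea with a deliberately simple setup (synthetic data and ordinary least squares via scikit-learn), the sketch below fits a model on one split and reports its error on held-out data from the same population:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)

    # Synthetic "historical facts": one measured variable driving the outcome, plus noise.
    X = rng.uniform(0, 10, size=(300, 1))
    y = 3.0 * X[:, 0] + rng.normal(scale=2.0, size=300)

    # Train on part of the data, then test on held-out data from the same population.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
    model = LinearRegression().fit(X_train, y_train)
    print("held-out MAE:", mean_absolute_error(y_test, model.predict(X_test)))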

5. Causal: To find out what happens to one variable when you change another.

– Implementation usually requires randomized studies

– There are approaches to inferring causation in non-randomized studies

– Causal models are said to be the “gold standard” for data analysis

– Type of data set applied to: Randomized Trial Data Set – data from a randomized study

Example: Causal Analysis
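
A minimal sketch of the causal step, assuming a simulated randomized trial: because assignment is random, the difference in mean outcomes between treatment and control estimates the causal effect of the intervention.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)

    # Simulated randomized trial: units are randomly assigned to control or treatment,
    # so the difference in mean outcomes estimates the causal effect of the change.
    control = rng.normal(loc=10.0, scale=2.0, size=100)
    treatment = rng.normal(loc=11.5, scale=2.0, size=100)

    effect = treatment.mean() - control.mean()
    t_stat, p_value = stats.ttest_ind(treatment, control)
    print(f"estimated effect = {effect:.2f}, p-value = {p_value:.4f}")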

6. Mechanistic (most amount of effort): Understand the exact changes in variables that lead to changes in other variables for individual objects.

– Incredibly hard to infer, except in simple situations

– Usually modeled by a deterministic set of equations (physical/engineering science)

– Generally the random component of the data is measurement error

– If the equations are known but the parameters are not, they may be inferred with data analysis

– Type of data set applied to: Randomized Trial Data Set – data about all components of the system

Example: Mechanistic Analysis
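
A minimal sketch of the mechanistic step, assuming the governing equation is known (an illustrative exponential decay here) and only its parameters must be inferred from noisy measurements:

    import numpy as np
    from scipy.optimize import curve_fit

    # Known deterministic model (illustrative exponential decay); only its parameters are unknown.
    def decay(t, amplitude, rate):
        return amplitude * np.exp(-rate * t)

    rng = np.random.default_rng(3)
    t = np.linspace(0, 5, 60)

    # Simulated observations: the deterministic signal plus measurement error.
    observed = decay(t, 4.0, 1.3) + rng.normal(scale=0.05, size=t.size)

    # Infer the unknown parameters from the data.
    params, _ = curve_fit(decay, t, observed, p0=(1.0, 1.0))
    print("estimated amplitude and rate:", params)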

Please also check out my dedicated blog on data science.

NewImage Data scientist recruiting can be a challenging task, but not an impossible one. Here are eleven tips that can get you going in the right recruiting direction:

1. Focus recruiting at universities with top-notch computer programming, statistics, and advanced science programs. For example, Stanford, MIT, Berkeley, and Harvard are some of the top schools in the world. Also consider a few other schools with proven strengths in data analytics, such as North Carolina State, UC Santa Cruz, University of Maryland, University of Washington, and UT Austin.

2. Look for recruits in the membership rolls of user groups devoted to data science tools. Two excellent places to start are R User Groups (for an open-source statistical tool favored by data scientists) and Python Interest Groups (for PIGies). Revolutions provides a list of known R User Groups, as well as information about the R community.

3. Search for data scientists on LinkedIn, many of whom have formed formal groups.

4. Hang out with data scientists at Strata, Structure:Data, and Hadoop World conferences and similar gatherings, or at informal data scientist “meet-ups” in your area. The R User Group Meetup Groups page is an excellent source for finding meetings in your particular area.

5. Talk with local venture capitalists (Osage, NewSprings, etc.), who are likely to have received a variety of big data proposals over the past year.

6. Host a competition on Kaggle and/or TopCoder, the analytical and coding competition websites. One of my favorite Kaggle competitions was the Heritage Provider Network Health Prize, which challenged competitors to identify patients who would be admitted to a hospital within the next year using historical claims data.

7. Candidates need to code. Period. So don’t bother with any candidate who doesn’t know some formal language (R, Python, Java, etc.). Coding skills don’t have to be at a world-class level, but they should be good enough to get by (hacker level).

8. The old saying that “we start dying the day we stop learning” is so true of the data science space. Candidates need to have a demonstrable ability to learn about new technologies and methods, since the field of data science is changing exponentially. Have they earned certificates from Coursera‘s Data Science or Machine Learning courses, contributed to open-source projects, or built an online repository of code or data sets (e.g., on Quandl) to share?

9. Make sure a candidate can tell a story with the data sets they are analyzing. It is one thing to do the hard analytical work, but another to provide a coherent narrative about the key insights. Test their ability to communicate with numbers, visually, and verbally.

10. Candidates need to be able to work in the business world. Take a pass on those candidates who get stuck for answers on how their work might apply to your management challenges.

11. Ask candidates about their favorite analysis or insight. Every data scientist should have something in their insights portfolio, applied or academic. Have them break out the laptop (iPad) to walk through their data sets and analyses. It doesn’t matter what the subject is, just that they can walk through the complete data science value chain.

Please check out my dedicated Data Scientist Insights: Exploring The Darkest Places On Earth blog.

Emotiv is staging a revolution in the advertising industry through a simple, yet ingenious device: EPOC – a $300 neuro-computer interface based on electroencephalography (EEG) technology. The effectiveness of advertising spots (broadcast, print, and digital) needs to be quantitatively measured. Period. Gone are the days when you paid 10 people to sit in a room and give you their opinion about an ad. Now, whether one is looking to compare advertising creative platforms, competitively analyze industry advertising, or determine what components of an ad make it good or not so good, direct quantitative measurement is the only proven way to go.

Based on the latest developments in neuro-technology, Emotiv presents a revolutionary personal interface for human-computer interaction. The Emotiv EPOC is a high-resolution, multi-channel, wireless neuroheadset. The EPOC uses a set of 14 sensors plus 2 references to tune into electric signals produced by the brain to detect the user’s thoughts, feelings, and expressions in real time. The EPOC connects wirelessly to common operating systems (Mac OS, Windows, Linux, etc.).

Various types of emotions are currently detected, such as “Excitement”, “Engagement/Boredom”, “Meditation”, and “Frustration.” But this is only the beginning. Through its publicly available SDK (software development kit), data can be streamed and interpreted directly from the EPOC device itself. This means that neuro-marketing services (process, data analysis, methodology, etc.) can be created to measure the broad range of emotional responses to advertising, leading to neuromarketing-based Emotional Advertising Patterns (EAP), which could be the foundation of competitive intellectual property (IP) for an agency.
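
The exact streaming calls differ by SDK version, so the sketch below deliberately avoids the real Emotiv API; it assumes a hypothetical list of per-second affective readings already pulled from the headset and shows how a neuro-marketing service might roll them up into per-segment emotional response scores, the seed of an Emotional Advertising Pattern:

    from collections import defaultdict
    from statistics import mean

    # Hypothetical per-second readings pulled from the headset: (ad_segment, metric, value).
    # Segment labels and metric names are illustrative, not part of the Emotiv SDK.
    readings = [
        ("opening", "excitement", 0.62), ("opening", "engagement", 0.71),
        ("product_shot", "excitement", 0.45), ("product_shot", "frustration", 0.30),
        ("call_to_action", "excitement", 0.80), ("call_to_action", "engagement", 0.77),
    ]

    # Aggregate into a per-segment emotional response profile: mean value per segment and metric.
    pattern = defaultdict(list)
    for segment, metric, value in readings:
        pattern[(segment, metric)].append(value)

    for (segment, metric), values in sorted(pattern.items()):
        print(f"{segment:15s} {metric:12s} {mean(values):.2f}")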

Translated, for a bit of CapEx and OpEx out of your current budget, the next generation of quantitatively-based data-driven advertising can be yours. This just might be the catalyst you need to spearhead a next wave of transformations, leading to more competitive and effective advertising investments.

Moxie Group’s Creative Director Tina Chadwick makes the case that real-time data analytics “brings us tangible facts on how consumers actually react to almost anything.” She makes light of the notion that 10 people in a room, who volunteered to be there because they got paid and fed, could truly represent consumer behaviors (psychographics); that notion is a thing of the past. Sadly though, for many advertising companies, it is still the mainstay of their advertising-oriented evaluative methodology.

New capabilities based on neuroscience, integrating machine learning with human intuition, and data science/big data are leading to new creative processes, which many call NeuroMarketing: the direct measurement of consumer thoughts about advertising through neuroscience. The persuasive effects of an advertising campaign (psychographic response) are contingent upon the emotional alignment of the viewer (targeted demographic); that is, the campaign’s buying call to action has a higher likelihood of succeeding when the viewer has a positive emotional response to the material. Through neuroscience we can now directly measure emotional alignment without inducing a Hawthorne Effect.

This is a new field of marketing research, founded in neuroscience, that studies consumers’ sensorimotor, cognitive, and affective responses to marketing stimuli. It explores how consumers’ brains respond to ads (broadcast, print, digital) and measures how well and how often media engages the areas for attention, emotion, memory, and personal meaning (measures of emotional response). From data science-driven analyses, we can determine:

  • The effectiveness of the ad to cause a marketing call to action (e.g., buy product, inform, etc.)
  • Components of the ad that are most/least effective (Ad Component Analysis) – identifying what elements make an ad great or not so great.
  • Effectiveness of a transcreation process (language and culture migration) used to create advertising in different culturally centric markets.

One of the best and most entertaining case studies I have seen for NeuroMarketing was done by Neuro-Insight, a leader in the application of neuroscience for marketing and advertising. Top Gear used their technology to evaluate which cars attract women to which type of men. The results are pretty amazing.

While NeuroMarketing is an emergent field for advertising creation and evaluation, the fundamentals of neuroscience and data science make this an essential transformational capability. For any advertising agency looking to leapfrog those older, less agile companies that are still anchored in the practices of the 70s, neuromarketing might be worth looking into.

POSTED FROM: Data Science Insights: Exploring The Darkest Places On Earth

Actionable Social Analytics

Social analytics is the discipline of aggregating and analyzing online conversations and social activity generated by brands across social channels. Here’s an infographic framework, from Awareness, for using social analytics for marketing and sales effectiveness. The framework also includes usable basic and advanced measures and metrics for most of the factors (e.g., brand awareness, program effectiveness, etc.).

[Infographic: Actionable Social Analytics]

A common data exploration question came up while talking with a British colleague in the advertising industry on Friday: how many independent subject areas should be investigated (1, 10, 100, …, N) in order to have a statistically significant chance of making a discovery with the least amount of effort? An answer can be found in “The Power of Three (3),” an application of the Knowledge Singularity when N is very small, which defines meaningful returns on knowledge discovery costs.

As I discussed in the field note “FIELD NOTE: What Makes Big Data Big – Some Mathematics Behind Its Quantification,” perfect insight can be gained asymptotically as one systematically approaches the Knowledge Singularity (77 independent subject areas out of an N-dimensional universe where N >> 77). While this convergence on infinite knowledge (insight) is theoretically interesting, it is preceded by a more practical application when N is three (3); that is, when one explores the combinatorial space of only three subjects.

Let Insight 1 (I_1) represent the insights implicit in data set 1 (Ds_1) and Insight 2 (I_2) the insights implicit in data set 2 (Ds_2), where the intersection of data sets 1 and 2 is null (Ds_1 ∩ Ds_2 = {}). Further, let Insight N (I_N) represent the insights implicit in data set N (Ds_N), where the intersection of data set N and all previous data sets is null (Ds_1 ∩ Ds_2 ∩ … ∩ Ds_N = {}). The total insight implicit in all data sets, 1 through N, therefore, is proportional to the insights gained by exploring all combinations of the data sets (from the previous field note). That is,

[Equation: total insight implicit in Ds_1 through Ds_N, expressed in terms of the combinations of the N data sets]

In order to compute a big data ROI, we need to quantify the cost of knowledge discovery. Using current knowledge exploration techniques, the cost of discovering insights in any data set is proportional to the size of the data:

          Discovery Cost (I_N) = Knowledge Exploration[Size of Data Set N]

Therefore, a big data ROI could be measured by:

         Big Data ROI = Total Insights [Ds_1 … Ds_N] / Total Discovery Cost [Ds_1 … Ds_N]

If we assume the explored data sets to be equal in size (which generally is not the case, but does not matter for this analysis), then:

         Discovery Cost (I_1) = Discovery Cost (I_2) = Discovery Cost (I_N)

or

         Total Discovery Cost [Ds_1 U Ds_2 U … U Ds_N] = N x Discovery Cost [Ds], which is O(N), or proportional to N, where Ds is any data set size

Thus,

[Equation: Total Insights as a function of N]

and

[Equation: Big Data ROI as a function of N]

 We can now plot Big Data ROI as a function of N, for small values of N,

[Plot: Big Data ROI as a function of N, for small values of N]

That was fun, but so what? The single biggest ROI in knowledge discovery comes when insights are looked for in and across the very first two combined independent data sets. However, while the total knowledge gained increases exponentially with each additional independent data set added, the return on investment asymptotically approaches a finite limit as N approaches infinity. One can therefore reasonably argue that, given a limited discovery investment (budget), a minimum of two subjects is needed, while three ensures some level of sufficiency.
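
To see the shape of that argument, here is a small sketch that tabulates ROI for small N. The insight curve is a placeholder of my own choosing (no cross-set insight from a single data set, diminishing returns thereafter), not the formula from the field note; the linear cost matches the O(N) discovery cost above.

    def total_insight(n):
        # Placeholder insight curve (my assumption, not the field note's formula):
        # one data set alone yields no cross-set insight, the second adds the largest
        # jump, and each further independent set adds a diminishing amount.
        return 1.0 - 2.0 ** (1 - n)

    def discovery_cost(n, unit_cost=1.0):
        # Cost scales linearly with the number of equally sized data sets (O(N)).
        return n * unit_cost

    for n in range(1, 8):
        print(f"N = {n}: ROI = {total_insight(n) / discovery_cost(n):.3f}")

With this placeholder curve, ROI jumps once a second independent data set is combined, holds through a third, and then slowly declines as each additional set adds cost faster than insight.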

Take the advertising market (McCann, Tag, Goodby, etc.), for example. Significant insight development can be gained by exploring the necessary combination of enterprise data (campaign-specific data) and social data (how the market reacts) – two independent subject areas. However, to gain some level of assurance, or sufficiency, the addition of one more data set, such as IT data (click-throughs, induced hits, etc.), increases the overall ROI without materially increasing the costs.

This combination of at least three independent data sets to ensure insightful sufficiency is what is being called “The Power of Three.” While a bit of a mathematical and statistical journey, this intuitively should make sense. Think about the benefits that come from combining subjects like Psychology, Marketing, and Computer Science. While any one or two is great, all three provide the basis for a compelling ability to cause consumer behavior, not just to report on it (computer science) or correlate around it (computer science and market science).

Here is a very good visual introduction to Big Data that combines a comprehensive technical and business description with fun animated graphics. It dispels the myth that big data is only about data that is big. In fact, we know that big data has three characteristics: Volume, Velocity, and Variety.

 

A recent study of buying behavior based on Facebook-based marketing is a clear demonstration of the psychology of persuasion, specifically social proof. In Robert Cialdini’s seminal work, Influence: The Psychology of Persuasion, he notes the important role that social proof (one of the six pillars of influence) plays in a decision process. This first-of-its-kind study now gives credence, albeit small, to the claim that social tools have a measurable impact on buying.

Last year, ComScore profiled the average spend at Target stores across the general population, compared to the average spend from Facebook fans and friends of fans. The research determined that fans of Target on Facebook were 97% more likely to spend at Target, and friends of fans were 51% more likely than the average population to spend at the retailer (the brand has over 18 million “likes”).

A corresponding study from March used a test and control methodology that aimed to quantify incremental purchase behavior that could be attributed to social media exposure. Research showed that fans who were exposed to Target’s messaging versus fans who weren’t exposed were 19% more likely to buy at Target. Friends of fans who were exposed to the retailer’s messaging were 27% more likely to buy at Target than those not exposed.
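
As a back-of-the-envelope illustration of the test-and-control arithmetic (the counts below are made up; only the lift formula matters), the reported lift is simply the relative increase in purchase rate among the exposed group:

    def lift(exposed_buyers, exposed_total, control_buyers, control_total):
        # Relative increase in purchase rate for the exposed group over the control group.
        exposed_rate = exposed_buyers / exposed_total
        control_rate = control_buyers / control_total
        return exposed_rate / control_rate - 1.0

    # Hypothetical counts, chosen only to show the calculation (not ComScore's data).
    fan_lift = lift(exposed_buyers=1190, exposed_total=10000,
                    control_buyers=1000, control_total=10000)
    print(f"exposed fans were {fan_lift:.0%} more likely to buy")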

While both studies clearly demonstrate that social proof is playing an important, measurable role, the second study is significant. It demonstrates that indirect influence, if properly triggered, can be a cause of buying. This is the kind of study that platform developers have been searching for: one that enables measurable social commerce. But this should not come as a surprise to anybody who has studied Cialdini’s work, as it is only one more field study supporting the cause.

CSC is one of the pioneers in the rapidly growing field of big data. As most of us already know, “big data” is changing dramatically right before our eyes – from the amount of data being produced to the way in which it’s structured (or not) and used. One million times as much data is lost each day as is consumed. This trend of big data growth presents enormous challenges, but it also presents incredible business opportunities (monetization of data). This big data growth infographic helps you visualize some of the latest trends.

Data science is changing the way we look at business, innovation, and intuition. It challenges our subconscious decisions, helps us find patterns, and empowers us to ask better questions. Hear from thought leaders at the forefront, including Growth Science, IBM, Intel, Inside-BigData.com, and the National Center for Supercomputing Applications. This video is an excellent source of information for those who have struggled to understand data science and its value.