Data questions and data analysis:
The journey from raw data to actionable knowledge takes us through various stages, each with a different kind of question we can ask of and analysis we can perform on that data, ranging from the simplest through to the very complex. But it will always begin with the question.
Descriptive analysis
This is the simplest kind of analysis: it describes some event or data set in order to get a better understanding of what has happened (descriptive step) and why it happened (interpretive step). This kind of analysis won't be used to generate decisions, as it's difficult to generalise from descriptive data, but is used instead to support understanding.
The most obvious example of descriptive data analysis in a business setting is customer demographic data (age, gender, where they live, etc.). The Google Ngram Viewer is another example: it tracks the incidence of searched-for terms across all available books in a specified date range.
Beautifully, Google makes all its ngram data available for download here, so we can run our own analyses if we wish.
Exploratory analysis
This is an investigative, broad-brush look at the data to try to discover connections or relationships which may exist but weren't previously obvious. This is often done on little more than an intuition or a hunch. The outcome isn't usually conclusive, so it isn't useful or accurate enough to make any predictions or decisions from, but it does give a good steer for the direction of future, more detailed analysis.
An important note to remember, as mentioned in a previous post, is that correlation does not prove causation. When we come to look for a causal relationship, a connection or correlation discovered here does not in itself indicate a clear causal relationship. Two further requirements – temporal antecedence and no third variable causing the change in the variables under review – must also be met. Further analysis would be required in order to confirm or deny a causal relationship.
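As a minimal sketch of what that broad-brush first pass might look like in practice, the snippet below computes a correlation matrix over a small, entirely invented customer data set (the column names and figures are illustrative only):

```python
# Exploratory pass over a small, invented customer data set: look for
# pairwise correlations worth investigating further.
import pandas as pd

df = pd.DataFrame({
    "age":           [23, 34, 45, 52, 29, 61, 38, 47],
    "monthly_spend": [120, 210, 340, 400, 150, 480, 260, 350],
    "site_visits":   [4, 7, 9, 11, 5, 13, 8, 10],
})

# Pearson correlation matrix: a quick, broad-brush look for relationships.
print(df.corr())

# A strong correlation here is a lead for further analysis, not proof of
# causation: temporal ordering and possible third variables still need checking.
```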
Inferential analysis
Inferential analysis extrapolates or otherwise generalises and draws conclusions about a larger population from a smaller data set. It will most likely employ statistical models (inferential statistics) in order to deal with uncertainty about both the data and the generalisations resulting from its analysis. The sampling methodology employed is an important aspect of this approach.
An example application in a business setting could include looking at a sub-set of customers and how they respond to a particular marketing campaign or product offering, and making an inference about how that campaign or offering would fare if rolled out to a wider audience. This is particularly useful when undertaking a market test for a potential new product. Restricting the pool marketed to keeps costs to a minimum, while setting the sample population size large enough for the inferred results to be statistically significant allows for a meaningful (yet cost-effective) market test to be conducted.
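As a rough sketch of that inference step, the snippet below estimates a population response rate from a hypothetical sampled campaign and puts a 95% confidence interval around it (all counts are made up for illustration):

```python
# Estimate the population response rate from a sampled campaign, with a
# 95% confidence interval. All counts are invented for illustration.
from math import sqrt
from scipy.stats import norm

n_sampled = 2000      # customers who received the test offer
n_responded = 130     # of whom this many responded

p_hat = n_responded / n_sampled                # sample response rate
se = sqrt(p_hat * (1 - p_hat) / n_sampled)     # standard error of the proportion
z = norm.ppf(0.975)                            # 95% two-sided critical value

low, high = p_hat - z * se, p_hat + z * se
print(f"Estimated response rate: {p_hat:.1%} (95% CI {low:.1%} to {high:.1%})")
```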
Predictive analysis
This type of analysis is similar to inferential analysis but goes further and is a degree more complex. Here the goal is to analyse a data set in order to make a prediction about what is likely to happen with another (usually similar) data set, or with the same data set but in a future (usually similar) scenario.
One point to remember is that – similar to correlation not being the same as causation – prediction is not the same as causation. Simple predictive models with large data sets will tend to be the most effective, and care should be taken to ensure the correct variables are being measured. The benefit is that (with access to sufficient data) they can be heavily targeted to a given individual. Examples of highly individualised predictive analysis at work include Amazon’s personalised recommender engine, or the customer purchase targeting model employed by Target.
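A toy illustration of the idea (not Amazon's or Target's actual models, which are far more sophisticated): the sketch below fits a simple logistic regression on an invented data set of customer age and visit counts, then predicts the purchase probability for a new individual. The feature names and figures are assumptions for the example.

```python
# A toy predictive model: will a customer buy, given age and prior visits?
# Data and feature names are invented purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[25, 2], [34, 5], [45, 9], [52, 12],
              [29, 3], [61, 15], [38, 7], [47, 10]])
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])   # 1 = purchased, 0 = did not

model = LogisticRegression().fit(X, y)

# Predicted purchase probability for a new, unseen individual.
new_customer = np.array([[40, 8]])
print(model.predict_proba(new_customer)[0, 1])
```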
Causal analysis
Here the objective is to identify and prove a causal relationship between two variables under review, i.e. a change in one variable will lead directly to a change in another variable. As we saw above (and previously), three criteria must be met in order for there to be a known causal relationship between two variables (let's call them x and y). Firstly, there has to be some kind of correlation or connection seen between x and y. Secondly, there must be temporal antecedence, that is, x happens first and y happens afterwards. Thirdly, there cannot be another, third variable which is causing both x and y to change.
This type of analysis is another level of complexity up from predictive analysis, using randomised studies or control studies to get the best results. Results will tend to be expressed as averages which apply across a whole population, rather than to specific individuals in that population (as predictive analysis can).
An example of causal analysis within business may be to conduct a price variance trial to note the effect on responses and hence sales. Another could be testing the conversion rate of a new sales letter vs. the current control sales letter. To be effective the trial requires randomised selection of the target sample and comparison against a similarly randomised control sample, with all other elements (variables) of the trial remaining equal for both groups.
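A minimal sketch of how the results of such a sales-letter trial might be analysed, assuming invented conversion counts and equal-sized randomised groups, here using a chi-squared test of independence:

```python
# Compare conversion rates of the new letter vs. the control letter from a
# randomised trial, using a chi-squared test of independence.
# The counts below are invented.
from scipy.stats import chi2_contingency

#            converted, not converted
observed = [[48, 952],   # new sales letter (n = 1000)
            [30, 970]]   # control letter   (n = 1000)

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"p-value = {p_value:.4f}")  # a small p-value suggests a genuine difference
```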
Mechanistic analysis
A less common and more complex form of analysis which allows a greater understanding of how changes in one set of variables can lead to changes in another set of variables. This can usually only be done in very simple situations or where a highly deterministic mathematical model is at play. Used primarily within physical science or engineering applications, and most likely to be employed where all variables are known and the only randomness is likely to come from some measurement error.
Data definitions:
Within any given population or set of objects/items you’re interested in exploring or analysing, data are the values of the variables existing within that population. Values need not be numerical/quantitative; they could just as equally be qualitative.
Variables are, as the name suggests, objects or items of interest which vary by some amount over some given range. Again these can be numerical, quantitative and continuous, e.g. customer sales over time, salary by age, air pressure with altitude, or they might be qualitative, unordered, discrete 'things' such as which web pages a customer views on a particular visit. Variables are the things that get measured, the measurements taken becoming the data points.
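To make the distinction concrete, here is a tiny, invented example of a few observations holding both quantitative and qualitative variables (the column names are purely illustrative):

```python
# A handful of observations mixing quantitative and qualitative variables.
# Column names and values are purely illustrative.
import pandas as pd

observations = pd.DataFrame({
    "customer_age":  [23, 34, 45],                    # quantitative, numeric
    "monthly_spend": [120.0, 210.5, 340.0],           # quantitative, continuous
    "pages_viewed":  [["home", "pricing"],            # qualitative, unordered, discrete
                      ["blog"],
                      ["home", "blog", "contact"]],
})

print(observations)
print(observations.dtypes)   # the measurements taken are the data points
```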
Question first, data second:
The exact data set, the variables and so on will all be driven by (be secondary to) the question to be answered, i.e. what needs to be found out. However, often the data that is available (or the format it's available in) will limit or modify the question that eventually gets asked.
So the data should always be enabled or driven by the kinds of questions which need to be asked. In a business environment, this will be driven in turn by the strategic and managerial needs of the business: what goals the company is trying to achieve and hence what kinds of decisions it's likely to have to make to get there.
Big data: how big is big, and do we really need it?
There's a lot of discrepancy as to exactly what constitutes 'big' data – and of course it's all in the eye of the beholder and how much computing power (and storage capacity) they have access to. Some say that big data is a data set so large that it can't be processed on your local desktop or laptop, so that off-site storage and manipulation capabilities need to be deployed to analyse it. Generally these days this means using some cloud functionality to access the required hardware capacity, unless you're an IBM, Google or Facebook with access to the necessary hardware at scale.
Others argue that big data refers to the exponentially increasing volume of data being produced due to the advent of the internet and particularly social media – although there remains the question around just how useful all that ‘data’ actually is.
As hardware costs and sizes fall and storage and processing capacities increase, the extent of data which can be easily and usefully manipulated and analysed by users has grown. We can now collect, store and analyse more data, more cheaply than at any other time. This in itself can lead to new opportunities for forward-facing businesses to leverage that data to gain competitive advantage.
Nevertheless, it’s worth keeping in mind that many problems which need to be solved for a specific purpose will not require such large data sets at all in order to draw useful, meaningful and actionable conclusions. And that is the primary purpose to which data science and data analysis applies itself.
Experimental design: turning data into knowledge:
It is critical to pay attention to both the design of the analysis and the way in which it has been conducted. This is to ensure the analysis is accurate, reproducible, and not prone to error or liable to produce misleading results. This is true of every stage of the analysis process, from choosing and cleaning the data, through to analysing it and reporting the findings and conclusions.
A key part of this is getting reproducible results, and this comes back to the ‘science’ aspect of data science. Sharing the methods and the programming code can be done through GitHub. Data sets can be shared through GitHub too if they are small, but a better way to share larger data sets (associated with scientific or academic research) is through Figshare. (A good process for sharing raw data sets with others can be found on GitHub here.)
An example of the process for turning data into knowledge looks like this:
A. Inferential Analysis
1. Formulate the question
This should be specific and well-bounded so it can be answered in a scientific and rigorous manner.
2. Define the method or test to be employed
This may involve, say, an A/B test, which would mean identifying the population to be sampled and selecting a small sub-set of it, possibly using some kind of probability argument. Be sure not to fall into analysis traps such as:
– Confounding: a relationship is spotted between variables A and B, so it is assumed A has caused B, whereas in fact it could be far more likely that C has caused both A and B. There really is no true causal relationship between A and B at all; it just appears to be so. In this case C is said to be the 'confounder' for the relationship between A and B (simulated in the sketch just after this list);
– Spurious correlation: Mistaking correlation for causation, i.e. the observed correlation may not be an actual relationship but due to some other unseen or untested factor
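The confounding trap is easy to demonstrate with simulated data: in the sketch below an invented variable C drives both A and B, so A and B correlate strongly even though neither causes the other.

```python
# Simulated confounding: C drives both A and B, so A and B correlate strongly
# even though neither causes the other. Numbers are purely illustrative.
import numpy as np

rng = np.random.default_rng(42)
c = rng.normal(size=5000)             # the confounder
a = 2 * c + rng.normal(size=5000)     # A depends on C, not on B
b = -3 * c + rng.normal(size=5000)    # B depends on C, not on A

print(np.corrcoef(a, b)[0, 1])        # strongly negative, despite no A -> B link
```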
3. Deal with any potential confounders
This is done through the use of one or other of various techniques:
– Fixing: fixing all variables so they never change, and only changing the one (or two if A/B testing) variable(s) being tested;
– Blocking: designing the test so that the effect of a known potential confounder is minimised or removed, for example by grouping similar individuals into blocks and applying each treatment within every block;
– Stratifying: if it's not possible to remove a third variable, stratify or layer the sampling so that the third variable is represented equally across the groups being compared;
– Randomisation: randomly assign individuals to the different sample groups to remove any possible bias in the results caused by the effects of a confounding variable (a minimal sketch of randomised assignment follows below).
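As a minimal sketch of that randomisation step, assuming a made-up list of customer IDs being split into two equal A/B groups:

```python
# Randomly assign a made-up list of customers to two equal A/B groups, so any
# confounding characteristics are spread evenly across both.
import random

customers = [f"cust_{i:03d}" for i in range(1, 101)]

random.seed(7)              # fixed seed only so the example is repeatable
random.shuffle(customers)

group_a = customers[:50]    # e.g. receives the new offer
group_b = customers[50:]    # e.g. receives the control offer
print(len(group_a), len(group_b))
```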
4. Collect or generate an appropriate data sample
Apply the test to the identified sub-set of the total population, say on a random basis for an A/B Test, and collect the resultant sample data.
5. Analyse the sample data using relevant analytical methods
For example, descriptive analysis followed by inferential statistical analysis to establish whether the results achieved from the data sample can be generalised to the whole (un-sampled) population. This rests on the statistical significance of the sampled result, which in turn depends on how much variability there is in the spread of results – whether they are tightly clustered or widely dispersed – and how clear the difference is between the two results being tracked.
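For example (with invented figures), a two-sample t-test weighs the difference between the two group means against the variability within each group – a rough sketch of that inferential step:

```python
# Two-sample t-test on invented per-customer spend figures for two groups.
from scipy.stats import ttest_ind

group_a_spend = [120, 135, 128, 150, 142, 131, 138, 145]  # e.g. saw variant A
group_b_spend = [110, 118, 125, 121, 115, 119, 123, 117]  # e.g. saw variant B

t_stat, p_value = ttest_ind(group_a_spend, group_b_spend)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # small p suggests a real difference
```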
6. Check accuracy of analysis and inferences
7. Report findings, conclusions and recommendations
B. Predictive Analysis
Here we wish to develop a predictive model which allows us to know (with a certain level of probability), for any future individual applied to the test case, what the likely outcome will be. The method for designing the test, sampling the population and analysing the results will therefore be different from that shown above.
From the population, take a sample based on some kind of probability argument, and apply the relevant test case to all sampled individuals. Based on some behaviour or some other observation, these sampled individuals are split into two training sets. One is the set of individuals which DID perform some kind of behaviour or meet some criteria, and the other is the set of individuals which did NOT.
By analysing the data from these two training sets, we build a predictive function which we can then apply to a new individual in order to predict how they might behave or what criteria they might meet. All other aspects of the process remain as above, and there is still the need to design the test such that any confounding variables are dealt with.
Predictive analysis will often prove more difficult than inferential analysis. Looking at the distribution curve of all observed values, the two possibilities will need to be very differently distributed (with mean averages far apart) in order to draw a solid conclusion that, for any given observed value, there is a high probability that it falls under only one distribution curve, and hence came from one training set and not the other. If the means of the two training sets are close together, however, it will be very difficult to draw such a clear conclusion, since an observed value could just as easily have come from either one of the two possible distributions.
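The effect of that separation can be sketched with two assumed normal distributions: for a given observed value, compare the probability that it came from one training set rather than the other when the means are far apart versus close together. The means and value below are invented for illustration.

```python
# For an observed value, how confidently can we say which training set it came
# from? Assumes two equal-sized, normally distributed training sets.
from scipy.stats import norm

def prob_from_set1(x, mean1, mean2, sd=1.0):
    """P(value came from set 1) under the equal-prior, normal assumption."""
    d1 = norm.pdf(x, loc=mean1, scale=sd)
    d2 = norm.pdf(x, loc=mean2, scale=sd)
    return d1 / (d1 + d2)

x = 4.0
print(prob_from_set1(x, mean1=5.0, mean2=0.0))   # means far apart: near-certain
print(prob_from_set1(x, mean1=5.0, mean2=4.5))   # means close: far more ambiguous
```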
When assessing the accuracy of a predictive analytical test, there are some key quantities which need to be considered. These are described with reference to a matrix of actual situation vs. tested result. If the tested result matches the actual situation this is termed a True Positive (Test says True and Actual is True) or a True Negative (Test says False and Actual is False); a mismatch is known as a False Positive (Test says True BUT Actual is False) or a False Negative (Test says False BUT Actual is True).
The various points on the 2×2 matrix (TP, FP, FN, TN) can be quantified in terms of a probability of that outcome occurring:
- Sensitivity = probability that IF Actual is True that Test will ALSO be True = P(test says true | actual true)
- Specificity = probability that IF Actual is False that Test will ALSO be False = P(test says false | actual false)
- Positive prediction value = probability that Actual is True GIVEN THAT Test says True = P(actual true | test says true)
- Negative prediction value = probability that Actual is False GIVEN THAT Test says False = P(actual false | test says false)
- Accuracy = probability that Test is same as Actual (TP or TN)
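All of these quantities fall straight out of the four cells of the matrix; the sketch below computes them from invented counts:

```python
# Sensitivity, specificity, positive/negative prediction value and accuracy,
# computed directly from the four cells of the 2x2 matrix. Counts are invented.
TP, FP, FN, TN = 80, 20, 10, 890

sensitivity = TP / (TP + FN)              # P(test says true  | actual true)
specificity = TN / (TN + FP)              # P(test says false | actual false)
ppv         = TP / (TP + FP)              # P(actual true  | test says true)
npv         = TN / (TN + FN)              # P(actual false | test says false)
accuracy    = (TP + TN) / (TP + FP + FN + TN)

print(sensitivity, specificity, ppv, npv, accuracy)
```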
A final word of warning!
Beware of 'data dredging' – poring over and over the data looking for an outcome which looks like it must be true because it shows statistical significance (and you've exhausted all other possibilities), without ever having had a clear purpose, intent or hypothesis for the analysis before you began.
Here’s a summary of what we’re aiming for:
Good experiments and data analysis methodologies:
- Specify at the outset what signal/observation/criterion is being looked for
- Include replication, to allow measurement of variability
- Measure the variability and compare it to the signal/observation/criterion being looked for
- Make generalisations back to the whole population in relation to the question being asked/problem being tested for
- Are transparent (in programming and in data) and replicable by others
- May use inference or prediction, dependent on need/application
- Don’t just go data dredging