# Data Analysis in market research: working with sample data

Margin of error, confidence level and significance test simplified for analysts with no background in Statistics.

## Hard data vs. sample data

I began my Data Analyst career 7 years ago dealing with **hard data**: sales figures, number of products, cost of manufacturing, etc.

After taking a forced break from the industry and transitioning back into it 4 years later, I found myself working with **sample data** for the first time ever. And suddenly I realised I needed to brush up on my knowledge of statistics.

No, Data Analysts do not need to be masters of statistics in order to perform our job, even if we work within market research.

However, working with sample data does mean that you need to have some **notions of statistics in order to bring confidence to your insights**.

That’s why I decided to write an article pulling together basic notions of statistics that you will need when analysing sample data. But, first, let’s clarify a few things:

**What this article IS NOT about:**

- How to calculate things using statistical equations that will make you consider a career change.
- How to pull together a nice sample group for your market research project: Data Analysts are not responsible for this part so we’ll skip it.
- Displaying all the extensive knowledge I have on statistics: simply because I don’t have it nor do I need it in order to perform my job.

**What this article IS about:**

- Explaining what’s behind some statistical terms and what can impact them.
- How to generate insight using sample data that you are confident is not based on coincidences and that your client can use to make business decisions.
- Statistics for dummies: just enough statistics knowledge for Data Analysts to do their job within market research.

## What makes sample data so special?

Let’s start by introducing the example that we will use throughout the article:

Our client is a shoe brand with several stores around the world. They developed a new pair of trainers with innovative features and they’re keen to know what consumers think of it and how they can improve it. They decide to create a survey with a few questions, which is sent to shoppers after they purchase the trainers.

In an ideal world, every single shopper would reply to the survey and the client would end up with **hard, factual data**. However, this is logistically impossible. Some people will not feel like answering the survey or perhaps they’re too busy to do it or they’ll have data privacy concerns. Maybe there could be budget limitations too.

Unless you constrain all shoppers to answer the survey, which would be a bad business decision to make, you will end up with only part of the surveys being answered, that is, you will have the **results from a sample of shoppers.**

So, if we’re going to analyse these answers, **we’d want to make sure they represent the opinion of all the shoppers as closely as possible, right**? After all, the client will rely on this analysis to change features of the trainers, invest in marketing campaigns, etc. They want reliable results!

**This is what makes sample data special:** **coincidence or sampling error can influence the results and it’s up to the Data Analyst to identify which results represent the entire group that was sampled as closely as possible**.

Is it possible to be 100% sure that the result obtained would be the same as if all the purchasers had been surveyed?

No, but **you can predict how much the result could have varied if the survey had been answered by all the shoppers.** This is just one of the magical things statistics can do and I’ll show you how but, first, we need to understand some concepts.

## Showing off your wizardry as a Data Analyst

Ok, let’s dive into our example a bit more:

For the 1st round of surveys, 285 shoppers answered the questions. One of the questions asks “Would you recommend this pair of trainers to others?” with “Yes” or “No” answers. The result was that 61% of the 285 shoppers would recommend it to others.

So, you present this result to the client. They are super happy with it as it’s above their target. But then, they ask you:

“Hold on, these trainers were purchased by 2,393 shoppers. How do I know if 61% of all the shoppers think that way? What if this percentage went down if we surveyed the rest of the shoppers?”

You can’t say for sure that 61% of all the shoppers would have answered “Yes” to this question. The only way to know that would be to actually survey every single shopper.

**However, you can say this:**

“If we surveyed 90% of the shoppers, the result would fall within a variation of plus or minus 2.9 percentage points.”

**Isn’t that a cool statement to make?! Not only would the client think you’re some sort of wizard but they’d also feel much more confident to make investments based on that result, right?**

But, do you need a crystal ball to reach that result? No, you just need to understand these 2 concepts: **confidence level and margin of error.**

**1) Confidence Level**

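When we tell the client **“if we surveyed 90% of the shoppers”**, that 90% is our **confidence level**. The phrasing is a simplification: strictly speaking, it means that if we repeated the survey many times with different random samples of shoppers, about 90% of the ranges we calculated would contain the true result for all shoppers.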

If we upped our confidence level to 95%, based on the sample sizes above, the variation of our result would be higher at 3.5 percentage points.

This is because, if we want to take a higher % of the shoppers, we need to consider results that deviate more from the average.

Let’s take a look at the chart below:

This is a normal distribution chart. **But don’t panic!** Take a deep breath and consider the points below:

- The bell chart shows the probability of a value taking place. The higher the line, the more probable it is.
- The centre of the bell is the average/mean. You can see that most values sit close to the mean as the bell is higher in the centre. And that makes sense, right? For example, if the average height in the UK is 5'9" then most people would be around 5'9" tall.
- Values that deviate more from the mean/average are further away from the centre and you can see that there’s a lower probability of them happening (lower sides of the bell). In fact, if the average height in the UK is 5'9", there’s a lower probability that I’ll find someone who is 6'6" tall.

So, for the same sample size, if you want to give a margin that represents a higher percentage of the population, your result will inevitably include more outliers.
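If you’d like to see this trade-off with your own eyes, here is a minimal Python sketch. It assumes, purely for illustration, that exactly 61% of all 2,393 shoppers would answer “Yes”, then simulates 2,000 surveys of 285 randomly chosen shoppers each. The “true” 61%, the number of simulated surveys, the seed and the function names are all assumptions made for this example, not part of the client’s data.

```python
import random


def simulate_survey_results(population_size, true_yes_share, sample_size, rounds, seed=42):
    """Draw `rounds` random samples and return the sorted share of "Yes" answers in each."""
    rng = random.Random(seed)
    yes_count = round(population_size * true_yes_share)
    # 1 = shopper would say "Yes", 0 = shopper would say "No"
    population = [1] * yes_count + [0] * (population_size - yes_count)
    shares = []
    for _ in range(rounds):
        sample = rng.sample(population, sample_size)  # sampling without replacement
        shares.append(sum(sample) / sample_size)
    return sorted(shares)


def central_range(sorted_values, coverage):
    """Return the (low, high) values enclosing the central `coverage` share of results."""
    tail = (1 - coverage) / 2
    low_index = round(tail * len(sorted_values))
    high_index = round((1 - tail) * len(sorted_values)) - 1
    return sorted_values[low_index], sorted_values[high_index]


# Hypothetical scenario: 61% of all 2,393 shoppers would say "Yes".
results = simulate_survey_results(2393, 0.61, 285, rounds=2000)
for coverage in (0.90, 0.95):
    low, high = central_range(results, coverage)
    print(f"{coverage:.0%} of simulated surveys landed between {low:.1%} and {high:.1%}")
```

When you run it, the range that captures 95% of the simulated surveys should come out wider than the one that captures 90%: to cover more of the possible outcomes, you have to allow for results that sit further from the centre.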

**2) Margin of error:**

Going back to our magical statement, the margin of error is: **“the result would fall within a variation of plus or minus 2.9 percentage points”**

Based on the confidence level we choose to work with, we can tell the client what the variation of the result could be if we surveyed a higher % of the population of shoppers who purchased the trainers.

There’s a formula to calculate the margin of error which can be easily found online and reproduced on Excel. I won’t include it here because I’m more interested in explaining how it works and what affects it.

Your margin of error will depend on:

- **Sample size:** The larger your sample, the narrower your margin of error will be for the same confidence level, as you’ll be surveying a higher % of the actual population.
- **Confidence level:** As I explained, it’s the % of the population you want to represent. The higher the level you want, the wider your margin of error will be for the same sample size, because you’ll need to consider more outliers.
- **Share of the population surveyed:** this is the proportion of people you surveyed out of the entire population you’d like to analyse. In our example, it’s the 285 shoppers divided by the total of 2,393 shoppers who purchased the trainers. (Online calculators usually ask for the population size instead; where they ask for a “population proportion”, they generally mean the expected share of people giving a particular answer.)
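For readers who prefer code to spreadsheets, here is a rough Python sketch of how these factors interact. It assumes the standard margin-of-error formula for a percentage with an optional finite population correction, and it plugs in the observed 61% as the share of “Yes” answers. Those choices, along with the function and parameter names, are assumptions made for illustration; online calculators may make different ones (for example a worst-case 50% share, or no population correction), so their output can differ from this sketch.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(share, sample_size, confidence_level, population_size=None):
    """Margin of error for a survey percentage, as a proportion (0.05 = 5 points)."""
    # Z-score for a two-sided confidence level, e.g. 0.90 -> about 1.645
    z_score = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    standard_error = sqrt(share * (1 - share) / sample_size)
    if population_size is not None:
        # Finite population correction: shrinks the margin when the sample
        # is a sizeable part of the whole population.
        standard_error *= sqrt((population_size - sample_size) / (population_size - 1))
    return z_score * standard_error


# Inputs from the trainers example: 61% "Yes", 285 answers, 2,393 buyers.
for level in (0.90, 0.95):
    moe = margin_of_error(0.61, 285, level, population_size=2393)
    print(f"{level:.0%} confidence: +/- {moe * 100:.1f} percentage points")
```

Try changing the sample size or the confidence level and watch the margin narrow or widen accordingly.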

**What is an acceptable margin of error and confidence level?**

Analysts tend to favour smaller margins of error in order to give a more precise range. When a company has access to large sample sizes, they can afford to go for a confidence level of 95%.

However, when companies don’t have access to substantial sample sizes due to the nature of what they are researching, they can choose to go for a 90% confidence level in order to still have a narrower and more precise margin of error.

Below, you can see how increasing your confidence level while keeping the sample size intact can increase the number of units of deviation from the mean — also known as Z-score — you need to allow in your margin.
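If you want to reproduce these Z-scores yourself, the sketch below uses Python’s standard library to convert a two-sided confidence level into its Z-score. The function name and the list of confidence levels are illustrative choices.

```python
from statistics import NormalDist


def z_score_for_confidence(confidence_level):
    """Z-score that leaves (1 - confidence_level) split equally between both tails."""
    return NormalDist().inv_cdf(0.5 + confidence_level / 2)


for level in (0.80, 0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence level -> Z-score {z_score_for_confidence(level):.3f}")
```

The 90% level should come out at roughly 1.645 and the 95% level at roughly 1.96, which are the two levels discussed in this article.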

So, in a nutshell, the margin of error is how much a result can vary according to the size of our sample and the % of the population we want to represent.

**But what if we want to compare two values and check if the difference is likely to happen again if a higher % of the population is surveyed?**

## Testing if a difference is significant enough to be highlighted

Let’s go back to our shoe brand client:

We presented the results of the 1st round of surveys. They trusted our results thanks to our low margin of error and high confidence level, so they went away and took measures based on our analysis.

After 6 months, the client ordered a 2nd round of surveys. This time, they surveyed 341 shoppers and they kept the same questions.

In this new round of surveys, 67% of shoppers answered “Yes” to the question “Would you recommend this pair of trainers to others?”.

Fantastic! Considering the previous result was 61% we can buy some champagne and celebrate with the client because their result improved, right?

“Well, this looks positive, but how do we know this isn’t a coincidence? We didn’t survey the same people twice.”

Saying it’s just your gut feeling will send you and your bottle of champagne right back home. We need a stronger statement to show that our previous analysis led them to make wise business decisions.

How about:

“I can attest that this result increased significantly and if we did survey rounds over and over again, the increase would be seen again 90% of the time.”

When working with sample data, the rule of thumb is to only make strong assertions around the difference between two results when it is significant, that is, **when it passes the significance test (Z-test).**

**The significance test**

The Z-test is a type of significance test that takes in two percentages from two different sample sizes and checks **if the difference between them would happen again according to the chosen confidence level**.

The outcome of the Z-test is a coefficient known as the Z-Score. Remember that we mentioned it before when talking about confidence levels? The Z-Score is the number of units of standard deviation between a result and the average of the general population.

So far, we’ve only seen it in the context of one sample group. But with the Z-test, we can also generate a Z-score that represents the difference between two percentages. We can then compare it to the Z-score of our confidence level and determine whether the increase/decrease is significant or not.

In our example, when comparing 61% out of a sample of 285 and 67% from 341 surveys, the resulting Z-score is 1.56. That is below the Z-score for the confidence level we chose to follow (90% — 1.645). **Therefore, the difference is not significant.**

This means that: **if we repeated this experiment over and over again, we CANNOT affirm that the difference would present itself at least 90% of the time.**

Therefore, save the champagne for later because this result isn’t reliable enough for us to draw strong conclusions from it. In a report, we would not be able to say the result **increased** because we’re not confident enough about its repeatability.
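Here is how that comparison could look in Python. This is a sketch that assumes the pooled two-proportion version of the Z-test; the function name is made up for the example, and the inputs are the two survey rounds described above.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_score(p1, n1, p2, n2):
    """Z-score for the difference between two survey percentages (pooled standard error)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return abs(p1 - p2) / standard_error


# Round 1: 61% of 285 shoppers. Round 2: 67% of 341 shoppers.
z_score = two_proportion_z_score(0.61, 285, 0.67, 341)
threshold = NormalDist().inv_cdf(0.5 + 0.90 / 2)  # about 1.645 for a 90% confidence level

print(f"Z-score: {z_score:.2f}, threshold: {threshold:.3f}")
print("Significant" if z_score > threshold else "Not significant")
```

With these inputs the Z-score comes out at about 1.56, below the 1.645 threshold, so the script reports the difference as not significant.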

**Requirements for the Z-test:**

First and foremost, you need to have at least 30 respondents within each one of the sample groups. This is because the test assumes a normal distribution, and larger samples tend to be closer to it.

For sample groups smaller than 30, there are other tests that can be performed.

In my job, we only test samples above 30, and anything below that is just “reference” values and we don’t draw any conclusions from them.

Finally, you cannot test a difference if one of the results is 0%.

**How to set up a Z-test in Excel:**

Excel actually has a built-in functionality that allows you to run a Z-test between 2 groups. You can learn more about it here.

Personally, I like to automate my significance test so I can see straight away if a difference is significant or not. In order to do that, I use the formula below to arrive at the Z-score between two percentages:

=ABS(P1-P2)/SQRT((P1*S1+P2*S2)/(S1+S2)*(1-(P1*S1+P2*S2)/(S1+S2))*(1/S1+1/S2))

P1: % from group 1

P2: % from group 2

S1: Sample size of group 1

S2: Sample size of group 2
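If your results live in Python rather than Excel, the same automation could be sketched as below. It applies the pooled formula above, plus the two requirements mentioned earlier (at least 30 responses per group, no 0% results). The labels, the constant names and the second and third rows are hypothetical, added only to show the different outcomes.

```python
from math import sqrt

MIN_SAMPLE = 30         # rule of thumb: groups below this are reference values only
Z_THRESHOLD_90 = 1.645  # Z-score for a 90% confidence level


def significance_flag(p1, s1, p2, s2, threshold=Z_THRESHOLD_90):
    """Label the difference between two percentages (given as decimals, e.g. 0.61)."""
    if min(s1, s2) < MIN_SAMPLE:
        return "reference only"
    if p1 == 0 or p2 == 0:
        return "not testable"
    pooled = (p1 * s1 + p2 * s2) / (s1 + s2)
    variance = pooled * (1 - pooled) * (1 / s1 + 1 / s2)
    if variance == 0:  # e.g. both results are 100%
        return "not testable"
    z_score = abs(p1 - p2) / sqrt(variance)
    return "significant" if z_score > threshold else "not significant"


# Rows: (question, % round 1, sample 1, % round 2, sample 2)
rows = [
    ("Would recommend", 0.61, 285, 0.67, 341),
    ("Hypothetical question B", 0.40, 285, 0.52, 341),
    ("Hypothetical question C", 0.55, 22, 0.70, 341),
]
for question, p1, s1, p2, s2 in rows:
    print(f"{question}: {significance_flag(p1, s1, p2, s2)}")
```

Passing a different `threshold` (for example 1.96 for a 95% confidence level) lets you tighten the test without touching the rest of the logic.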

If you want to know the statistics behind a Z-test, you can check this article by Egor Howell.

From a Data Analyst perspective, you don’t need to understand every aspect of it, but you just need to be aware of what the Z-Score is, why a result is significant once it’s above a certain threshold, and what it means in your analysis.

*Phew! I’ve been wanting to write this article for a long time now but I knew it was going to be a challenge to make statistical concepts more palatable to Data Analysts with no background in Statistics.*

*I hope this was a good enough attempt, though, and I can’t wait to hear your thoughts!*