CONFIDENCE INTERVALS (Concept of error correction for quantitative user research)

Naveen Bodapati
May 2, 2020

Why Data?

UX designers have numerous methods to improve their designs, such as user interviews, focus groups, diary studies, personas, storyboards, task analysis, and customer journey maps. Many of these methods are intuitive and powerful because they reveal a lot about user needs and stories, and they generate a lot of data to analyze.

Types of Data

Nominal Data

Nominal data are groups or categories that are not ordered by numbers.

Ex: Comparing the performance between different user groups, such as “male vs. female” or “frequent user vs. non-frequent user”.

Ordinal Data

Ordinal data are ordered groups or categories.

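Ex: When you ask users how often they use your website by choosing from “very often”, “often”, “sometimes” and “rarely”, the acquired data are ordinal. Because the gaps between categories are not necessarily equal, an arithmetic mean of the ranks is not meaningful; summarize such data with frequencies, medians, or rank-based (non-parametric) methods instead.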

Interval Data

Interval data are continuous data where the distances between each value are meaningful but there are no true zero points.

Ex: Temperature data in Fahrenheit or Celsius. The distance between 10 degrees and 20 degrees is meaningful, but the zero point for temperature is arbitrary. In UX research, subjective rating data are often treated as interval data.

Ratio Data

Ratio data are almost the same as interval data, but they have a true zero point.

Ex: When you measure task completion time, the acquired data are ratio data because there is an absolute zero point. Other examples of ratio data include age, height, weight and number of tasks completed.
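As a rough illustration of how the data type decides which summary makes sense, here is a minimal Python sketch. All the responses in it are hypothetical values invented for this example, not data from any study.

```python
from collections import Counter
from statistics import mean

# Hypothetical responses, invented purely for illustration.
user_group = ["frequent", "non-frequent", "frequent", "frequent"]   # nominal
usage = ["rarely", "often", "sometimes", "very often", "often"]     # ordinal
task_seconds = [42.0, 58.5, 61.0, 37.5]                              # ratio

# Nominal: only counts (and the most common category) are meaningful.
group_counts = Counter(user_group)

# Ordinal: map each answer to its rank, then report the middle answer.
order = ["rarely", "sometimes", "often", "very often"]
ranks = sorted(order.index(answer) for answer in usage)
median_usage = order[ranks[len(ranks) // 2]]  # odd sample size, so one middle value

# Interval and ratio: the arithmetic mean is meaningful.
mean_seconds = mean(task_seconds)

print(group_counts.most_common(1))  # [('frequent', 3)]
print(median_usage)                 # often
print(mean_seconds)                 # 49.75
```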

Quantitative research

Quantitative research is any research whose results can be measured numerically. It answers questions such as “how many people clicked here” or “what percentage of users are able to find the call to action?” It’s valuable in understanding statistical likelihoods and what is happening on a site or in an app.

How many Participants do you need?

The number of participants needed for research depends on the goals of your research and your tolerance for a margin of error. Generally, you need fewer participants in the first stages and more participants in the later stages to find the remaining issues.

For quantitative research, what you need to consider is how much statistical error you can tolerate. When you do the research with fewer participants, your data tend to contain more statistical error. When you do the research with many participants, your estimates tend to be closer to the true population values. That’s why the confidence interval is important.
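To get a feel for how the margin of error shrinks as participants are added, here is a small sketch. It assumes a normal approximation and a hypothetical standard deviation of 30 seconds; the function name `margin_of_error` is made up for this example, and for very small samples a t-based value would be somewhat wider.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(std_dev: float, n: int, confidence: float = 0.95) -> float:
    """Approximate half-width of a confidence interval for a mean (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * std_dev / sqrt(n)

# Hypothetical standard deviation of 30 seconds for task time.
for n in (8, 32, 128, 512):
    print(n, round(margin_of_error(30.0, n), 1))
```

Each time the sample is made four times larger, the margin of error is cut in half, which is why the last few points of precision are so expensive to buy.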

Confidence interval

A confidence interval is an estimated range of values that is likely to include the true population value for a statistic, such as a mean. You decide what level of confidence you need, such as 90% or 95%, and calculate the confidence interval to show how precise your measures actually are.

When we run usability studies we are typically targeting a particular demographic, whether that is the general population, college students, or unmarried men over 30, to give just a few examples.

Example:

Whatever our target audience, we can’t usually test all the people who are in this population due to the feasibility of accessing all these people, and the amount of time and money it would take.

Taking college students as an example, imagine we wanted to test two designs for a student library account to see which design students preferred. In 2019–2020 there were 10 lakh (1 million) students studying in the country. It would take a huge amount of time, cost, and effort to ask all those students to rate how attractive they found the two designs.

If we did have access to all the students in the country, and they all gave us their attractiveness ratings, the mean we could calculate for each design would be the actual population mean. The population mean is made up of scores from everyone who fits the demographic we want to test. It always exists, but is largely unknown, as we are very rarely able to test everyone in our population of interest.

What we are doing when we run our usability study, say using 100 students in the college, is taking a sample of the population we are interested in.

This sample size is a lot more manageable and cost-effective and we hope our 100 college students will be representative of the population of students in the country.

From this sample data, we can get an idea of how well received the designs are, and it allows us to make a more informed decision about which design to implement on the website.

The problem with taking samples is that sometimes our sample mean will be similar to the population mean, and sometimes it will be quite different. This variation is called sampling error.
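Sampling error is easy to see in a simulation. The sketch below builds a synthetic population of ratings (randomly generated on an assumed 1 to 7 scale, and scaled down to 100,000 students to keep it fast), then draws several samples of 100 and compares their means with the population mean.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic population: 100,000 attractiveness ratings on an assumed 1-7 scale.
population = [rng.randint(1, 7) for _ in range(100_000)]
population_mean = sum(population) / len(population)

def sample_mean(size: int) -> float:
    """Mean rating of one random sample drawn without replacement."""
    sample = rng.sample(population, size)
    return sum(sample) / size

# Five independent samples of 100 students each.
sample_means = [sample_mean(100) for _ in range(5)]

print("population mean:", round(population_mean, 2))
print("sample means:   ", [round(m, 2) for m in sample_means])
```

Each sample mean lands near the population mean, but none of them matches it exactly, and in a real study we would only ever see one of them.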

The trouble is, we have no idea whether our sample mean is a good or poor representation of the population mean. This is where confidence intervals come in to help!

The Need for a Measurement

Once we’ve seen three or four test participants in a row fail for the same reason, we just want to get on with fixing the problem.

But sooner or later, we’ll have to tangle with some quantitative data.

Say we have this goal for a new product: on average, we want users to be able to do a key task within 60 seconds. Assume we remembered to record the time it took each of our eight test participants to complete the task.

To get the arithmetic average, which statisticians call the mean, you add up all the times and divide by the number of participants. The average time for these participants was 54.75 seconds, which is inside our 60-second goal. So are we done?

Well, maybe. If our product has only eight users, then we’ve tested with all of them, and yes, we’re done. But what if we’re aiming at everyone? Or, let’s say we’re being more precise, and we’ve defined our target market as English-speaking Internet users from all countries.

Would the data from eight test participants be enough to represent the experience of all users?

True Population Value Compared to Our Sample

Our challenge is to work out whether we can consider the average we’ve calculated from our sample as representative of our target audience.

One way to improve our estimate would be to run more usability tests. So let’s test with eight more participants and calculate a new mean.

For this sample, the arithmetic average comes out to roughly 66 seconds, so we’ve blown our target. Perhaps we need to run more tests or do more work on the product design. Or is there a quicker way?

The Central Limit Theorem

Theorem: If you take a bunch of samples, then calculate the mean of each sample, most of the sample means cluster close to the true population mean.

Illustration:

Five of these samples met the 60-second target (image below). The individual times vary from 10 to 110 seconds, but the sample means fall in a much narrower range.

The chance that any individual sample mean is way off from the true population mean is quite small. In fact, the Central Limit Theorem also says that, for large enough samples, the sample means are approximately normally distributed, following the familiar bell curve.
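A quick simulation makes the clustering visible. This is only a sketch: it assumes task times are spread evenly between 10 and 110 seconds (so the true mean is 60 by construction), which is a simplification chosen for the example rather than a claim about real task times.

```python
import random
from statistics import mean, stdev

rng = random.Random(7)  # fixed seed so the run is repeatable

def one_task_time() -> float:
    """Hypothetical task time in seconds, uniform between 10 and 110."""
    return rng.uniform(10, 110)

# 1,000 simulated usability studies, each with 8 participants.
samples = [[one_task_time() for _ in range(8)] for _ in range(1000)]
means = [mean(s) for s in samples]
all_times = [t for s in samples for t in s]

print("spread of individual times:", round(stdev(all_times), 1))
print("spread of sample means:    ", round(stdev(means), 1))
print("average of sample means:   ", round(mean(means), 1))
```

The individual times are widely scattered, while the sample means bunch up around the true mean with a much smaller spread.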

Normal distributions also have very convenient mathematical properties. Two things define them:

  • Where the peak is: that is the mean, which is also the most likely value
  • How spread out the values are: that is the standard deviation, also known as sigma

The probability of getting any particular value depends on only these two parameters, the mean and the standard deviation.
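Python's standard library can show how far those two parameters take us. The mean of 60 and standard deviation of 10 below are hypothetical values picked for the example.

```python
from statistics import NormalDist

# Hypothetical normal distribution: mean 60 s, standard deviation 10 s.
dist = NormalDist(mu=60, sigma=10)

# The density is highest at the mean.
print(dist.pdf(60) > dist.pdf(50))  # True

# Probability of landing within one and two standard deviations of the mean.
within_1 = dist.cdf(70) - dist.cdf(50)
within_2 = dist.cdf(80) - dist.cdf(40)
print(round(within_1, 3), round(within_2, 3))  # 0.683 0.954
```

Roughly 68% of values fall within one standard deviation of the mean and roughly 95% within two, whatever the mean and standard deviation happen to be.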

The one on the left has a smaller mean and standard deviation than the one on the right.

Using the Central Limit Theorem to Find a Confidence Interval

Any mean from a random sample is likely to be quite close to the true population mean, and the normal distribution models the chance that it might be different from the true population mean.

We need to decide whether our original mean of 54.75 seconds from the first eight participants is sufficiently convincing to show that we’ve met our target of an average time on task of under 60 seconds, which would allow us to launch. We’d rather not run five more rounds of usability tests; instead, we want to estimate the true population mean.

Some values of the true population mean would make it very likely that we’d get this sample mean, while other values would make it very unlikely. The likely values make up the confidence interval, which is the range of values for the true population mean that could plausibly give us our observed value.

We’re aiming for the level of risk that is often stated as statistical significance at p < 0.05: a 5% chance, or one in 20, of being wrong, and a 95% chance of being right.

The next thing we need is a standard deviation. The only one we have is the standard deviation of our sample, which is 29.40 seconds.

Finally, we plug in the mean, which is 54.75 seconds, and the number of participants, which is 8.

The result:

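The 95% confidence interval for the mean is 29.4 to 78.6 seconds, in comparison to our target of 60 seconds. Because the interval extends well above the target, we can’t yet be confident that the true population mean meets it.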
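For readers who want to see the arithmetic, here is one common way to do it: a plain Student's t interval computed from the three summary numbers above. This is a sketch under that assumption, not necessarily the method behind the figures quoted in this article, and it gives a slightly different range. The critical value `t_crit` is read from a standard t-table for 7 degrees of freedom.

```python
from math import sqrt

sample_mean = 54.75  # seconds, from the article
sample_sd = 29.40    # seconds, from the article
n = 8                # participants, from the article

# Two-sided 95% critical value of Student's t with n - 1 = 7 degrees of
# freedom, from a standard t-table (assumption: a plain t-interval is used).
t_crit = 2.365

standard_error = sample_sd / sqrt(n)
margin = t_crit * standard_error
lower, upper = sample_mean - margin, sample_mean + margin

print(round(lower, 1), round(upper, 1))  # 30.2 79.3
```

Either way the conclusion is the same: the interval is wide, and the target sits inside it rather than safely above it.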

Confidence and Statistics

You won’t be surprised to learn that it’s the science of statistics that allows us to calculate confidence levels. When we make a statistical comparison we calculate a “p-value”, and in general, if the p-value is lower than 0.05 (5%) the result is considered statistically significant, which corresponds to a 95% confidence level. How much confidence you need depends on what you plan to do with the result:

Near Certainty (Confidence Value: 99% or greater)

Publishable in Journals (Confidence Value: 95% or greater)

Good Enough to Make Commercial Decisions (Confidence Value: 90% or greater)

Good Enough to Justify More Research/Development (Confidence Value: 80% or greater)

Better Than Tossing a Coin (Confidence Value: 51% or greater)
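The price of extra confidence is a wider interval. The sketch below uses the sample standard deviation and sample size from the worked example and a normal approximation (an assumption made to keep the code short; with only eight participants a t-based interval would be wider still). The function name `half_width` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist

sample_sd, n = 29.40, 8  # summary statistics quoted in the article
standard_error = sample_sd / sqrt(n)

def half_width(confidence: float) -> float:
    """Half-width of a two-sided interval under a normal approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * standard_error

for level in (0.80, 0.90, 0.95, 0.99):
    print(f"{level:.0%}: +/- {half_width(level):.1f} s")
```

Dropping from 95% to 90% or 80% confidence tightens the range noticeably, which is why a lower level can be a reasonable trade for a commercial decision.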

Confidence Intervals Can Save You Effort

You can compare the confidence interval you calculated with the target you were aiming for.

The confidence interval for the mean helps you to estimate the true population mean and lets you avoid the additional effort that gathering a lot of extra data would require.

Thanks for reading. 🙏

Naveen Bodapati

An enthusiastic UX designer optimizing various design principles and research methods to make products usable to a wide-arrayed customer base.