The effect of small sample size

How precise are our estimates? Confidence intervals

Jeff Sauro, James R. Lewis, in Quantifying the User Experience (Second Edition), 2016

Best point estimates for a completion rate

With small sample sizes in usability testing it is a common occurrence to have either all participants complete a task or all participants fail (100% and 0% completion rates). Although it is always possible that every single user will complete a task or every user will fail it, such an extreme result is more likely when the estimate comes from a small sample size. In our experience, such claims of absolute task success also tend to make stakeholders dubious of the small sample size. While the sample proportion is often the best estimate of the population completion rate, we have found some conditions where other estimates tend to be slightly better (Lewis and Sauro, 2006). Two other noteworthy estimates of the completion rate are:

Laplace method: Add one success and one failure

Wilson method: Add two successes and two failures (used as part of the adjusted-Wald interval).

Guidelines on reporting the best completion rate estimate

If you find yourself needing the best possible point estimate of the population completion rate consider the following rules on what to report (in addition to the confidence interval):

If you conduct usability tests in which your task completion rates typically take a wide range of values, uniformly distributed between 0% and 100%, then you should use the Laplace method. The smaller your sample size and the farther your initial estimate of the population completion rate is from 50%, the more you will improve your estimate of the actual completion rate.

If you conduct usability tests in which your task completion rates are roughly restricted to the range of 50–100% (the more common situation in usability testing), then the best estimation method depends on the value of the sample completion rate:

If the sample completion rate is:

1.

Less than or equal to 50%: Use the Wilson method (which you get as part of the process of computing an adjusted-Wald binomial confidence interval).

2.

Between 50% and 90%: Stick with reporting the sample proportion. Any attempt to improve on it is as likely to decrease as to increase the estimate’s accuracy.

3.

Greater than 90% but less than 100%: Apply the Laplace method. DO NOT use Wilson in this range to estimate the population completion rate, even if you have computed a 95% adjusted-Wald confidence interval.

4.

Equal to 100%: Use the Laplace method.

Always use an adjustment when sample sizes are small (n < 20). It does no harm to use an adjustment when sample sizes are larger. Keep in mind that even these guidelines will only slightly improve the accuracy of your estimate of the completion rate, so this is no substitute for computing and reporting confidence intervals.
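To make the two adjustments above concrete, here is a minimal Python sketch (ours, not the book's); the 5-out-of-5 result is a hypothetical example.

```python
def laplace_estimate(successes, n):
    """Laplace method: add one success and one failure."""
    return (successes + 1) / (n + 2)

def wilson_estimate(successes, n):
    """Wilson method: add two successes and two failures
    (the point estimate used by the adjusted-Wald interval)."""
    return (successes + 2) / (n + 4)

# Hypothetical small-sample result: 5 of 5 users completed the task.
x, n = 5, 5
print(f"Sample proportion: {x / n:.3f}")                    # 1.000
print(f"Laplace estimate:  {laplace_estimate(x, n):.3f}")   # 0.857
print(f"Wilson estimate:   {wilson_estimate(x, n):.3f}")    # 0.778
```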

How accurate are point estimates from small samples?

Even the best point estimate from a sample will differ by some amount from the actual population completion rate. To get an idea of the typical amount of error, we created a Monte Carlo simulator. The simulator compared thousands of small-sample estimates to an actual population completion rate. At a sample size of five, on average, the completion rate differed by around 11 percentage points from the population completion rate. Seventy-five percent of the time the completion rate differed by less than 21 percentage points (see www.measuringu.com/blog/memory-math.php).
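The authors' simulator is not reproduced in this excerpt, but the idea is easy to sketch. The Python below is a rough re-creation under assumed settings (population completion rates drawn uniformly between 0% and 100%, the sample proportion used as the estimate), so the numbers it prints will only be in the general neighborhood of the figures quoted above.

```python
import random

def estimation_error(n=5, trials=10_000, seed=1):
    """Rough Monte Carlo sketch: how far does a small-sample completion
    rate tend to fall from the true population completion rate?"""
    random.seed(seed)
    errors = []
    for _ in range(trials):
        p = random.random()                        # assumed: true rate ~ Uniform(0, 1)
        successes = sum(random.random() < p for _ in range(n))
        errors.append(abs(successes / n - p))      # error of the sample proportion
    errors.sort()
    return sum(errors) / trials, errors[int(0.75 * trials)]

mean_err, p75_err = estimation_error()
print(f"Mean absolute error:   {mean_err * 100:.1f} points")
print(f"75th percentile error: {p75_err * 100:.1f} points")
```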

The results of this simulation tell us that even a very small-sample completion rate isn't useless, even though the width of the 95% confidence interval is rather wide (typically 30+ percentage points). But given any single sample you can't know ahead of time how accurate your estimate is. The confidence interval will provide a definitive range of plausible values. From a practical perspective, keep in mind that the values in the middle of the interval are more likely than those near the edges. If 95% confidence intervals are too wide to support decision making, then it may be appropriate to lower the confidence level to 90% or 80%. See "What are reasonable criteria" in Chapter 6 for a discussion of appropriate statistical criteria for industrial decision making.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128023082000035

How Precise Are Our Estimates? Confidence Intervals

Jeff Sauro, James R. Lewis, in Quantifying the User Experience, 2012

Best Point Estimates for a Completion Rate

With small sample sizes in usability testing it is a common occurrence to have either all participants complete a task or all participants fail (100% and 0% completion rates). Although it is possible that every single user will complete a task or every user will fail it, such an extreme result is more likely when the estimate comes from a small sample size. In our experience, such claims of absolute task success also tend to make stakeholders dubious of the small sample size. While the sample proportion is often the best estimate of the population completion rate, we have found some conditions where other estimates tend to be slightly better (Lewis and Sauro, 2006). Two other noteworthy estimates of the completion rate are:

Laplace method: Add one success and one failure.

Wilson method: Add two successes and two failures (used as part of the adjusted-Wald interval).

Guidelines on Reporting the Best Completion Rate Estimate

If you find yourself needing the best possible point estimate of the population completion rate, consider the following rules on what to report (in addition to the confidence interval).

If you conduct usability tests in which your task completion rates typically take a wide range of values, uniformly distributed between 0% and 100%, then you should use the Laplace method. The smaller your sample size and the farther your initial estimate of the population completion rate is from 50%, the more you will improve your estimate of the actual completion rate.

If you conduct usability tests in which your task completion rates are roughly restricted to the range of 50% to 100% (the more common situation in usability testing), then the best estimation method depends on the value of the sample completion rate. If the sample completion rate is:

1.

Less than or equal to 50%: Use the Wilson method (which you get as part of the process of computing an adjusted-Wald binomial confidence interval).

2.

Between 50% and 90%: Stick with reporting the sample proportion. Any attempt to improve on it is as likely to decrease as to increase the estimate's accuracy.

3.

Greater than 90% but less than 100%: Apply the Laplace method. Do not use Wilson in this range to estimate the population completion rate, even if you have computed a 95% adjusted-Wald confidence interval!

4.

Equal to 100%: Use the Laplace method.

Always use an adjustment when sample sizes are small (n < 20). It does no harm to use an adjustment when sample sizes are larger. Keep in mind that even these guidelines will only slightly improve the accuracy of your estimate of the completion rate, so this is no substitute for computing and reporting confidence intervals.

How Accurate Are Point Estimates from Small Samples?

Even the best point estimate from a sample will differ by some amount from the actual population completion rate. To get an idea of the typical amount of error, we created a Monte Carlo simulator. The simulator compared thousands of small-sample estimates to an actual population completion rate. At a sample size of five, on average, the completion rate differed by around 11 percentage points from the population completion rate; 75% of the time the completion rate differed by less than 21 percentage points (see www.measuringusability.com/blog/memory-math.php).

The results of this simulation tell us that even a very small-sample completion rate isn't useless even though the width of the 95% confidence interval is rather wide (typically 30+ percentage points). But given any single sample, you can't know ahead of time how accurate your estimate is. The confidence interval will provide a definitive range of plausible values. From a practical perspective, keep in mind that the values in the middle of the interval are more likely than those near the edges. If 95 percent confidence intervals are too wide to support decision making, then it may be appropriate to lower the confidence level to 90% or 80%. See Chapter 6 for a discussion of appropriate statistical criteria for industrial decision making.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780123849687000035

Did we meet or exceed our goal?

Jeff Sauro, James R. Lewis, in Quantifying the User Experience (Second Edition), 2016

Small sample test

For small sample sizes we use the exact probabilities from the binomial distribution to determine whether a sample completion rate exceeds a particular benchmark. The formula for the binomial distribution is

$$p(x) = \frac{n!}{x!\,(n-x)!}\,p^{x}(1-p)^{n-x}$$

where x is the number of users who successfully completed the task, n is the sample size, and p is the benchmark completion rate being tested.

The computations are rather tedious to do by hand, but are easily computed using the Excel function BINOMDIST() or the online calculator available at:

measuringu.com/onep.php

The term n! is pronounced “n factorial” and is n×(n−1)×(n−2)×⋯×2×1.

Example 1

During an early stage design test eight out of nine users successfully completed a task. Is there sufficient evidence to conclude that at least 70% of all users would be able to complete the same task?

We have an observed completion rate of 8/9 = 88.9%. Using the exact probabilities from the binomial we can find the probability of obtaining eight or more successes out of nine trials if the population completion rate is 70%. To do so we find the probability of getting exactly eight successes and the probability of getting exactly nine successes.

$$p(8) = \frac{9!}{8!\,(9-8)!}\,0.70^{8}(1-0.70)^{9-8} = \frac{9!}{8!\,1!}(0.0576)(0.30)^{1} = 9(0.01729) = 0.1556$$

$$p(9) = \frac{9!}{9!\,(9-9)!}\,0.70^{9}(1-0.70)^{9-9} = \frac{9!}{9!\,0!}(0.04035)(0.30)^{0} = 0.04035(1) = 0.04035$$

In Excel:

=BINOMDIST(8, 9, 0.70, FALSE) = 0.1556
=BINOMDIST(9, 9, 0.70, FALSE) = 0.04035

So the probability of eight or nine successes out of nine attempts is 0.1556 + 0.04035 = 0.1960. In other words, there is an 80.4% chance the completion rate exceeds 70%. Whether this is sufficient evidence largely depends on the context of the test and the consequences of being wrong. This result is not suitable for publication. For many early design tests, however, this is sufficient evidence that efforts are better spent on improving other functions.
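If you want to check the Excel results in code, the same exact probabilities can be computed directly from the binomial formula; here is a minimal Python equivalent (ours, not the book's):

```python
from math import comb

def binom_pmf(x, n, p):
    """Exact probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Eight or nine successes out of nine attempts, tested against p = 0.70
p8 = binom_pmf(8, 9, 0.70)      # 0.1556
p9 = binom_pmf(9, 9, 0.70)      # 0.04035
p_exact = p8 + p9               # 0.1960
print(f"Exact p-value: {p_exact:.4f}")
print(f"Chance the completion rate exceeds 70%: {1 - p_exact:.1%}")   # 80.4%
```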

The probability we computed here is called an “exact” probability—“exact” not because our answer is exactly correct but because the probabilities are calculated exactly, rather than approximated as they are with many statistical tests such as the t-test. Exact probabilities with small sample sizes tend to be conservative—meaning they overstate the long-term probability and therefore understate the actual chances of having met the completion-rate goal.

Mid-probability

One reason for the conservativeness of exact methods with small sample sizes is that the probabilities have a finite number of possible values instead of taking on any number of values (such as with the t-test). One way to compensate for this discreteness is to simulate a continuous result by using a point in between the exact probabilities—called a mid-probability.

In the previous example we'd only use half the probability associated with the observed number of successes plus the entire probability of all values above what we observed. The probability of observing eight out of nine successes given a population probability of 70% is 0.1556. Instead of using 0.1556 we'd use ½(0.1556) = 0.07782. We add this half-probability to the probability of nine out of nine successes (0.07782 + 0.04035), which gets us a mid-p value of 0.1182. We would now state that there is an 88.2% chance the completion rate exceeds 70%. Compare this result to the exact p-value of 0.1960 (an 80.4% chance the completion rate exceeds 70%). Due to its method of computation, the mid-p will always look better than the exact-p result.
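Continuing the sketch shown after Example 1 (and reusing its binom_pmf() helper), the mid-p calculation is a one-line change: half the probability of the observed value plus all of the probability above it.

```python
# Mid-p for eight successes out of nine attempts against a 70% benchmark.
mid_p = 0.5 * binom_pmf(8, 9, 0.70) + binom_pmf(9, 9, 0.70)
print(f"Mid-p value: {mid_p:.4f}")                                  # 0.1182
print(f"Chance the completion rate exceeds 70%: {1 - mid_p:.1%}")   # 88.2%
```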

Although mid-p values tend to work well in practice, they are not without controversy (as are many techniques in applied statistics). Statistical mathematicians don't think much of the mid-p value because taking half a probability doesn't appear to have a good mathematical foundation—even though it tends to provide better results. Rest assured that its use is not just some fudge factor that tends to work. Its use is justified as a way of correcting for the discreteness in the data, like other continuity corrections in statistics. For more discussion on continuity corrections see Gonick and Smith (1993, pp. 82–87).

A balanced recommendation is to compute both the exact-p and mid-p values but emphasize the mid-p (Armitage et al., 2002). When you need just one p-value in applied user research settings, we recommend using the less conservative mid-p value unless you must guarantee that the reported p-value is greater than or equal to the actual long-term probability. This is the same recommendation we gave when computing binomial confidence intervals (see Chapter 3)—use an exact method when you need to be absolutely sure you've not understated the actual probability (and just know you're probably overstating it). For almost all applications in usability testing or user research, using just the mid-p value will suffice. Online calculators often provide the values for both methods (e.g., measuringu.com/onep.php; see Fig. 4.5).


Figure 4.5. p and mid-p results for eight successes out of nine attempts compared to criterion of 70% success

Example 2

The results of a benchmarking test showed that 18 out of 20 users were able to complete the task successfully. Is it reasonable to report that at least 70% of users can complete the task?

We have an observed completion rate of 18/20 = 90%. Using the exact probabilities from the binomial we can find the probability of obtaining 18 or more successes out of 20 trials if the population completion rate is 70%. To do so we find the probability of getting exactly 18, 19, and 20 successes.

$$p(18) = \frac{20!}{18!\,(20-18)!}\,0.70^{18}(1-0.70)^{20-18} = 0.02785$$

$$p(19) = \frac{20!}{19!\,(20-19)!}\,0.70^{19}(1-0.70)^{20-19} = 0.00684$$

$$p(20) = \frac{20!}{20!\,(20-20)!}\,0.70^{20}(1-0.70)^{20-20} = 0.000798$$

The exact p-value is 0.02785 + 0.00684 + 0.000798 = 0.0355.

The mid-p value is 0.5(0.02785) + 0.00684 + 0.000798 = 0.0216.

Both p-values are below the common α threshold of 0.05 and so both provide compelling evidence that at least 70% of users can complete the task. It’s also a result that’s suitable for publication.
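The same computation generalizes to any benchmark test. Here is a small reusable helper, sketched with scipy.stats.binom (an assumption about your environment, not something the chapter requires), applied to the 18-out-of-20 result:

```python
from scipy.stats import binom

def exact_and_mid_p(successes, n, benchmark):
    """One-sided exact and mid-p values for testing whether the
    population completion rate exceeds the benchmark."""
    at_observed = binom.pmf(successes, n, benchmark)
    above_observed = binom.sf(successes, n, benchmark)   # P(X > successes)
    return at_observed + above_observed, 0.5 * at_observed + above_observed

exact_p, mid_p = exact_and_mid_p(18, 20, 0.70)
print(f"Exact p-value: {exact_p:.4f}")   # 0.0355
print(f"Mid-p value:   {mid_p:.4f}")     # 0.0216
```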

It is generally a good idea to compute a confidence interval with every statistical test because the confidence interval will give you an idea about the precision of your metrics in addition to statistical significance. To compute the confidence interval for a one-sided test, set the confidence level to 90% (because you only care about one tail, this is a one-sided test with α equal to 0.05) and compute the interval—if the interval lies above 0.70, then you've provided compelling evidence that at least 70% of users can complete the task. As shown in Fig. 4.6, using the adjusted-Wald confidence interval we get a 90% confidence interval between 73.0% and 97.5%.
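For the interval itself, here is a minimal Python sketch of the adjusted-Wald calculation (add z²/2 successes and z²/2 failures, then apply the usual Wald formula); it should land very close to the values in Fig. 4.6, and scipy is assumed only for the normal quantile.

```python
from math import sqrt
from scipy.stats import norm

def adjusted_wald_ci(successes, n, confidence=0.90):
    """Adjusted-Wald confidence interval for a completion rate."""
    z = norm.ppf(1 - (1 - confidence) / 2)          # about 1.645 for 90%
    adj_n = n + z**2
    adj_p = (successes + z**2 / 2) / adj_n
    margin = z * sqrt(adj_p * (1 - adj_p) / adj_n)
    return adj_p - margin, adj_p + margin

low, high = adjusted_wald_ci(18, 20, confidence=0.90)
print(f"90% CI: {low:.1%} to {high:.1%}")           # about 73.0% to 97.5%
```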


Figure 4.6. Ninety percent confidence intervals for 18 of 20 successful task completions

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128023082000047

Did We Meet or Exceed Our Goal?

Jeff Sauro, James R. Lewis, in Quantifying the User Experience, 2012

Small-Sample Test

For small sample sizes we use the exact probabilities from the binomial distribution to determine whether a sample completion rate exceeds a particular benchmark. The formula for the binomial distribution is

$$p(x) = \frac{n!}{x!\,(n-x)!}\,p^{x}(1-p)^{n-x}$$

where

x is the number of users who successfully completed the task

n is the sample size

p is the benchmark completion rate being tested

The computations are rather tedious to do by hand, but are easily computed using the Excel function BINOMDIST() or the online calculator available at www.measuringusability.com/onep.php. The term n! is pronounced “n factorial” and is n×(n−1)×(n−2)× …×2×1.

Example 1

During an early stage design test eight out of nine users successfully completed a task. Is there sufficient evidence to conclude that at least 70% of all users would be able to complete the same task?

We have an observed completion rate of 8/9 = 88.9%. Using the exact probabilities from the binomial we can find the probability of obtaining eight or more successes out of nine trials if the population completion rate is 70%. To do so we find the probability of getting exactly eight successes and the probability of getting exactly nine successes:

$$p(8) = \frac{9!}{8!\,(9-8)!}\,0.7^{8}(1-0.7)^{9-8} = \frac{9!}{8!\,(1!)}(0.0576)(0.3)^{1} = 9(0.01729) = 0.1556$$

$$p(9) = \frac{9!}{9!\,(9-9)!}\,0.7^{9}(1-0.7)^{9-9} = \frac{9!}{9!}(1)(0.04035)(0.3)^{0} = 0.04035(1) = 0.04035$$

In Excel:

=BINOMDIST(8, 9, 0.7, FALSE) = 0.1556
=BINOMDIST(9, 9, 0.7, FALSE) = 0.04035

So the probability of eight or nine successes out of nine attempts is 0.1556 + 0.04035 = 0.1960. In other words, there is an 80.4% chance the completion rate exceeds 70%. Whether this is sufficient evidence largely depends on the context of the test and the consequences of being wrong. This result is not suitable for publication. For many early design tests, however, this is sufficient evidence that efforts are better spent on improving other functions.

The probability we computed here is called an “exact” probability—“exact” not because our answer is exactly correct but because the probabilities are calculated exactly, rather than approximated as they are with many statistical tests such as the t-test. Exact probabilities with small sample sizes tend to be conservative—meaning they overstate the long-term probability and therefore understate the actual chances of having met the completion-rate goal.

Mid-probability

One reason for the conservativeness of exact methods with small sample sizes is that the probabilities have a finite number of possible values instead of taking on any number of values (such as with the t-test). One way to compensate for this discreteness is to simulate a continuous result by using a point in between the exact probabilities—called a mid-probability.

In the previous example we'd only use half the probability associated with the observed number of successes plus the entire probability of all values above what we observed. The probability of observing eight out of nine successes given a population probability of 70% is 0.1556. Instead of using 0.1556 we'd use ½(0.1556) = 0.07782. We add this half-probability to the probability of nine out of nine successes (0.07782 + 0.04035), which gets us a mid-p-value of 0.1182. We would now state that there is an 88.2% chance the completion rate exceeds 70%. Compare this result to the exact p-value of 0.1960 (an 80.4% chance the completion rate exceeds 70%). Due to its method of computation, the mid-p will always look better than the exact-p result.

Although mid-p-values tend to work well in practice they are not without controversy (as are many techniques in applied statistics). Statistical mathematicians don't think much of the mid-p-value because taking half a probability doesn't appear to have a good mathematical foundation—even though it tends to provide better results. Rest assured that its use is not just some fudge-factor that tends to work. Its use is justified as a way of correcting for the discreteness in the data like other continuity corrections in statistics. For more discussion on continuity corrections see Gonick and Smith (1993, pp. 82–87).

A balanced recommendation is to compute both the exact-p and mid-p-values but emphasize the mid-p (Armitage et al., 2002). When you need just one p-value in applied user research settings, we recommend using the less conservative mid-p-value unless you must guarantee that the reported p-value is greater than or equal to the actual long-term probability. This is the same recommendation we gave when computing binomial confidence intervals (see Chapter 3)—use an exact method when you need to be absolutely sure you've not understated the actual probability (and just know you're probably overstating it). For almost all applications in usability testing or user research, using just the mid-p-value will suffice. Online calculators often provide the values for both methods (e.g., www.measuringusability.com/onep.php; see Figure 4.5).


Figure 4.5. p and mid-p results for eight successes out of nine attempts compared to criterion of 70% success.

Example 2

The results of a benchmarking test showed that 18 out of 20 users were able to complete the task successfully. Is it reasonable to report that at least 70% of users can complete the task?

We have an observed completion rate of 18/20 = 90%. Using the exact probabilities from the binomial we can find the probability of obtaining 18 or more successes out of 20 trials if the population completion rate is 70%. To do so we find the probability of getting exactly 18, 19, and 20 successes, as follows:

$$p(18) = \frac{20!}{18!\,(20-18)!}\,0.7^{18}(1-0.7)^{20-18} = 0.02785$$

$$p(19) = \frac{20!}{19!\,(20-19)!}\,0.7^{19}(1-0.7)^{20-19} = 0.00684$$

$$p(20) = \frac{20!}{20!\,(20-20)!}\,0.7^{20}(1-0.7)^{20-20} = 0.000798$$

The exact p-value is 0.02785 + 0.00684 + 0.000798 = 0.0355.

The mid-p-value is 0.5(0.02785) + 0.00684 + 0.000798 = 0.0216.

Both p-values are below the common alpha threshold of 0.05 and so both provide compelling evidence that at least 70% of users can complete the task. It's also a result that's suitable for publication.

It is generally a good idea to compute a confidence interval with every statistical test because the confidence interval will give you an idea about the precision of your metrics in addition to statistical significance. To compute the confidence interval for a one-sided test, set the confidence level to 90% (because you only care about one tail, this is a one-sided test with alpha equal to 0.05) and compute the interval; if the interval lies above 0.7, then you've provided compelling evidence that at least 70% of users can complete the task. As shown in Figure 4.6, using the adjusted-Wald confidence interval we get a 90% confidence interval between 73% and 97.5%.


Figure 4.6. 90% confidence intervals for 18 of 20 successful task completions.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780123849687000047

An introduction to correlation, regression, and ANOVA

Jeff Sauro, James R. Lewis, in Quantifying the User Experience (Second Edition), 2016

Confidence intervals for r

Throughout this book we’ve emphasized the importance of supplementing tests of significance with confidence intervals. Essentially, a test of significance of r tells you whether the confidence interval (range of plausibility) contains 0—in other words, whether 0 (no correlation at all) is or is not a plausible value for r given the data. To compute the confidence interval, you first need to transform r into z′. This is necessary because the distribution of r is skewed and this procedure normalizes the distribution.

z′=0.5ln[(1+r)/(1−r)]

This establishes the center of the confidence interval. The margin of error (d) is:

$$d = z_{(1-\alpha)}\sqrt{\frac{1}{n-3}}$$

The confidence interval around z′ is z′ ± d.

After you compute the endpoints of the interval, you need to convert them back to r.

$$r = \frac{\exp(2z') - 1}{\exp(2z') + 1}$$

Applying these formulas to the previous example (and showing the Excel method for computing them using the ln and exp functions—see Chapter 3), you get a 95% confidence interval that ranges from 0.12 to 0.97:

z′ = 0.5*ln((1 + 0.80)/(1 − 0.80)) = 1.0986

d = 1.96*(1/sqrt(7 − 3)) = 0.98

Upper bound: 1.0986 + 0.98 = 2.0786

Lower bound: 1.0986 − 0.98 = 0.1186

r_upper = (exp(2*2.0786) − 1)/(exp(2*2.0786) + 1) = 0.97

r_lower = (exp(2*0.1186) − 1)/(exp(2*0.1186) + 1) = 0.12

Given such a small sample size, the 95% confidence interval is very wide (from 0.12 to 0.97), but you now know that not only is 0 implausible given the data, 0.10 is also implausible (but 0.15 is plausible). It is important to note that the resulting confidence interval will not be symmetrical around the value of r unless the observed correlation is equal to 0.
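The whole procedure fits in a few lines of Python. The sketch below (ours, not the authors' code) follows the steps above and should reproduce the r = 0.80, n = 7 example:

```python
from math import exp, log, sqrt
from scipy.stats import norm

def correlation_ci(r, n, confidence=0.95):
    """Confidence interval for a correlation via the Fisher z' transformation."""
    z_prime = 0.5 * log((1 + r) / (1 - r))                  # transform r to z'
    d = norm.ppf(1 - (1 - confidence) / 2) / sqrt(n - 3)    # margin of error around z'

    def to_r(z):
        return (exp(2 * z) - 1) / (exp(2 * z) + 1)          # convert back to r

    return to_r(z_prime - d), to_r(z_prime + d)

low, high = correlation_ci(0.80, 7)
print(f"95% CI for r: {low:.2f} to {high:.2f}")             # about 0.12 to 0.97
```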

How to average correlations

Another use for the z′ transformation

In addition to its use when computing confidence intervals, you also need to use the z′ transformation when averaging correlations either within studies or across multiple studies. For example, in Sauro and Lewis (2009), we were exploring different ways to compute the correlations among prototypical usability metrics like task time and errors. To summarize those results, we needed to average correlations, and to do that properly, we had to (1) transform each correlation to its corresponding z′, (2) compute the average z′, and (3) transform the average z′ to r. Averaging without transformation tends to underestimate the true mean of the correlations.
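A short sketch of that three-step procedure in Python (the correlations in the list are hypothetical, just to show the mechanics):

```python
from math import exp, log

def average_correlations(rs):
    """Average correlations properly: transform each r to z',
    average the z' values, then transform the mean back to r."""
    z_primes = [0.5 * log((1 + r) / (1 - r)) for r in rs]
    mean_z = sum(z_primes) / len(z_primes)
    return (exp(2 * mean_z) - 1) / (exp(2 * mean_z) + 1)

correlations = [0.45, 0.60, 0.75]                       # hypothetical values to summarize
print(round(average_correlations(correlations), 3))     # about 0.615
print(round(sum(correlations) / len(correlations), 3))  # 0.6: the naive average is lower
```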

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128023082000102

Volume 5

Maxens Decavèle, ... Alexandre Demoule, in Encyclopedia of Respiratory Medicine (Second Edition), 2022

Observational prospective studies

Fourteen studies prospectively assessed dyspnea in intubated patients without any planned interventions on ventilator settings (Table 2). Dyspnea prevalence was assessed in eight studies and appeared to affect between 11% and 68% of patients. Dyspnea intensity was assessed in 11 studies and ranged from 10% to 63% of a dyspnea rating scale or was rated as moderate-to-severe in 48–92% of cases.

However, in addition to their small sample sizes and single-center design, these studies still present considerable heterogeneity that prevents accurate estimation of dyspnea:

Patients: only communicative patients could be evaluated in these studies; some studies only included dyspneic patients, precluding estimation of dyspnea prevalence; some studies included mixed intubated and nonintubated patients, some of whom only received noninvasive ventilation; some studies included less ventilator-dependent patients, who had entered a weaning process.

Assessment time: dyspnea was assessed either just before, during, at the end of, or after a spontaneous breathing trial (SBT) or, more generally, at any time point except during an SBT.

Assessment frequency: dyspnea was assessed either once or repeatedly throughout the ICU stay. No data are available concerning the course of dyspnea during the ICU stay, and the impact of the “dyspnea trajectory” on outcome remains unknown.

Rating scales: dyspnea was assessed using either a continuous rating scale (visual analog scale) or discontinuous rating scale with (modified Borg scale) or without descriptors (numerical rating scale), or using categorical scales (3- or 5-point Likert scale).

Unidimensional tools: dyspnea is by nature multidimensional but there is no study on multidimensional assessment of dyspnea in mechanically ventilated patients.

Dichotomous questions: in some studies, dyspnea intensity was quantified only in patients who answered “Yes” to a dichotomous question “Is dyspnea present or absent?,” but the need to use this type of dichotomous question (Yes/No) before assessing dyspnea intensity has not been clearly addressed in the literature. Moreover, almost 40% of patients who answered “No” to the question “Do you experience any breathing difficulties?” presented a nonzero score on a numerical rating scale (Campbell and Templin, 2015). Subjective sensations, such as dyspnea, likely constitute a continuum rather than a dichotomous presence or absence.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780081027233002201

Comparing two designs (or anything else!) using paired sample T-tests

Mike Fritz, Paul D. Berger, in Improving the User Experience Through Practical Data Analytics, 2015

3.1 Introduction

So, how do you feel, UX analytics guru? Did you blow 'em away with your stats prowess? How many impressed looks did you get when you started to talk "p-values"?

Well, don’t get too cocky yet. The scenario we introduced with Mademoiselle La La in the previous chapter was pretty straightforward. You just launched a survey with two different designs to two different groups and just sat back to see which one would win.

The reality is that you often don’t have the luxury of obtaining even the moderately small sample sizes illustrated in the previous chapter. Why? Because much of your job invariably revolves around conducting good, old-fashioned usability tests.

Standard usability tests are usually conducted with small sample sizes of 5–10 because (1) it's been established that larger sample sizes do not reveal more problems, and (2) conducting traditional lab studies with large populations is both time consuming and expensive.

One of the most contentious issues in the usability-testing field has been the appropriate sample size of participants needed to produce credible results. "How many participants are enough?" is an enduring question for practitioners, who often follow their intuition instead of relying on the research. They can hardly be blamed, since the research is sometimes contradictory. Sample size is a key decision that must be made before recruiting for the test, so the ongoing debate only muddies the waters when assessing the reliability of usability testing as a practice.

Virzi (1992), Nielsen and Landauer (1993), and Lewis (1994) have published influential articles on the topic of sample size in usability testing. In these articles, the authors presented a mathematical model for determining the sample size for usability tests. The authors presented empirical evidence for the models and made several important claims:

Most usability problems are detected with three to five subjects.

Running additional subjects during the same test is unlikely to reveal new information.

Most severe usability problems are detected by the first few subjects. This claim is supported by Virzi's data, but not by Lewis' data or by Law and Hvannberg's (2004) data.

Virzi’s stated goal was to improve return on investment in product development by reducing the time and cost involved in product design. Nielsen and Landauer (1993) replicated and extended Virzi’s (1992) original findings and reported case studies that supported their claims for needing only small sample sizes for usability tests.

We usually try to have a sample size of at least 8 for our usability studies. Here’s why:

1.

Although you usually do see the same problems start to repeat after the first couple of participants, there are times when your first couple of participants are outliers. That is, they either zoom through all the tasks without finding a problem or they have difficulty with all of the tasks. In the scenario where your first 2 or 3 participants out of 5 fall into the “unrepresentative” category, you’re relying on only 2 or 3 to “normalize” the data. That’s a risk we’d rather not take.

On the other hand, if you conduct the test with at least 8 participants, you have at least some “unrepresentativeness buffer” even if you run into 2 or 3 unrepresentative data points. That is, the impact of the outliers (which you do not know are outliers at that moment) is blunted somewhat by the larger sample size.

2.

When you administer your post-task-completion rating scales, you're going to have a much better chance of avoiding the extremely wide (and, thus, not so useful) confidence intervals that can plague a small sample size. That means you're going to be much more confident of the rating scale results you deliver. And, often, we find that those rating scale results can complement the task completion rates, bolstering your case.

Here's an example. Let's assume that only 2 out of 8 participants are able to complete the task "find a pair of running shoes in your size" on the retail clothing site you're testing. After the test, participants are asked to rate their agreement with the statement "Finding running shoes in my size is easy" on a scale of 1 to 5, where 1 = Strongly Disagree and 5 = Strongly Agree. Let's assume that there was an even split between "1" and "2" ratings (4 each) for an average of 1.5 and a resulting 95% confidence interval for the true mean rating of 1.5 ± 0.45.

Now assume that you ran the same test with only 4 participants and only 1 out of 4 participants was able to “find a pair of running shoes in your size.” (This is the same 0.25 proportion of successful completions as in the example with a sample size of 8.) Again, after the test, participants are asked to rate their agreement with the statement “Finding running shoes in my size is easy” on a scale of 1 to 5, where 1 = Strongly Disagree and 5 = Strongly Agree. Again, let’s assume that there was an even split between “1”s and “2”s (2 each). This time, you still have an average of 1.5, but now your confidence interval has more than doubled in size to 1.5 ± 0.92!

In either case, your post-test rating scale data complement your task completion data and bolster the case that you really do have a problem with users finding shoes in their sizes. But, in the above example, with a sample size of 8, your confidence interval size was less than half of that for a sample size of 4, meaning you have basically more than doubled the precision of your result (a quick check of both intervals is sketched after this list). In a nutshell, you've greatly bolstered your case that users have a big problem finding running shoes in their size.

3.

The preparation for creating and preparing a test for 4 versus 8 is almost the same. That is, it’s the same amount of work to write up a test plan, define the tasks, get consensus on the tasks, and coordinate the assets for the test whether you’re testing for 4 or 8. Admittedly, it’s going to take longer to recruit and actually run the tests, but it’s probably a difference of only one day of testing. But you will probably be able to report out your findings with much more statistical authority. It’s analogous to making a big pot of chili for Sunday’s football game; the prep time is the same whether you feed 2 or 8, and you’ll invariably have some chili left over.

4.

The larger sample sizes will also decrease the binomial confidence intervals for your actual proportion of task completions; you’ll learn more about this topic in Chapter 4.
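As a quick check of the two rating-scale intervals quoted in point 2 above, here is a minimal Python sketch using the hypothetical 1-and-2 ratings from that example (scipy is assumed for the t quantile):

```python
from math import sqrt
from statistics import mean, stdev
from scipy.stats import t

def mean_ci(ratings, confidence=0.95):
    """Two-sided t-based confidence interval for a mean rating."""
    n = len(ratings)
    margin = t.ppf(1 - (1 - confidence) / 2, df=n - 1) * stdev(ratings) / sqrt(n)
    return mean(ratings), margin

# Eight participants (four 1s, four 2s) versus four participants (two 1s, two 2s)
print("n=8: %.2f +/- %.2f" % mean_ci([1, 1, 1, 1, 2, 2, 2, 2]))   # 1.50 +/- 0.45
print("n=4: %.2f +/- %.2f" % mean_ci([1, 1, 2, 2]))               # 1.50 +/- 0.92
```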

For an excellent treatment of sample sizes—and specifically how to calculate exactly the correct number you'll need for different types of tests—we enthusiastically recommend "Quantifying the User Experience: Practical Statistics for User Research" by Jeff Sauro and James R. Lewis.

So, we know what you might be thinking: "How can I possibly use these advanced statistical techniques with sample sizes this small?" Well, we acknowledge that having only a small sample size is not ideal. But it's important to realize that, indeed, lots can be done with small sample sizes. The smaller the sample size, the larger the observed effect must be before we view it as a "real" effect, that is, indicative of there truly being an effect. Thus, it follows that when you conduct a test with a small sample size, you should not expect to demonstrate small, subtle effects. Quite often in UX work that's OK because, as in most fields, it's more important to find the large effects, which you can often find with a relatively small sample size.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128006351000033

Advanced Topics

Colleen McCue, in Data Mining and Predictive Analysis (Second Edition), 2015

15.1 Additional “expert options”

Several expert options, including prior probabilities and costs, have been discussed earlier. While it would be impossible to address every option available with each tool, two additional options are worth mentioning at this point given their potential value and relatively common use.

15.1.1 Boosting

Boosting methods can be used to address extremely small sample sizes or infrequent events. These methods confer additional weight or emphasis on infrequent or underreported events. While these frequently can yield greater overall accuracy, the heterogeneous nature of many patterns of criminal activity can limit the ability to use approaches like this (much like the limitations associated with the data imputation techniques described in Chapter 6), particularly if they magnify unusual or spurious findings.
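As a generic illustration of the idea (scikit-learn's AdaBoost rather than any particular vendor's expert option, with made-up data), a boosted classifier iteratively reweights the cases it gets wrong, which tends to give rare events more influence across iterations:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Hypothetical imbalanced data: 950 common cases, 50 rare events of interest.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.zeros(1000, dtype=int)
y[:50] = 1
X[:50] += 1.5                   # give the rare class some separable signal

# Boosting reweights misclassified cases, emphasizing the rare class over iterations.
model = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
print("Predicted rare-event rate:", model.predict(X).mean())
```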

15.1.2 Data Partitioning

The importance of using training and test samples was covered in Chapter 8. Different approaches to training and validating models exist, however, which use slightly different partitioning techniques. For example, some tools use a three-sample approach to data partitioning, which includes training, validation, and test samples. Like the partitioning method outlined in Chapter 8, the training sample is used to train or build the model. The difference between this approach and the one described earlier resides in the inclusion of a validation sample. The validation sample is used to provide the first estimate of the accuracy of the model created using the training data; these results frequently are also used to fine-tune the model. Finally, as described earlier, the test sample is used to evaluate the performance of the model on a new set of data.

Additional approaches to data partitioning include assigning different percentages of the data to the training and test samples. For example, a model can be trained on 80% of the data and tested on 20%, rather than the 50:50 approach outlined earlier. This approach to data partitioning can be particularly useful when modeling infrequent or rare events, as it results in an increased number of cases of interest from which to create the model without overrepresenting unusual or spurious findings, which is a limitation of boosting methods.
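A minimal Python sketch of the three-sample idea (the 60/20/20 split is illustrative, not prescribed by the chapter):

```python
import numpy as np

def partition(data, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle and split records into training, validation, and test samples."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_train = int(train_frac * len(data))
    n_valid = int(valid_frac * len(data))
    return (data[idx[:n_train]],                       # build the model
            data[idx[n_train:n_train + n_valid]],      # first accuracy estimate / tuning
            data[idx[n_train + n_valid:]])             # final evaluation on new data

records = np.arange(1000)                              # stand-in for 1,000 case records
train, valid, test = partition(records)
print(len(train), len(valid), len(test))               # 600 200 200
```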

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128002292000158

Spatial Scale, Problems of

Peter M. Atkinson, in Encyclopedia of Social Measurement, 2005

The Zonation Problem

The zonation component of the MAUP is essentially a problem of small sample size (for aggregate statistics such as the variance). The problem is that the actual realization of the sampling configuration (zonation) may have a major effect on the resulting statistics. For example, consider that a hot spot (in number of cars owned per household) exists in a given locality. If a census unit overlaps this area exactly, then the hot spot will show up clearly. If two units cross the hot spot and average its values with smaller values in neighboring areas, the hot spot will be greatly diminished in magnitude. Such effects are difficult to predict. In consequence, the single zonation provided by census bureaus such as the UK Office for National Statistics may be considered insufficient for mapping purposes. If many alternative realizations (zonations) were provided, the sampling may be adequate, and statistics such as the variance would converge to stable estimates. The problem then is that the spatial resolution is effectively increased and confidentiality may be compromised.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B0123693985003558

Experimental design

Jonathan Lazar, ... Harry Hochheiser, in Research Methods in Human Computer Interaction (Second Edition), 2017

3.3.1.2 Advantages and Disadvantages of Within-group Design

Within-group design, in contrast, requires a much smaller sample size. When analyzing the data coming from within-group experiments, we are comparing the performances of the same participants under different conditions. Therefore, the impact of individual differences is effectively isolated and the expected difference can be observed with a relatively smaller sample size. If we change an experiment with 4 conditions and 16 participants per condition from a between-group design into a within-group design, the total number of participants needed would be 16, rather than 64. The benefit of a reduced sample size is an important factor for many studies in the HCI field, where qualified participants may be quite difficult to recruit. It may also help reduce the cost of the experiments when financial compensation is provided.

Within-group designs are not free of limitations. The biggest problem with a within-group design is the possible impact of learning effects. Since the participants complete the same types of task under multiple conditions, they are very likely to learn from the experience and may get better at completing the tasks. For instance, suppose we are conducting a within-group experiment that evaluates two types of ATM: one with a button interface and one with a touch-screen interface. The task is to withdraw money from an existing account. If the participant first completes the task using the ATM with the button interface, the participant gains some experience with the ATM interface and its functions. Therefore, the participant may perform better when subsequently completing the same tasks using the ATM with the touch-screen interface. If we do not isolate the learning effect, we might draw a conclusion that the touch-screen interface is better than the button interface when the observed difference is actually due to the learning effect. Normally, the potential bias of the learning effect is the biggest concern of experimenters when deciding whether to adopt a within-group design. A Latin Square Design is commonly adopted to control the impact of the learning effect.

Another potential problem with within-group designs is fatigue. Since there are multiple conditions in the experiment, and the participants need to complete one or more tasks under each condition, the time it takes to complete the experiment may be quite long and participants may get tired or bored during the process. In contrast to the learning effect, which favors conditions completed toward the end of the experiment, fatigue hurts performance on conditions completed toward the end of the experiment. For instance, in the ATM experiment, if the touch-screen interface is always tested after the button interface, we might draw a conclusion that the touch-screen interface is not as effective as the button interface when the observed difference is actually due to the participants' fatigue. We might fail to identify that the touch-screen interface is better than the button interface because the impact of fatigue offsets the gain of the touch-screen interface. Similarly, the potential problem of fatigue can also be controlled through the adoption of the Latin Square Design.
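For reference, a simple cyclic Latin square is easy to generate in code; the sketch below (ours, using the two ATM conditions from the example) rotates the presentation order so each condition appears equally often in each position:

```python
def cyclic_latin_square(conditions):
    """Each condition appears once per row (an order given to a participant group)
    and once per column (a position in the sequence)."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]

# The ATM example: two interface conditions, so two counterbalanced orders.
for i, order in enumerate(cyclic_latin_square(["button ATM", "touch-screen ATM"]), 1):
    print(f"Order {i}: {order}")
# Participants are assigned to the orders in rotation to balance learning and fatigue.
```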

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128053904000030

What are the effects of a small sample size?

Too small a sample may prevent the findings from being extrapolated, whereas too large a sample may amplify the detection of differences, emphasizing statistical differences that are not clinically relevant.

What are the disadvantages of a small sample size?

Sample size limitations: A small sample size may make it difficult to determine whether a particular outcome is a true finding, and in some cases a type II error may occur, i.e., the null hypothesis is incorrectly accepted and no difference between the study groups is reported.