Explained: Key Mathematical Principles for Performance Testers
J.D. Meier, Carlos Farre, Scott Barber
Members of software development teams, developers, testers, administrators, and managers alike need to know how to apply mathematics and interpret statistical data in order to do their jobs effectively. Performance analysis and reporting are particularly math-intensive.
It is critical to understand the mathematical and statistical concepts used in performance testing so that performance-testing analysis and reporting can be done correctly.
- Exemplar Data Sets
- Averages
- Percentiles
- Medians
- Normal Values
- Standard Deviations
- Uniform Distributions
- Normal Distributions
- Statistical Significance
- Statistical Equivalence
- Statistical Outliers
- Confidence Intervals
- Learn the uses, meanings, and underlying concepts of common mathematical and statistical principles as they apply to performance-test analysis and reporting.
Even though there is a need to understand many mathematical and statistical concepts, many software developers, testers, and managers either do not have strong backgrounds in or do not enjoy mathematics and statistics. This leads to significant misrepresentations
and misinterpretation of performance-testing results. The information presented in this article is not intended to replace formal training in these areas, but rather to provide common language and commonsense explanations for mathematical and statistical operations
that are valuable to understanding performance testing.
Exemplar Data Sets
This article refers to three exemplar data sets for the purposes of illustration, namely:
- Data Set A
- Data Set B
- Data Set C
Data Sets Summary
The following is a summary of Data Sets A, B, and C.
Data Set A
100 total data points, distributed as follows:
- 5 data points have a value of 1.
- 10 data points have a value of 2.
- 20 data points have a value of 3.
- 30 data points have a value of 4.
- 20 data points have a value of 5.
- 10 data points have a value of 6.
- 5 data points have a value of 7.
Data Set B
100 total data points, distributed as follows:
- 80 data points have a value of 1.
- 20 data points have a value of 16.
Data Set C
100 total data points, distributed as follows:
- 11 data points have a value of 0.
- 10 data points have a value of 1.
- 11 data points have a value of 2.
- 13 data points have a value of 3.
- 11 data points have a value of 4.
- 11 data points have a value of 5.
- 11 data points have a value of 6.
- 12 data points have a value of 7.
- 10 data points have a value of 8.
Averages
An average, also known as an arithmetic mean (or mean for short), is probably the most commonly used, and most commonly misunderstood, statistic of all. To calculate an average, you simply add up all the numbers and divide the sum by the quantity
of numbers you just added. What seems to confound many people the most when it comes to performance testing is that, in this example, Data Sets A, B, and C each have an average of exactly 4. In terms of application response times, these sets of data have extremely
different meanings. Given a response time goal of 5 seconds, looking at only the average of these sets, all three seem to meet the goal. Looking at the data, however, shows that none of the data sets is composed only of data that meets the goal, and that Data
Set B probably demonstrates some kind of performance anomaly. Use caution when using averages to discuss response times and, if at all possible, avoid using averages as the only reported statistic. When reporting averages, it is a good idea to include the
sample size, minimum value, maximum value, and standard deviation for the data set.
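To make this concrete, the three exemplar data sets can be built and summarized with a short script. This is an illustrative sketch using Python's standard-library statistics module; the variable names are mine, not from the article:

```python
from statistics import mean, stdev

# The three exemplar data sets described above.
data_a = [1]*5 + [2]*10 + [3]*20 + [4]*30 + [5]*20 + [6]*10 + [7]*5
data_b = [1]*80 + [16]*20
data_c = ([0]*11 + [1]*10 + [2]*11 + [3]*13 + [4]*11
          + [5]*11 + [6]*11 + [7]*12 + [8]*10)

# All three sets share an average of exactly 4, yet their shapes differ
# widely -- which is why an average should always be reported alongside
# sample size, minimum, maximum, and standard deviation.
for name, data in (("A", data_a), ("B", data_b), ("C", data_c)):
    print(f"Data Set {name}: n={len(data)} mean={mean(data):.1f} "
          f"min={min(data)} max={max(data)} stdev={stdev(data):.1f}")
```

Running this shows identical means but very different minimums, maximums, and standard deviations across the three sets.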
Percentiles
Few people involved with developing software are familiar with percentiles. A percentile is a straightforward concept that is easier to demonstrate than define. For example, to find the 95th percentile value for a data set consisting of 100 page-response-time measurements, you would sort the measurements from largest to smallest
and then count down six data points from the largest. The 6th data point value represents the 95th percentile of those measurements. For the purposes of response times, this statistic is read “95 percent of the simulated users experienced a response time of
the 6th-slowest value
or less for this test scenario.”
It is important to note that percentile statistics can only stand alone when used to represent data that is uniformly or normally distributed with an acceptable number of outliers (see “Statistical Outliers” below). To illustrate this point, consider the exemplar
data sets. The 95th percentile of Data Set B is 16 seconds. Obviously, this does not give the impression of achieving the 5-second response time goal. Interestingly, this can be misleading as well because the 80th percentile value of Data Set B is 1 second.
With a response time goal of 5 seconds, it is likely unacceptable to have any response times of 16 seconds, so in this case neither of these percentile values represents the data in a manner that is useful for summarizing response time.
Data Set A is a normally distributed data set that has a 95th percentile value of 6 seconds, an 85th percentile value of 5 seconds, and a maximum value of 7 seconds. In this case, reporting either the 85th or 95th percentile values represents the data in a
manner where the assumptions a stakeholder is likely to make about the data are likely to be appropriate to the data.
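The counting method described above can be sketched as a small function. This is an illustrative implementation (the function name is mine, not from any library), using the simple sort-and-count approach rather than an interpolating percentile formula:

```python
def percentile_value(measurements, pct):
    """Return the given percentile using the counting method described
    above: sort from largest to smallest and step past the top
    (100 - pct) percent of the data points."""
    ordered = sorted(measurements, reverse=True)
    cutoff = len(ordered) * (100 - pct) // 100  # e.g. 5 points for the 95th of 100
    return ordered[cutoff]

# Data Set B from the exemplar data sets.
data_b = [1]*80 + [16]*20
print(percentile_value(data_b, 95))  # 16
print(percentile_value(data_b, 80))  # 1
```

As the text notes, the 95th and 80th percentiles of Data Set B tell very different stories, and neither alone summarizes the data well.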
Medians
A median is simply the middle value in a data set when sequenced from lowest to highest. In cases where there is an even number of data points and the two center values are not the same, some disciplines suggest that the median is the average of the two center data points, while others suggest choosing the value closer to the average of the entire set of data. In the case of the exemplar data sets, Data Sets A and C have median values of 4, and Data Set B has a median value of 1.
Normal Values
A normal value, also known as the mode, is the single value that occurs most often in a data set. Data Set A has a normal value of 4, Data Set B has a normal value of 1, and Data Set C has a normal value of 3.
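Both statistics are available directly in Python's standard library, which makes them easy to verify against the exemplar data sets:

```python
from statistics import median, mode

# The exemplar data sets described above.
data_a = [1]*5 + [2]*10 + [3]*20 + [4]*30 + [5]*20 + [6]*10 + [7]*5
data_b = [1]*80 + [16]*20
data_c = ([0]*11 + [1]*10 + [2]*11 + [3]*13 + [4]*11
          + [5]*11 + [6]*11 + [7]*12 + [8]*10)

# median() averages the two center points for even-sized data sets;
# mode() returns the most frequent value (the "normal value" above).
for name, data in (("A", data_a), ("B", data_b), ("C", data_c)):
    print(f"Data Set {name}: median={median(data)} mode={mode(data)}")
```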
Standard Deviations
By definition, one standard deviation is the amount of variance within a set of measurements that encompasses approximately 68 percent of all measurements in the data set (those closest to the mean); in other words, knowing the standard deviation of your data set tells
you how densely the data points are clustered around the mean. Simply put, the smaller the standard deviation, the more consistent the data. To illustrate, the standard deviation of Data Set A is approximately 1.5, the standard deviation of Data Set B is approximately
6.0, and the standard deviation of Data Set C is approximately 2.6.
A common rule in this case is: “Data with a standard deviation greater than half of its mean should be treated as suspect. If the data is accurate, the phenomenon the data represents is not displaying a normal distribution pattern.” Applying this rule, Data
Set A is likely to be a reasonable example of a normal distribution; Data Set B may or may not be a reasonable representation of a normal distribution; and Data Set C is undoubtedly not a reasonable representation of a normal distribution.
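The rule of thumb above is easy to automate. The sketch below uses the sample standard deviation from Python's statistics module; the function name is illustrative:

```python
from statistics import mean, stdev

def is_suspect(data):
    """Rule of thumb from above: data whose standard deviation exceeds
    half of its mean should be treated as suspect."""
    return stdev(data) > mean(data) / 2

# The exemplar data sets described above.
data_a = [1]*5 + [2]*10 + [3]*20 + [4]*30 + [5]*20 + [6]*10 + [7]*5
data_b = [1]*80 + [16]*20
data_c = ([0]*11 + [1]*10 + [2]*11 + [3]*13 + [4]*11
          + [5]*11 + [6]*11 + [7]*12 + [8]*10)

for name, data in (("A", data_a), ("B", data_b), ("C", data_c)):
    print(f"Data Set {name}: stdev={stdev(data):.1f} suspect={is_suspect(data)}")
```

Only Data Set A (standard deviation of roughly 1.5 against a mean of 4) passes the rule; Data Sets B and C are both flagged as suspect.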
Uniform Distributions
Uniform distributions, sometimes known as linear distributions, represent a collection of data that is roughly equivalent to a set of random numbers evenly spaced between the upper and lower bounds. In a uniform distribution, every number in the data
set is represented approximately the same number of times. Uniform distributions are frequently used when modeling user delays, but are not common in response time results data. In fact, uniformly distributed results in response time data may be an indication
of suspect results.
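Modeling user delays with a uniform distribution, as mentioned above, is a one-liner in most languages. This sketch assumes an illustrative 3-to-10-second think-time range:

```python
import random

def think_time(min_s=3.0, max_s=10.0):
    """Uniformly distributed user "think time": every delay between the
    bounds is equally likely. The 3-10 second range is illustrative."""
    return random.uniform(min_s, max_s)

delays = [think_time() for _ in range(1000)]
```

A histogram of these delays would be roughly flat, which is exactly the shape that should make you suspicious if it ever appears in response-time results.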
Normal Distributions
Also known as bell curves, normal distributions are data sets whose member data are weighted toward the center (or median value). When graphed, the shape of the “bell” of normally distributed data can vary from tall and narrow to short and squat,
depending on the standard deviation of the data set. The smaller the standard deviation, the taller and more narrow the “bell.” Statistically speaking, most measurements of human variance result in data sets that are normally distributed. As it turns out,
end-user response times for Web applications are also frequently normally distributed.
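Normally distributed response times can be simulated the same way. The sketch below draws from a normal distribution (the 4-second mean and 1.5-second standard deviation are illustrative) and confirms the roughly-68-percent-within-one-standard-deviation property described earlier:

```python
import random

random.seed(1)  # deterministic for illustration

# Simulated end-user response times drawn from a normal distribution.
times = [random.gauss(4.0, 1.5) for _ in range(10_000)]

within_one_sd = sum(1 for t in times if abs(t - 4.0) <= 1.5) / len(times)
print(f"within one standard deviation: {within_one_sd:.0%}")  # close to 68%
```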
Statistical Significance
Mathematically calculating statistical significance, or reliability, based on sample size is a task that is too arduous and complex for most commercially driven software-development projects. Fortunately, there is a commonsense approach that is both efficient
and accurate enough to identify the most significant concerns related to statistical significance. Unless you have a good reason to use a mathematically rigorous calculation for statistical significance, a commonsense approximation is generally sufficient.
In support of the commonsense approach described below, consider this excerpt from a StatSoft, Inc. (http://www.statsoftinc.com) discussion on the topic:
There is no way to avoid arbitrariness in the final decision as to what level of significance will be treated as really ‘significant.’ That is, the selection of some level of significance, up to which the results will be rejected as invalid, is arbitrary.
Typically, it is fairly easy to add iterations to performance tests to increase the total number of measurements collected; the best way to ensure statistical significance is simply to collect additional data if there is any doubt about whether or not the collected
data represents reality. Whenever possible, ensure that you obtain a sample size of at least 100 measurements from at least two independent tests.
Although there is no strict rule about how to decide which results are statistically similar without complex equations that call for huge volumes of data that commercially driven software projects rarely have the time or resources to collect, the following
is a reasonable approach to apply if there is doubt about the significance or reliability of data after evaluating two test executions where the data was expected to be similar. Compare results from at least five test executions and apply the rules of thumb
below to determine whether or not test results are similar enough to be considered reliable:
- If more than 20 percent (or one out of five) of the test-execution results appear not to be similar to the others, something is generally wrong with the test environment, the application, or the test itself.
- If a 90th percentile value for any test execution is greater than the maximum or less than the minimum value for any of the other test executions, that data set is probably not statistically similar.
- If measurements from a test are noticeably higher or lower, when charted side-by-side, than the results of the other test executions, it is probably not statistically similar.
- If one data set for a particular item (e.g., the response time for a single page) in a test is noticeably higher or lower, but the results for the data sets of the remaining items appear similar, the test itself is probably statistically similar (even though it is probably worth the time to investigate the reasons for the difference of the one dissimilar data set).
Statistical Equivalence
The method above for determining statistical significance is actually applying the principle of statistical equivalence. Essentially, the process outlined above for determining statistical significance could be restated as “Given results data from multiple
tests intended to be equivalent, the data from any one of those tests may be treated as statistically significant if that data is statistically equivalent to 80 percent or more of all the tests intended to be equivalent.” Mathematical determination of equivalence
using formal methods such as chi-squared tests and t-tests is not common on commercial software-development projects. Rather, it is generally deemed acceptable to estimate equivalence by using charts similar to those used to determine statistical significance.
Statistical Outliers
From a purely statistical point of view, any measurement that falls outside of three standard deviations, or approximately 99.7 percent, of all collected measurements is considered an outlier. The problem with this definition is that it assumes that the collected measurements are both statistically significant and distributed normally, which is not at all automatic when evaluating performance-test data.
For the purposes of this explanation, a more applicable definition of an outlier, from a StatSoft, Inc. (http://www.statsoftinc.com) discussion, is the following:
Outliers are atypical, infrequent observations: data points which do not appear to follow the distribution of the rest of the sample. These may represent consistent but rare traits, or be the result of measurement errors or other anomalies which should not be modeled.
Note that this (or any other) description of outliers only applies to data that is deemed to be a statistically significant sample of measurements. Without a statistically significant sample, there is no generally acceptable approach to determining the difference
between an outlier and a representative measurement.
Using this description, results graphs can be used to determine evidence of outliers — occasional data points that just don’t seem to belong. A reasonable approach to determining if any apparent outliers are truly atypical and infrequent is to re-execute the
tests and then compare the results to the first set. If the majority of the measurements are the same, except for the potential outliers, the results are likely to contain genuine outliers that can be disregarded. However, if the results show similar potential
outliers, these are probably valid measurements that deserve consideration.
After identifying that a data set appears to contain outliers, the next question is: how many outliers can be dismissed as “atypical, infrequent observations”?
There is no set number of outliers that can be unilaterally dismissed, but rather a maximum percentage of the total number of observations. Applying the spirit of the two definitions above, a reasonable conclusion would be that up to 1 percent of the total
values for a particular measurement that are outside of three standard deviations from the mean are significantly atypical and infrequent enough to be considered outliers.
In summary, in practice for commercially driven software development, it is generally acceptable to say that values representing less than 1 percent of all the measurements for a particular item that are at least three standard deviations off the mean are candidates
for omission in results analysis if (and only if) identical values are not found in previous or subsequent tests. To express the same concept in a more colloquial way: obviously rare and strange data points that can’t immediately be explained, account for
a very small part of the results, and are not identical to any results from other tests are probably outliers.
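The three-standard-deviations and less-than-1-percent criteria can be combined into a simple screening function. This is a sketch (the function name and sample data are illustrative); it does not replace the cross-test comparison step, which still has to be done against previous or subsequent runs:

```python
from statistics import mean, stdev

def candidate_outliers(measurements):
    """Flag values more than three standard deviations from the mean,
    and treat them as outlier candidates only if they account for less
    than 1 percent of all measurements (the rule of thumb above)."""
    m, s = mean(measurements), stdev(measurements)
    flagged = [x for x in measurements if abs(x - m) > 3 * s]
    return flagged if len(flagged) < 0.01 * len(measurements) else []

# 199 ordinary response times plus one anomalous spike (illustrative data).
times = [1.0] * 199 + [100.0]
print(candidate_outliers(times))  # [100.0]
```

Any value this function returns is only a candidate for omission; as the text notes, it should still be tracked and compared across tests before being excluded from results summaries.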
A note of caution: identifying a data point as an outlier and excluding it from results summaries does not imply ignoring the data point. Excluded outliers should be tracked in some manner appropriate to the project context in order to determine, as more tests
are conducted, if a pattern of concern is identified in what by all indications are outliers for individual tests.
Confidence Intervals
Because determining levels of confidence in data is even more complex and time-consuming than determining statistical significance or the existence of outliers, it is extremely rare to make such a determination during commercial software projects. A confidence interval for a specific statistic is the range of values around the statistic within which the “true” statistic is likely to be located, at a given level of certainty.
Because stakeholders do frequently ask for some indication of the presumed accuracy of test results (for example, “What is the confidence interval for these results?”), another commonsense approach must be employed.
When performance testing, the answer to that question is directly related to the accuracy of the model tested. Since in many cases the accuracy of the model cannot be reasonably determined until after the software is released into production, this is not a
particularly useful dependency. However, there is a way to demonstrate a confidence interval in the results.
By testing a variety of scenarios, including what the team determines to be “best,” “worst,” and “expected” cases in terms of the measurements being collected, a graphical depiction of a confidence interval can be created, similar to the one below.
In this graph, a dashed line represents the performance goal, and the three curves represent the results from the worst-case (most performance-intensive), best-case (least performance-intensive), and expected-case user community models. As one would expect,
the blue curve from the expected case falls between the best- and worst-case curves. Observing where these curves cross the red line, one can see how many users can access the system in each case while still meeting the stated performance goal. If the team
is 95-percent confident (by their own estimation) that the best- and worst-case user community models are truly best- and worst-case, this chart can be read as follows: the tests show, with 95-percent confidence, that between 100 and 200 users can access the
system while experiencing acceptable performance.
Although a confidence interval of between 100 and 200 users might seem quite large, it is important to note that without empirical data representing the actual production usage, it is unreasonable to expect higher confidence in results than there is in the
models that generate those results. The best that one can do is to be 100-percent confident that the test results accurately represent the model being tested.