Since many of the common statistical tests require the assumption of a normal distribution. SCRAS is designed to test this assumption. The system helps a user to select an appropriate test: Chi-square, Kolmogorov-Smirnov, or Shapiro-Wilk. These are three common statistical tests that can be used to test the assumption of normality. Since Kolmogorov-Smirnov is designed for continuous data, and Shapiro-Wilk is designed specifically to test normality, these tests were chosen to test the normality assumption of a distribution.
The Shapiro-Wilk test can test whether the samples come from a normal population.
In this system, it is used for when the sample size is less than 20 because Shapiro-Wilk performs better than Kolmogorov-Smirnov test when the sample size is below 20 (Shapiro 1965).
The null hypothesis associated with this test is that the samples are drawn from a population of normal distribution. The alternative hypothesis is that the samples do not come from a population of normal distribution. The mean and variance of the distribution are assumed to be unspecified.
The Kolmogorov-Smirnov test can test whether the samples come from a specified distribution. In this system, it is used to check the normal distribution when the sample size is between 20 and 30. Compared with the Shapiro-Wilk test, the Kolmogorov-Smirnov test is easier to calculate and computation time is shorter.
The null and alternative hypotheses are the same as in the Shapiro-Wilk test. The means and variance of the distribution are unspecified.
Let S(x) be the empirical distribution function based on the ordered observations X1, X1, X2, …, Xn Specifically,
S(x) = the proportion of sample observations less than or equal to x.
3.2 Paired Tests
Statistical tests fall into two classifications: parametric and non- parametric. The parametric tests are applied to a problem whose distribution is specified except for the values of a finite number of parameters. However, nonparametric tests do not assume that the distribution of a underlying population is known. We will employ parametric and nonparametric tests in the two-sample case in this section.
If two heuristic algorithms were computed with respect to the same problem set, the observations are paired. Three test procedures for analyzing paired observations are provided in this system: paired t-test, Wilcoxon signed rank test, and paired-normal test. The statistical test that will be used depends on the underlying distribution and sample sizes of the data.
Two-Sample Parametric Tests
The paired t-test is used to test whether two means are different when observations are paired and when it can be assumed that the differences are normally distributed. The data consist of n (£
30) pairs of observations of the form (xi, yi), i = 1, 2, …, n. It is assumed that the n pairs are independent of each other. But, within a pair, the two observations are not independent. The differences di = xi - yi, i = 1, 2, …, n, are assumed to be a random sample from a normal distribution. The null hypothesis is that the two algorithms have equal mean costs, and the alternative hypothesis is that the two algorithms do not have equal mean costs.
Let
be denote the sample mean of the differences and Sd denote the sample standard deviation of the differences.
The test statistic is defined as follows:
The test statistic has a t-distribution with n-1 degrees of freedom under the null hypothesis. The null hypothesis is rejected at the a
significance level if
depending on whether the alternative hypothesis is Ha1 : E[x] ¹
E[y], or Ha2 : E[x] > E[y], or Ha3 : E[x] < E[y]. Here t(a
) is chosen, where T is a random variable having a t-distribution with n-1 degrees of freedom so that P(T > t(a
)) = a
. This value can be obtained using tables with n-1 degrees of freedom.
The paired normal-test is used to test whether two means are different when the
observations are paired and the sample size is larger than 30. The procedure is the same as the paired t-test except that the critical region is changed. The test statistic now has an approximate standard normal distribution if the null hypothesis is true. The null hypothesis is rejected at the a
significance level if
depending upon whether the alternative hypothesis is Ha1 : E[x] ¹
E[y], or Ha2 : E[x] > E[y], or Ha3 : E[x] < E[y]. Here Z(a
) is chosen, where Z is a random variable having a normal distribution with mean 0 and variance 1, so that P(Z > Z(a
)) = a
. This value can be obtained using the standard normal tables.
Two-Sample Nonparametric-Tests
For comparing two samples when the paired differences cannot be assumed to be a sample from a normal distribution and the sample size is less than 30, the Wilcoxon signed rank test is used. It can determine whether the two populations from which the samples are selected have the same median. The data consist of n pairs of observations (xi, yi), i = 1, 2, …, n, and the observed differences xi - yi for i = 1, 2, …, n, which are ranked in the increasing order of magnitude. If any di = 0, the corresponding observation is dropped from the analysis and n is decreased by one. If several pairs have absolute differences that are equal to each other, the average of the ranks is assigned to each of these pairs.
It is assumed that the differences d1, d2, …, dn are mutually independent and come from a continuous population that is symmetric about 0. The null hypothesis is that the two algorithms have equal median costs, and the alternative hypothesis is that the two algorithms do not have equal median costs. The critical value W(a
) can be approximated by
where Z(a
) is the standard normal fractile and is determined in such a manner that a proportion of a
of the area is to the left of Z(a
). Let w be the sum of the signed ranks, Ri, where Ri, is defined as follows:
The null hypothesis is rejected at the a
level of significance if
w > W(1 - a
/2) or w < W(a/2),
w < W(a) or
w > W(1-a),
depending upon whether the alternative hypothesis is Ha1 : E[x] ¹
E[y], or Ha2 : E[x] < E[y], or Ha3 : E[x] > E[y].
3.3 Three or More Samples Tests
For comparing the efficiency of three of more heuristic algorithms, the data are derived from the execution of several algorithms on a same problem set. There are two options in this system: parametric two-way analysis of variance and the Friedman test. If the null hypothesis that all of the algorithms have the same mean cost is rejected, a multiple comparison procedure may be used to determine which algorithms were significantly different. Duncan’s multiple range test is used in the parametric case. A different multiple comparisons method is used following the Friedman test.
Parametric Two-Way Analysis of Variance Test
The parametric two-way analysis of variance test is used to compare three or more different heuristic algorithms if the underlying populations can be assumed to be normal. The null hypothesis is that all (three or more) of the algorithms have equal mean costs, and the alternative hypothesis is that all (three or more) of the algorithms do not have equal mean costs under the assumption of normal distributions with a common variance.
The analysis of variance (ANOVA) table for comparing mean costs for k algorithms appears as follows:
|
Source of
Variance |
Degrees of Freedom |
Sum of Squares |
Mean Square |
F-ratio |
Algorithm
(Treatments)
Problems
(Blocks)
Error
Total |
k-1
n-1
(n-1)(k-1)
nk-1 |
SSTR
SSB
SSE
SST |
MSTR = SSTR/(k-1)
MSB = SSB/(n-1)
MSE = SSE/(n-1)(k-1) |
MSTR/ MSE
MSB / MSE |
Where
SSE=SST – SSB - SSTR,
Xij = observation using problem i, algorithm j,
Bi = sum of observations using problem i,
Tj = sum of observations using algorithm j.
The null hypothesis is rejected at the a
significance level if
F = MSTR / MSE ³
F(a
, k-1, (n-1)(k-1))
where F(a
, k-1, (n-1)(k-1)) is the upper a
percentile of the F-distribution with degrees of freedom = (k-1, (n-1)(k-1)).
A significant F-ratio indicates that there are some differences between algorithms. The Duncan’s multiple range test is used to compare individual algorithms if the parametric two-way analysis of variance test results in rejection of the null hypothesis. The procedure of Duncan’s multiple range test is as follows:
Step 1. The mean costs of algorithms are ordered from the smallest to the largest.
Step 2. The observed difference between two means is compared to the least significance range
where a
is the significance level, a is the number of algorithms, and f is the number of degrees of freedom for error. The value, r(a
, p, f), can be obtained from Duncan’s table of significant ranges, starting with the largest versus the smallest, which would be compared with R(a). Next, the difference between the largest and the second smallest is compared with R(a-1). These comparisons are performed until all means have been compared with the largest mean. Finally, the difference between the second the largest and the smallest is compared against R(a-1).
Step 3. The two means are declared to be significantly different if the observed difference between two means is larger than the corresponding least significant range. No two means are regarded as significantly different if the two means involved fall between two other means that do not differ significantly (Montgomery 1984).
Friedman Test
The Friedman test is a nonparametric counterpart of the parametric two-way analysis of variance test and is used to test whether three or more different heuristic algorithms’ median costs are equal. The underlying populations do not have to be normal. The test may also be used only if rank data are available.
The data are set in a randomized complete block design with n problems, each being solved using k algorithms. Within each problem the cost of algorithms are ranked from least to greatest. Next, the ranks of the costs are summed for each algorithm. In the case of ties, average ranks are used. The null hypothesis is that all the algorithms (three or more) have equal median costs, and the alternative hypothesis is that at least one of the algorithms has a different median cost.
Let Rij be the rank (from 1 to k) assigned to method j on problem i. It will equal 1 if it is the lowest value among the algorithms. In the case of ties, average ranks are used. The test statistic is defined as following:
where
for j =1, 2, …, k,

The null hypothesis is rejected at the a
significance level if the value of the test statistic exceeds the 1-a
quantitle of the F-distribution with k-1 and (n-1)(k-1) degrees of freedom.
The multiple comparison approach is used to compare individual algorithm only if the Friedman test results in rejection of the null hypothesis. Algorithms i and j are considered different if the following inequality is satisfied:

where Ri, Rj, Af, and Bf are given previously, and t(
) is a critical value on the t-table using (n-1)(k-1) degrees of freedom (a
/2 = P(t(n-1)(t-1)>T(a
/2))). The value for a
is the same one used in the Friedman test.
Expected Utility Approach
The Friedman test is used to test whether the different heuristic algorithms’ median costs are equal. The multiple comparison method is used to determine which algorithm has a significantly different median cost if the Friedman test is rejected. These procedures are tests of location (specifically, mean or median) and not tests of distribution shape or dispersion. Thus results can sometimes be misleading. The expected utility approach is introduced to help identify the ‘best’ of the heuristic algorithms. The response of the expected utility approach is the value of expected utility function, while the previous analysis is based on the total route distance.
A utility function must be defined before using the expected utility approach. In this system, a utility function, u(x), is used which is constantly risk averse. In particular, we set u(x) = a
- b
etx where x is the percentage deviation from the lower bound. The first two parameters, a
and b
, can be chosen arbitrarily. The third parameter, t, gives a measure of risk aversion for the utility function. A decision maker is constantly risk averse u(x)’’/if u(x)’ is positive (Keeney and Raiffa 1976). The expected utility approach is outlined below as given in Golden and Stewart (Golden and Stewart 1985):
Step 1. Fit a gamma distribution to the histogram of frequency vs. percentage deviation from the lower bound.
Step 2. Select a risk-averse decreasing utility function of the form u(x) = a
- b
etx, where a
, b
, t > 0.
Step 3. Calculate the expected utility for each heuristic algorithm and select the one that yields the largest value.
The parameter, t, must be less than the sample mean divided by the standard deviation for each heuristic algorithm. In SCRAS, we selected a utility function where a
= 600, b
= 100, and t = .05. These values are recommended in Golden and Stewart (1985).
4. Implementation of the Framework
The system structure of the SCRAS is shown in Figure 5. The following descriptions present figures which are screen dumps from some of the more important SCRAS screens:
Figure 5. The system chart of the SCRAS
Figure 6 displays the various heuristic algorithms that can be selected for use in SCRAS. Two parameters, the number of vehicles and the utilization factor, need to be entered before executing . The example shown on the screen indicates that the following two items are entered : the number of vehicles is three, and the utilization factor is .95. The example also shows that the following three algorithms are selected : F-J (Fisher-Jaikumar), C-W(Clarke-Wright), and N-W (Nygard-Walker).
Figure 6. Display the existing heuristic algorithms and provide the required parameters needed to be used to solve problems.
Figure 7 shows the statistical tests that are available in SCRAS. Three types of statistical tests are provided. For test normality, the Shapiro-Wilk test and
Kolmogorov-Smirnov test are provided. For testing the difference between two populations, the tests given are the paired-t test, Wilcoxon signed rank test, and paired-normal test. For testing the difference between more than two populations, The tests given are the parametric two-way analysis of variance and the Friedman test furthermore, the information about how to determine the sample size is supplied.
Figure 7. Display of the statistical tests available in SCRAS.
Figure 8 shows the criteria that SCRAS uses to select an appropriate method of comparison when the user clicks the mouse on the “System” (selected by the system) button in Figure 3.
Figure 8. Show the criteria that SCRAS used to select an appropriate method of comparison when the user clicks the mouse on the "System"
Figure 9 shows the output results and give a statistical interpretation.
SCRAS has been implemented in personal computers running under the Windows version. The language chosen for the implementation was Borland C++.
Figure 9. After performing a statistical test, the user is provided with on-screen assistance in interpreting the results.
5. Conclusions
SCRAS provides an easy way to compare two or more heuristic algorithms. It helps the user to i) determine how many test runs are needed; ii) decide which statistical test should be used; iii) generate new test problems; and iv) decide which algorithm is better under certain conditions by performing a statistical test. Furthermore, the results can be easily compared in a graphical form.
SCRAS significantly reduces algorithm comparison time by providing everything that is necessary for the user to compare algorithms for the vehicle routing problems in one system. It also reduces comparison time by guiding the user in selecting an appropriate statistical test. All steps involved in comparisons can be performed within a few strokes. SCRAS also reduces human mistakes that can occur in file transfer and data format changes. SCRAS was implemented in a GUI window environment. Thus, the user has a menu environment for reference available even while pursuing a statistical analysis in another window.
Our experience in SCRAS is successful. It provides all the functions needed when a newly developed algorithm is to be compared with the other existing heuristic algorithms. The same idea can be applied to the other problem solvers. In most cases, the algorithm comparison task have been reduced from several weeks to a few hours.
References
Bratcher, T.L., M.A. Moran, and W.J. Zimmer. (1970). Tables of Sample Size in Analysis of Variance, Journal of Quality Technology, Vo1. 2, No. 3, 156-164.
Clarke, G. and J.W. Wright. (1964). Scheduling of Vehicles From a Central Depot to a Number of Delivery Points, Operations Research 12, 568-581.
Conover, W.J. (1980. Practical Nonparametric Statistics, 2nd ed. John Wiley and Sons, New York.
Fisher, M. and R. Jaikumar. (1981). A Generalized Asignment Heuristics for Vehicle Routing, Networds 11, 109-124.
Golden, B.L. and W.R. Stewart. (1985). Empirical Analysis of Heuristics, in The Traveling Salesman Problem, E. Lawler, J. Lenstra, A. Rinnooy Kan, and d. Shmoys, eds. John Wiley and Sons, New York, 207-249.
Keeney, R.L. and H. Raiffa. (1976). Decisions with Multiple Objectives: Preferences and Value Tradeoffs, John Wiley and Sons, New York.
Lenstra, J. and A. Rinnooy Kan. (1981). Complexity of Vehicle Routing and Scheduling Problems, Networks 11, 221-227.
Makridakis, S., R.L. Winkler, and D.M. Harde. (1988). ISP User’s Guide, Lincoln Systems Corporation, Westford, MA 01886.
Miller, L.H. (1965). Table of Percentage Points of Kolmogorov Statistics, J. Amer.
Statist. Assoc., 51, 111-121.
Montgomery, D.C. (1984). Design and Analysis of Experiments, 2nd ed. John Wiley and Sons, New York.
Nelson, M.D. (1983). Implementation Techniques for the Vehicle Routing Problem, M.S. Thesis, Department of Computer Science, North Dakota State University.
Neter, J., W. Wasserman, and M.h. Kutner. (1985). Applied Linear Statistical Models, 2nd ed. RICHARD D. IRWIN, Illinois.
Nygard, K.E., P. Juell, and K. Nagesh. (1989). An interactive Decision Support System for Vehicle Routing, in Impact of Recent computer Advances on Operations Research, Ramesh Sharda, Bruce golden, Edward Wasil, Osman Balci, and William Stewart, Eds., Elsevier Science, New York, 361-372.
Nygard, K.E. and R. Walker. (1988). A Vehicle Routing Algorithms Based on Spacefilling curves, Technical Report, department of computer Science and Operations Research, North Dakota State University.
Shapiro, S.S. and M.B. Wilk. (1965). An analysis of Variance Test for Normality (Complete Samples), Biometrika 62, 3 and 4, 591-611.
Solomon, M.M. (1983). Vehicle Routing and Scheduling with Time Window Constraints: Models and Algorithms, Working Paper: 83-42, Northeastern University.
Tiku, M.L. (1965). Laguerre Series Forms of Non-central c
2 and F Distributions, Biometrika 52,415-427.
Tiku, M.L. (1967). Tables of the Power of the F-Test, Journal of the American Statistical Association, Vol. 62, No. 318, 525-539.
Walpole, R.E. and R.H. Myers. (1989). Probability and Statistics for Engineers and Scientists , 4th ed., MacMillan, New York.