Multivariate outlier
Overview
In statistics, an outlier is a case that deviates notably from the typical range or pattern of observations exhibited by other cases.
It's important to distinguish between univariate, bivariate, and multivariate outliers.
Univariate outliers matter, in the context of MLR, only insofar as they contribute to bivariate and/or multivariate outliers, although normally distributed variables improve the solution.
Bivariate outliers (check scatterplots) matter if they influence the lines of best fit. If unsure, remove the outlying data points and recalculate the correlation. Does it make any difference? If not, the bivariate outlier may as well be retained. If there is a difference, decide which sample to use.
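To illustrate this check outside of a statistics package, here is a minimal Python sketch with made-up data (the values and the choice of Pearson's r are purely illustrative): the correlation is computed with and without the suspected outlying point and the two results are compared.

```python
# Minimal sketch of the "recalculate the correlation" check, using made-up data.
import numpy as np
from scipy.stats import pearsonr

x = np.array([1, 2, 3, 4, 5, 6, 7, 20.0])   # last point is a suspected bivariate outlier
y = np.array([2, 4, 5, 8, 9, 12, 13, 5.0])

r_all, p_all = pearsonr(x, y)                # correlation with the point included
r_trim, p_trim = pearsonr(x[:-1], y[:-1])    # correlation with the point removed

print(f"r with the point:    {r_all:.2f} (p = {p_all:.3f})")
print(f"r without the point: {r_trim:.2f} (p = {p_trim:.3f})")
# If the two correlations are similar, the point may as well be retained;
# if they differ noticeably, decide which sample better represents the data.
```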
It is also possible to have multivariate outliers (MVOs), which are cases with an unusual combination of scores on different variables.
An assumption of many multivariate statistical analyses, such as MLR, is that there are no multivariate outliers.
MVOs can be detected by calculating and examining Mahalanobis' Distance (MD) or Cook's D. These statistics can usually be requested through a statistical analysis software program, as part of the options or save menus in the linear regression function. Selecting these options will save an MD and a D value in the data file for each case. These values indicate how extreme or influential each case is with regard to the combination of variables included in the MLR design.
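For readers working outside SPSS, the same per-case statistics can be computed directly. The following is a minimal Python sketch using simulated data and hypothetical variable names (x1, x2, y): it saves the squared Mahalanobis distance of each case from the centroid of the predictors (the quantity compared against χ2) and Cook's D from the fitted regression as extra columns, analogous to the values SPSS adds to the data file.

```python
# Minimal sketch: save an MD and a Cook's D value for each case (simulated data).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
data = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
data["y"] = 2 * data["x1"] - data["x2"] + rng.normal(size=50)

predictors = ["x1", "x2"]
X = data[predictors].to_numpy()

# Squared Mahalanobis distance of each case from the predictor centroid
# (the value that is compared against a chi-square critical value).
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
data["MD"] = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# Cook's distance for each case, from the fitted MLR model
model = sm.OLS(data["y"], sm.add_constant(data[predictors])).fit()
data["cooks_D"] = model.get_influence().cooks_distance[0]

print(data[["MD", "cooks_D"]].describe())
```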
If any MVO test statistics exceed critical values, then caution should be used in interpreting the results - they may be influenced, in part, by particular cases. If the MD and/or D values indicate the presence of MVOs, then:
- Sort the data file by descending order of the MD value (see the sketch after this list)
- Closely examine the cases with MVO outlier test statistics that exceed critical values.
- In particular, check the values for these cases for each of the variables involved in the analysis.
- Can you work out why these cases appear to be MVOs? (What is each case particularly high or low on? How severely do each case's results deviate from typical responses?)
- Try the inferential analysis (e.g., MLR) with and without these cases. What difference does it make to the results?
- If no difference, then you may as well include the cases.
- If it does make a noticeable difference to the results when the MVO cases are removed, then consider which solution is more valid.
- If in doubt, perhaps present both sets of results.
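A rough way to carry out the sorting and inspection steps programmatically is sketched below; it continues the simulated data and the hypothetical MD column from the previous sketch, and expresses the most extreme case's values as z-scores to show what it is particularly high or low on.

```python
# Minimal sketch: sort by MD (descending) and examine the most extreme cases.
# Assumes `data` and `predictors` from the previous sketch.
ranked = data.sort_values("MD", ascending=False)
print(ranked.head())                     # cases most likely to be MVOs appear first

# Express the most extreme case's values as z-scores relative to the sample.
top_case = ranked.iloc[0]
cols = predictors + ["y"]
z_scores = (top_case[cols] - data[cols].mean()) / data[cols].std()
print(z_scores.round(2))
```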
Mahalanobis' Distance
- Calculate the MD for each case (by conducting an MLR and selecting this option from the Save dialog box)
- From the results table for residual statistics, check the maximum MD
- If the maximum value is greater than the critical χ2 value (with df = the number of predictors and p = .001 (for MLR)), then there could be at least one case which is an MVO (see the sketch after this list).
- Check a χ2 table (see below).
- Investigate further by consulting the data file - note that each case should now have an automatically calculated MD column (far right in SPSS)
- Sort the data in descending order by the MD (right-click on the column heading and sort)
- Identify which cases have an MD which exceeds the critical χ2 value
- Examine these flagged cases - they have an unusual combination of values for the variables involved in the MLR. Try to work out why (e.g., the relationship amongst the values may be in the opposite direction to that for most respondents).
- The key question is - are these cases having undue influence on the MLR? To find out, try running the MLR with and without the MVO cases - does it make any difference?
- If not, keep the cases.
- If their removal makes a difference, then consider reporting the results without the MVOs - or perhaps both sets of results.
- To remove a case from an analysis, first back up the data file, then delete either the values or the cases which are MVOs, re-run the MLR analysis, and compare the results with the previous MLR. Depending on the difference, decide which analysis to include in the results.
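These steps can be approximated outside SPSS as in the sketch below, which continues the simulated data and objects (data, predictors, model, MD) from the earlier sketches: cases whose MD exceeds the critical χ2 value (p = .001, df = number of predictors) are flagged, and the MLR is refitted without them so the two solutions can be compared.

```python
# Minimal sketch: flag cases whose MD exceeds the critical chi-square value and
# compare the MLR with and without them. Assumes `data`, `predictors`, `model`
# (the full-sample fit), and the "MD" column from the earlier sketches.
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2

critical_md = chi2.ppf(1 - 0.001, df=len(predictors))   # e.g., 13.82 for 2 predictors
flagged = data["MD"] > critical_md
print(f"Critical MD = {critical_md:.2f}; {flagged.sum()} case(s) flagged")

trimmed = data[~flagged]
trimmed_model = sm.OLS(trimmed["y"], sm.add_constant(trimmed[predictors])).fit()

comparison = pd.DataFrame({"with MVOs": model.params, "without MVOs": trimmed_model.params})
print(comparison.round(3))
print(f"R-squared: {model.rsquared:.3f} vs. {trimmed_model.rsquared:.3f}")
# If the results barely change, retain the cases; if they change noticeably,
# consider which solution is more valid, or report both.
```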
| Degrees of freedom (df) | χ2 value (p = .001)[1] |
|---|---|
| 1 | 10.83 |
| 2 | 13.82 |
| 3 | 16.27 |
| 4 | 18.47 |
| 5 | 20.52 |
| 6 | 22.46 |
| 7 | 24.32 |
| 8 | 26.12 |
| 9 | 27.88 |
| 10 | 29.59 |
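These tabled values can be reproduced (to two decimal places) with most statistics packages; for example, a quick check in Python with scipy:

```python
# Reproduce the critical chi-square values at p = .001 for df = 1 to 10.
from scipy.stats import chi2

for df in range(1, 11):
    print(df, round(chi2.ppf(1 - 0.001, df), 2))
# Output matches the table above (10.83, 13.82, ..., 29.59).
```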
Cook's distance
Cook's distance (Cook's D) provides another test statistic for examining multivariate outliers. The higher the D, the more influential the point is. The lowest value that D can assume is zero.
There are varying criteria for what cut-off to use for identifying MVOs using Cook's D (i.e., is D for any case above these critical values? If yes, these case(s) may be overly influential; see the sketch below):
- Critical value = 1
- Critical value = 3 times the mean D value
- Critical value = 4 divided by n (where n is the sample size)
For more info, see Cook's distance (Wikipedia)
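As a rough illustration, continuing the simulated data and the hypothetical cooks_D column from the earlier sketch, the three cut-offs above could be checked as follows:

```python
# Minimal sketch: apply the three common Cook's D cut-offs to the saved values.
# Assumes `data` with a "cooks_D" column, as in the earlier sketch.
d = data["cooks_D"]
n = len(data)

cutoffs = {
    "D > 1": 1.0,
    "D > 3 * mean(D)": 3 * d.mean(),
    "D > 4 / n": 4 / n,
}

for label, cutoff in cutoffs.items():
    print(f"{label:<16} cut-off = {cutoff:.3f}, cases flagged: {(d > cutoff).sum()}")
# Cases flagged by one or more criteria warrant the same closer inspection and
# with/without comparison described above for Mahalanobis' distance.
```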
References
- [1] Chi-Squared Test Table B.2. Dr. Jacqueline S. McLaughlin, The Pennsylvania State University. In turn citing: R. A. Fisher and F. Yates, Statistical Tables for Biological, Agricultural and Medical Research, 6th ed., Table IV.
External links
- SPSS regression diagnostics (UCLA statistical consulting)