Outlier Detection in Machine Learning

What is an outlier?

An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.
Observed variables often contain outliers that have unusually large or small values compared with the others in a data set. Some data sets come from homogeneous groups; others come from heterogeneous groups with different characteristics for a given variable, such as height data not stratified by gender. Outliers can be caused by incorrect measurements, including data entry errors, or by observations coming from a different population than the rest of the data. If the measurement is correct, the outlier represents a rare event. Two aspects of outliers can be considered. The first is their harmful effect on statistical analysis:
  1. Outliers generally serve to increase error variance and reduce the power of statistical tests.
  2. If non-randomly distributed, they can decrease normality, altering the odds of making both Type I and Type II errors.
  3. They can seriously bias or influence estimates that may be of substantive interest.
The second aspect of outliers is that they can provide useful information about the data when we look into an unusual response in a given study. They may be extreme values sitting apart from the majority of the data, regardless of distribution assumptions.
In summary, there are two reasons for detecting outliers. The first is to find outliers that influence the assumptions of a statistical test (for example, outliers violating the normality assumption of an ANOVA) and to deal with them properly in order to improve the statistical analysis. This can be considered a preliminary step for data analysis. The second is to use the outliers themselves to obtain certain critical information about the data, as discussed above.
There are two kinds of outlier detection methods: formal tests and informal tests, usually called tests of discordancy and outlier labeling methods, respectively. Most formal tests require a test statistic for hypothesis testing. They are usually based on the assumption of some well-behaved distribution and test whether the target extreme value is an outlier of that distribution, i.e., whether or not it deviates from the assumed distribution. Some tests are designed for a single outlier and others for multiple outliers. Most outlier labeling methods (informal tests), on the other hand, generate an interval or criterion for outlier detection instead of performing hypothesis testing, and any observation beyond the interval or criterion is considered an outlier. There are two reasons for using an outlier labeling method. One is to find possible outliers as a screening device before conducting a formal test. The other is to find the extreme values away from the majority of the data regardless of the distribution. For a large data set that is statistically problematic, e.g., when it is difficult to identify the distribution of the data or to transform it into a proper distribution such as the normal distribution, labeling methods can be used to detect outliers.

When data are highly skewed or otherwise depart from a normal distribution, transformation to normality is a common step taken so that outliers can be identified with a method that is effective for the normal distribution. Such a transformation can be useful when the identification of outliers is a preliminary step for data analysis, and it also helps in selecting appropriate statistical procedures for estimation and testing. However, if an outlier itself is a primary concern in a given study, transforming the data could affect our ability to identify outliers.


OUTLIER LABELING METHOD

STANDARD DEVIATION (SD) METHOD

The simple classical approach to screening outliers is the SD (standard deviation) method. It is defined as

  • 2 SD Method: x̄ ± 2 SD
  • 3 SD Method: x̄ ± 3 SD
where x̄ is the sample mean and SD is the sample standard deviation. Observations outside these intervals may be considered outliers.
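As a concrete illustration, the intervals above can be computed directly; the data set and helper name below are illustrative, not from the text:

```python
from statistics import mean, stdev

def sd_outliers(data, k=3):
    """Flag observations outside x_bar +/- k*SD (k = 2 or 3)."""
    m, s = mean(data), stdev(data)  # sample mean and sample SD
    return [x for x in data if x < m - k * s or x > m + k * s]

data = [9, 10, 10, 11, 10, 9, 11, 10, 50]
print(sd_outliers(data, k=2))  # [50] -- the extreme value is flagged
print(sd_outliers(data, k=3))  # []   -- 50 escapes: it inflates the SD itself
```

Note how the 3 SD rule misses the obvious outlier in this small sample: the extreme value inflates the very standard deviation used to screen it. This masking effect is exactly the weakness that robust methods further below are designed to avoid.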

Z-SCORE

Another method that can be used to screen data for outliers is the Z-score, computed from the mean and standard deviation. The basic idea of this rule is that if X follows a normal distribution N(μ, σ²), then Z = (X − μ)/σ follows the standard normal distribution N(0, 1); in practice the sample mean x̄ and sample standard deviation s replace μ and σ. Z-scores that exceed 3 in absolute value are generally considered outliers.
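A minimal sketch of the rule (the data set is illustrative; the |Z| > 3 cutoff is the one mentioned above):

```python
from statistics import mean, stdev

def z_outliers(data, threshold=3.0):
    """Return observations whose Z-score exceeds the threshold in absolute value."""
    m, s = mean(data), stdev(data)
    return [x for x in data if abs((x - m) / s) > threshold]

data = [9, 10, 11, 10, 9, 11, 10, 10, 9, 11, 10, 9, 11, 10, 100]
print(z_outliers(data))  # [100]
```

One caveat worth knowing: with n observations the largest possible |Z| is (n − 1)/√n, so no point can ever exceed 3 unless n ≥ 11, which makes the rule ineffective for very small samples.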

MODIFIED Z-SCORE


Two estimators used in the Z-score, the sample mean and sample standard deviation, can be strongly affected by a few extreme values, or even by a single extreme value. To avoid this problem, the modified Z-score uses the median and the median of the absolute deviations from the median (MAD) in place of the sample mean and standard deviation.
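A sketch of the modified Z-score, assuming the common formulation Mi = 0.6745 (xi − median)/MAD with a cutoff of 3.5; these constants follow Iglewicz and Hoaglin's widely used recommendation and are not stated in the text above:

```python
from statistics import median

def modified_z_outliers(data, threshold=3.5):
    """Flag observations whose modified Z-score exceeds the threshold in absolute value."""
    med = median(data)
    mad = median([abs(x - med) for x in data])  # median absolute deviation from the median
    # 0.6745 rescales MAD to be comparable to the SD under normality
    return [x for x in data if abs(0.6745 * (x - med) / mad) > threshold]

data = [9, 10, 10, 11, 10, 9, 11, 10, 50]
print(modified_z_outliers(data))  # [50]
```

Unlike the plain Z-score on the same nine observations, the median and MAD are untouched by the single extreme value, so 50 is flagged cleanly.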

TUKEY’S METHOD (BOXPLOT) 


Tukey’s (1977) method, constructing a boxplot, is a well-known, simple graphical tool for displaying information about continuous univariate data, such as the median, lower quartile, upper quartile, lower extreme, and upper extreme of a data set. It is less sensitive to extreme values than the previous methods based on the sample mean and standard deviation because it uses quartiles, which are resistant to extreme values. The rules of the method are as follows:
  1. The IQR (Inter Quartile Range) is the distance between the lower (Q1) and upper (Q3) quartiles.
  2. Inner fences are located at a distance of 1.5 IQR below Q1 and above Q3 [Q1 − 1.5 IQR, Q3 + 1.5 IQR].
  3. Outer fences are located at a distance of 3 IQR below Q1 and above Q3 [Q1 − 3 IQR, Q3 + 3 IQR].
  4. A value between the inner and outer fences is a possible outlier. An extreme value beyond the outer fences is a probable outlier.
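The four rules can be sketched directly. One caveat: quartile conventions vary, and `statistics.quantiles` below uses its default "exclusive" method, which is not identical to Tukey's hinges, so fence positions can differ slightly between implementations:

```python
from statistics import quantiles

def tukey_classify(data):
    """Split data into possible outliers (between the inner and outer
    fences) and probable outliers (beyond the outer fences)."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    inner_lo, inner_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_lo, outer_hi = q1 - 3.0 * iqr, q3 + 3.0 * iqr
    possible = [x for x in data
                if outer_lo <= x < inner_lo or inner_hi < x <= outer_hi]
    probable = [x for x in data if x < outer_lo or x > outer_hi]
    return possible, probable

data = [9, 10, 10, 11, 10, 9, 11, 10, 50]
print(tukey_classify(data))  # ([], [50])
```

Because the fences are built from quartiles alone, the single extreme value cannot drag them outward the way it inflates the mean and SD in the earlier methods.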


While the previous methods are limited to mound-shaped and reasonably symmetric data such as the normal distribution, Tukey’s method is applicable to skewed or non-mound-shaped data, since it makes no distributional assumptions and does not depend on a mean or standard deviation. However, Tukey’s method may not be appropriate for small sample sizes.


ADJUSTED BOXPLOTS

Although the boxplot proposed by Tukey (1977) may be applicable to both symmetric and skewed data, the more skewed the data, the more observations may be detected as outliers. This results from the fact that the method is based on robust measures such as the lower and upper quartiles and the IQR without considering the skewness of the data. Vanderviere and Huber (2004) introduced an adjusted boxplot taking into account the medcouple (MC), a robust measure of skewness for a skewed distribution.
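A sketch of the idea, with a naive O(n²) medcouple and a commonly used fence calibration. The exponential factors (e^(−4·MC), e^(3·MC), and their mirror images for negative MC) are assumptions taken from Hubert and Vandervieren's later formulation, not from the text above, and ties at the median are simply skipped here rather than given the formal sign kernel:

```python
from math import exp
from statistics import median, quantiles

def medcouple(data):
    """Naive medcouple: median of h(xi, xj) = ((xj - m) - (m - xi)) / (xj - xi)
    over pairs with xi <= m <= xj, skipping pairs where xi == xj."""
    xs = sorted(data)
    m = median(xs)
    lower = [x for x in xs if x <= m]
    upper = [x for x in xs if x >= m]
    h = [((xj - m) - (m - xi)) / (xj - xi)
         for xi in lower for xj in upper if xi != xj]
    return median(h)

def adjusted_fences(data):
    """Skewness-adjusted boxplot fences."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    mc = medcouple(data)
    if mc >= 0:
        return q1 - 1.5 * exp(-4 * mc) * iqr, q3 + 1.5 * exp(3 * mc) * iqr
    return q1 - 1.5 * exp(-3 * mc) * iqr, q3 + 1.5 * exp(4 * mc) * iqr
```

For symmetric data MC = 0 and the fences coincide with Tukey’s; for right-skewed data MC > 0, which pushes the upper fence out and pulls the lower fence in, so fewer legitimately large values are mislabeled as outliers.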

MEDIAN RULE


The median is a robust estimator of location with an approximately 50% breakdown point. It is the value that falls exactly in the center of the data when the data are arranged in order. For a skewed distribution such as income data, the median is often used to describe the average of the data. In a symmetric distribution, the median and mean have the same value.
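The labeling rule itself is not stated above; a commonly cited version (Carling, 2000) flags observations outside median ± 2.3 IQR, and the 2.3 constant here is taken from that work, not from this text. A sketch, together with a quick demonstration of the median's robustness compared with the mean:

```python
from statistics import mean, median, quantiles

def median_rule_outliers(data, k=2.3):
    """Flag observations outside median +/- k * IQR (k = 2.3 per Carling, 2000)."""
    q1, _, q3 = quantiles(data, n=4)
    med = median(data)
    iqr = q3 - q1
    return [x for x in data if x < med - k * iqr or x > med + k * iqr]

data = [9, 10, 10, 11, 10, 9, 11, 10, 50]
print(median_rule_outliers(data))  # [50]

# Robustness: a single corrupted value drags the mean far off, but the
# median (50% breakdown point) is unaffected.
print(mean([1, 2, 3, 4, 500]), median([1, 2, 3, 4, 500]))
```

Centering the interval on the median rather than midway between the quartiles makes the rule slightly less sensitive than Tukey's fences to asymmetry in the middle half of the data.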
