Outlier Detection - Python

Welcome to the “Outlier Detection” practical session.

We are using both Pandas (data loading, processing, transformation and manipulation) and Scikit-learn (example data source, ML and statistical analysis)

Outlier Detection

This example illustrates - How to identify outliers. - How to deal with outliers.

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from scipy import stats

# Load the dataset
# For this practical we use the breast cancer data set.  
data = datasets.load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Let's look at the first few rows of the dataframe
print(df)

     mean radius  mean texture  ...  worst symmetry  worst fractal dimension
0          17.99         10.38  ...          0.4601                  0.11890
1          20.57         17.77  ...          0.2750                  0.08902
2          19.69         21.25  ...          0.3613                  0.08758
3          11.42         20.38  ...          0.6638                  0.17300
4          20.29         14.34  ...          0.2364                  0.07678
..           ...           ...  ...             ...                      ...
564        21.56         22.39  ...          0.2060                  0.07115
565        20.13         28.25  ...          0.2572                  0.06637
566        16.60         28.08  ...          0.2218                  0.07820
567        20.60         29.33  ...          0.4087                  0.12400
568         7.76         24.54  ...          0.2871                  0.07039

[569 rows x 30 columns]

Data Description

This is an analysis of the Breast Cancer Wisconsin (Diagnostic) DataSet, obtained from Kaggle. This data set was created by Dr. William H. Wolberg, physician at the University Of Wisconsin Hospital at Madison, Wisconsin,USA. To create the dataset Dr. Wolberg used fluid samples, taken from patients with solid breast masses and an easy-to-use graphical computer program called Xcyt, which is capable of perform the analysis of cytological features based on a digital scan. The program uses a curve-fitting algorithm, to compute ten features from each one of the cells in the sample, than it calculates the mean value, extreme value and standard error of each feature for the image, returning a 30 real-valuated vector

Attribute Information:

ID number
Diagnosis (M = malignant, B = benign) 3-32) Ten real-valued features are computed for each cell nucleus:

radius (mean of distances from center to points on the perimeter)
texture (standard deviation of gray-scale values)
perimeter
area
smoothness (local variation in radius lengths)
compactness (perimeter^2 / area - 1.0)
concavity (severity of concave portions of the contour)
concave points (number of concave portions of the contour)
symmetry
fractal dimension (“coastline approximation” - 1)
The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.

Strategy 1 : Outlier Removal

We will visualizes the original data, including outliers, using a boxplot. Each box in the boxplot represents one column (or feature) in the data, with outliers shown as points that are distant from the main body of the box.

The code then creates a new DataFrame, df_o, which is the same as the original DataFrame df but with all rows containing any outliers removed. The Z-scores for the new DataFrame are then calculated, to confirm that all outliers have been removed.

The Final code block visualizes the new, outlier-free data using another boxplot. This can be compared with the original boxplot to see the effect of removing the outliers.

#### Strategy 1 : Removing Duplicates


# Outlier detection - We will use Z-score function defined in scipy library to detect the outliers
# We will calculate the Z-score for each value in the DataFrame, and identifies any values with a Z-score greater than 3 as outliers.
z = np.abs(stats.zscore(df))
print('\nZ-score array:\n', z)
#print(df)

# Define a threshold to identify an outlier


Z-score array:
      mean radius  mean texture  ...  worst symmetry  worst fractal dimension
0       1.097064      2.073335  ...        2.750622                 1.937015
1       1.829821      0.353632  ...        0.243890                 0.281190
2       1.579888      0.456187  ...        1.152255                 0.201391
3       0.768909      0.253732  ...        6.046041                 4.935010
4       1.750297      1.151816  ...        0.868353                 0.397100
..           ...           ...  ...             ...                      ...
564     2.110995      0.721473  ...        1.360158                 0.709091
565     1.704854      2.085134  ...        0.531855                 0.973978
566     0.702284      2.045574  ...        1.104549                 0.318409
567     1.838341      2.336457  ...        1.919083                 2.219635
568     1.808401      1.221792  ...        0.048138                 0.751207

[569 rows x 30 columns]

threshold = 3
#print('\nOutliers:\n', np.where(z > threshold))

# Visualizing the outliers using boxplot
plt.figure(figsize=(20,10))

# df_m = df.melt( )
# g = sns.FacetGrid(df_m, col="variable", height=4, aspect=.5, sharey=False )
# g.map(sns.boxplot, "variable", "value")

sns.boxplot(data=df)
plt.title('Outlier visualization', fontsize=20)
plt.xticks(rotation=90)
#plt.show()

## Removing the outliers

([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], [Text(0, 0, 'mean radius'), Text(1, 0, 'mean texture'), Text(2, 0, 'mean perimeter'), Text(3, 0, 'mean area'), Text(4, 0, 'mean smoothness'), Text(5, 0, 'mean compactness'), Text(6, 0, 'mean concavity'), Text(7, 0, 'mean concave points'), Text(8, 0, 'mean symmetry'), Text(9, 0, 'mean fractal dimension'), Text(10, 0, 'radius error'), Text(11, 0, 'texture error'), Text(12, 0, 'perimeter error'), Text(13, 0, 'area error'), Text(14, 0, 'smoothness error'), Text(15, 0, 'compactness error'), Text(16, 0, 'concavity error'), Text(17, 0, 'concave points error'), Text(18, 0, 'symmetry error'), Text(19, 0, 'fractal dimension error'), Text(20, 0, 'worst radius'), Text(21, 0, 'worst texture'), Text(22, 0, 'worst perimeter'), Text(23, 0, 'worst area'), Text(24, 0, 'worst smoothness'), Text(25, 0, 'worst compactness'), Text(26, 0, 'worst concavity'), Text(27, 0, 'worst concave points'), Text(28, 0, 'worst symmetry'), Text(29, 0, 'worst fractal dimension')])

df_o = df[(z < threshold).all(axis=1)]

# Checking if the outliers have been removed
z = np.abs(stats.zscore(df_o))
print('\nAfter removing outliers:\n', np.where(z > threshold))

# Visualizing the data without outliers


After removing outliers:
 (array([  6,  10,  10,  13,  17,  17,  18,  19,  19,  21,  25,  29,  29,
        51,  51,  52,  63,  63,  97, 121, 122, 125, 134, 134, 134, 134,
       135, 137, 137, 137, 141, 141, 141, 146, 155, 167, 176, 176, 178,
       178, 182, 202, 202, 202, 205, 205, 207, 207, 210, 210, 210, 210,
       212, 212, 216, 217, 217, 229, 229, 229, 229, 229, 229, 232, 255,
       255, 255, 257, 257, 257, 281, 284, 314, 316, 316, 316, 316, 316,
       316, 333, 335, 335, 363, 381, 388, 388, 402, 402, 405, 405, 406,
       408, 415, 421, 421, 421, 421, 425, 440, 440, 453, 453, 453, 460,
       460, 471, 471, 471, 491, 491, 491, 491, 491, 491, 492, 492, 492]), array([28, 25, 29, 23,  8, 28, 23, 25, 29, 25, 25, 10, 12, 15, 25, 18, 10,
       13, 29, 19, 19,  8, 10, 12, 13, 17, 23,  0,  3, 23, 10, 12, 13, 14,
       14, 28, 12, 17, 18, 28, 13, 16, 19, 29, 11, 14, 16, 26,  6, 10, 13,
       16, 26, 29, 12,  5, 19,  3, 10, 12, 13, 22, 23, 14, 10, 12, 13, 10,
       12, 13, 18, 18, 12,  7, 10, 12, 13, 22, 23, 14,  6,  7,  8, 11, 15,
       19, 15, 29, 15, 19, 14, 11, 16, 15, 16, 17, 19, 28,  9, 19,  4, 14,
       18, 11, 17, 14, 15, 16,  5,  6,  7, 12, 13, 17, 10, 12, 13]))

plt.figure(figsize=(20,10))
sns.boxplot(data=df_o)
plt.title('After outlier removal', fontsize=20)
plt.xticks(rotation=90)
#plt.show()

([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], [Text(0, 0, 'mean radius'), Text(1, 0, 'mean texture'), Text(2, 0, 'mean perimeter'), Text(3, 0, 'mean area'), Text(4, 0, 'mean smoothness'), Text(5, 0, 'mean compactness'), Text(6, 0, 'mean concavity'), Text(7, 0, 'mean concave points'), Text(8, 0, 'mean symmetry'), Text(9, 0, 'mean fractal dimension'), Text(10, 0, 'radius error'), Text(11, 0, 'texture error'), Text(12, 0, 'perimeter error'), Text(13, 0, 'area error'), Text(14, 0, 'smoothness error'), Text(15, 0, 'compactness error'), Text(16, 0, 'concavity error'), Text(17, 0, 'concave points error'), Text(18, 0, 'symmetry error'), Text(19, 0, 'fractal dimension error'), Text(20, 0, 'worst radius'), Text(21, 0, 'worst texture'), Text(22, 0, 'worst perimeter'), Text(23, 0, 'worst area'), Text(24, 0, 'worst smoothness'), Text(25, 0, 'worst compactness'), Text(26, 0, 'worst concavity'), Text(27, 0, 'worst concave points'), Text(28, 0, 'worst symmetry'), Text(29, 0, 'worst fractal dimension')])

print(df.shape)

(569, 30)

print(df_o.shape)

(495, 30)

mean compactness mean concavity mean concave points mean symmetry
0 0.27760 0.30010 0.14710 0.2419
1 0.07864 0.08690 0.07017 0.1812
2 0.15990 0.19740 0.12790 0.2069
3 0.28390 0.24140 0.10520 0.2597
4 0.13280 0.19800 0.10430 0.1809
.. … … … …
564 0.11590 0.24390 0.13890 0.1726
565 0.10340 0.14400 0.09791 0.1752
566 0.10230 0.09251 0.05302 0.1590
567 0.27700 0.35140 0.15200 0.2397
568 0.04362 0.00000 0.00000 0.1587

Strategy 2 : Winsorize Method

In the Winsorize Method, we limit outliers with an upper and lower limit. We will set the limits. We will make our upper and lower limits for data our new maximum and minimum points.

This data set consists of 345 observations and 6 predictors representing the blood test result liver disorder status of 345 patients. The three predictors are mean corpuscular volume (MCV), alkaline phosphotase (ALKPHOS), alamine aminotransferase (SGPT), aspartate aminotransferase (SGOT), gamma-glutamyl transpeptidase (GAMMAGT), and the number of alcoholic beverage drinks per day (DRINKS).

For this practical we use the BUPA Liver data set.

You can download the dataset through the link:

bupa.csv

#### Strategy 2 : Winsorize Method


df_bupa = pd.read_csv('bupa.csv')
df_bupa = df_bupa.dropna()

sns.boxplot( x = df_bupa['drinks'] )
#sns.distplot( x = df_bupa['drinks'], bins = 10, kde = False )


## For outliers, 
##   our upper limit is 0 and 
##   our lower limit is 12.
##
## For the Winsorize Method, we have to import winsorize from Scipy. 
## We need boundaries to apply winsorize. 
## We will limit our data between 53 and 63. These values are within limits for outliers. 
## We need precise points of these values in the percentile and we can use the quantile method of Pandas.


from scipy.stats.mstats import winsorize

print( "Lower Limit : " + str(df_bupa['drinks'].quantile(0.05)))

Lower Limit : 0.5

print( "Upper Limit : " + str(df_bupa['drinks'].quantile(0.95)))

Upper Limit : 10.0

Lower Limit : 0.5
Upper Limit : 10.0

We will create a new variable with Winsorize Method. To implement the Winsorize Method, we write the exact boundary points as a tuple on the percentile. For example, we will write (0.05, 0.05). This means we want to apply quantile(0.05) and quantile(0.95) as a boundary. The first one is the exact point on percentile from the beginning, the second one is exact point on percentile from the end.

df_bupa_drinks = winsorize(df_bupa['drinks'], (0.05, 0.05) )
sns.boxplot( df_bupa_drinks )

Your Task

Question 1: Remove outliers in the breast cancer dataset by applying Z-score threshold = 1.

1.1 How many samples are removed after the outlier removal ?

Question 2: Apply Winsorize method threshold of lower limit = 0.10 and upper limit = 0.80 on BUPA Liver data set.

Use the "drinks" feature.
2.1 What is the shape of the dataset after applying Winsorize method ?

Question 1: Remove outliers in the breast cancer dataset by applying Z-score threshold = 1.

## Question 1: Remove outliers in the breast cancer dataset by applying Z-score threshold = 1.


z = np.abs(stats.zscore(df))
#print('\nZ-score array:\n', z)

print(df.shape)
# Define a threshold to identify an outlier

(569, 30)

threshold = 1
#print('\nOutliers:\n', np.where(z > threshold))

# Visualizing the outliers using boxplot
plt.figure(figsize=(20,10))
sns.boxplot(data=df)
plt.title('Outlier visualization', fontsize=20)
plt.xticks(rotation=90)

([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], [Text(0, 0, 'mean radius'), Text(1, 0, 'mean texture'), Text(2, 0, 'mean perimeter'), Text(3, 0, 'mean area'), Text(4, 0, 'mean smoothness'), Text(5, 0, 'mean compactness'), Text(6, 0, 'mean concavity'), Text(7, 0, 'mean concave points'), Text(8, 0, 'mean symmetry'), Text(9, 0, 'mean fractal dimension'), Text(10, 0, 'radius error'), Text(11, 0, 'texture error'), Text(12, 0, 'perimeter error'), Text(13, 0, 'area error'), Text(14, 0, 'smoothness error'), Text(15, 0, 'compactness error'), Text(16, 0, 'concavity error'), Text(17, 0, 'concave points error'), Text(18, 0, 'symmetry error'), Text(19, 0, 'fractal dimension error'), Text(20, 0, 'worst radius'), Text(21, 0, 'worst texture'), Text(22, 0, 'worst perimeter'), Text(23, 0, 'worst area'), Text(24, 0, 'worst smoothness'), Text(25, 0, 'worst compactness'), Text(26, 0, 'worst concavity'), Text(27, 0, 'worst concave points'), Text(28, 0, 'worst symmetry'), Text(29, 0, 'worst fractal dimension')])

plt.show()

# Removing the outliers

df_o = df[(z < threshold).all(axis=1)]
print(df_o.shape)
# Checking if the outliers have been removed

(42, 30)

z = np.abs(stats.zscore(df_o))
#print('\nAfter removing outliers:\n', np.where(z > threshold))

# Visualizing the data without outliers
plt.figure(figsize=(20,10))
sns.boxplot(data=df_o)
plt.title('After outlier removal', fontsize=20)
plt.xticks(rotation=90)

([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], [Text(0, 0, 'mean radius'), Text(1, 0, 'mean texture'), Text(2, 0, 'mean perimeter'), Text(3, 0, 'mean area'), Text(4, 0, 'mean smoothness'), Text(5, 0, 'mean compactness'), Text(6, 0, 'mean concavity'), Text(7, 0, 'mean concave points'), Text(8, 0, 'mean symmetry'), Text(9, 0, 'mean fractal dimension'), Text(10, 0, 'radius error'), Text(11, 0, 'texture error'), Text(12, 0, 'perimeter error'), Text(13, 0, 'area error'), Text(14, 0, 'smoothness error'), Text(15, 0, 'compactness error'), Text(16, 0, 'concavity error'), Text(17, 0, 'concave points error'), Text(18, 0, 'symmetry error'), Text(19, 0, 'fractal dimension error'), Text(20, 0, 'worst radius'), Text(21, 0, 'worst texture'), Text(22, 0, 'worst perimeter'), Text(23, 0, 'worst area'), Text(24, 0, 'worst smoothness'), Text(25, 0, 'worst compactness'), Text(26, 0, 'worst concavity'), Text(27, 0, 'worst concave points'), Text(28, 0, 'worst symmetry'), Text(29, 0, 'worst fractal dimension')])

plt.show()

Question 2: Apply Winsorize method threshold of lower limit = 0.10 and upper limit = 0.80 on BUPA Liver data set

2.1 What is the shape of the dataset after applying Winsorize method ?

df_bupa_winsorize = df_bupa.copy()
df_bupa_winsorize['drinks'] = winsorize(df_bupa_winsorize['drinks'], (0.10, 0.10) )
#df_bupa_winsorize['drinks']


## Answer 2.1 
##   winsorize clips the outlier data points to the upper and lower limits.