To cap (clip) outliers in a column, you can use the clip method in pandas. It lets you set a minimum and maximum threshold for the values in a DataFrame or a specific column. Here’s an example that derives those thresholds from the interquartile range (IQR):
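def treat_outliers(df, col):
    """Cap values in df[col] at the box-plot whiskers (1.5 * IQR rule)."""
    Q1 = df[col].quantile(0.25)  # 25th percentile (first quartile)
    Q3 = df[col].quantile(0.75)  # 75th percentile (third quartile)
    IQR = Q3 - Q1  # interquartile range (75th percentile - 25th percentile)
    lower_whisker = Q1 - 1.5 * IQR
    upper_whisker = Q3 + 1.5 * IQR
    # Values beyond a whisker are replaced by that whisker's value
    df[col] = df[col].clip(lower=lower_whisker, upper=upper_whisker)
    return df

# Treat outliers in one column (note: the DataFrame passed in is modified)
data = treat_outliers(data, 'your column name')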
Clipping is simple, but it changes the distribution of your data: every value beyond a threshold is replaced by the threshold itself, so consider how that affects your analysis. If you need a more sophisticated approach, you might explore other techniques for handling outliers, such as z-scores, percentile-based limits, or more advanced statistical methods.
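As one illustration of the percentile-based alternative mentioned above, here is a minimal sketch. The helper name cap_at_percentiles, its default limits, and the sample values are all assumptions made for this example, not part of the original workflow.

```python
import pandas as pd

def cap_at_percentiles(series, lower_pct=0.01, upper_pct=0.99):
    """Hypothetical helper: cap values outside the given percentiles."""
    lower = series.quantile(lower_pct)
    upper = series.quantile(upper_pct)
    # Values below `lower` become `lower`; values above `upper` become `upper`
    return series.clip(lower=lower, upper=upper)

# Made-up sample data with one obvious outlier (100.0)
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

# Cap at the 10th and 90th percentiles (limits chosen only for illustration)
capped = cap_at_percentiles(values, lower_pct=0.10, upper_pct=0.90)
print(capped)
```

Unlike the 1.5 * IQR rule, the limits here are picked directly as percentiles, so a fixed share of the data is always capped whether or not those values are true outliers.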
Here is another example:
# Display the five largest values
data['total sulfur dioxide'].sort_values(ascending=False).head()
1081 289.0
1079 278.0
354 165.0
1244 160.0
651 155.0
Name: total sulfur dioxide, dtype: float64
# Cap the two extreme values at 165
data['total sulfur dioxide'] = data['total sulfur dioxide'].clip(upper=165)
The two rows with total sulfur dioxide greater than 165 are now set to 165.
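To check what a fixed cap will change before applying it, one option is to count the values above the cap first. The sketch below is self-contained and reuses only the five values listed above rather than the full dataset; the variable names so2, cap, and n_above are chosen for illustration.

```python
import pandas as pd

# The five largest values shown above, used here as a stand-in for the column
so2 = pd.Series([289.0, 278.0, 165.0, 160.0, 155.0], name="total sulfur dioxide")

cap = 165
n_above = int((so2 > cap).sum())  # number of values that the cap will change
capped = so2.clip(upper=cap)      # values at or below the cap are left as-is

print(n_above)
print(capped.tolist())
```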