![](https://crypto4nerd.com/wp-content/uploads/2023/09/1_DnCTffkryYBFxZHkos2Ww-1024x745.jpeg)
Let’s understand feature scaling with the help of the example above. Before scaling, the independent variables ‘Age’ and ‘EstimatedSalary’ have very different ranges: ‘Age’ spans 35 to 79, while ‘EstimatedSalary’ spans 36,000 to 90,000. This large difference in feature ranges can pose a problem for certain machine learning and deep learning algorithms, so we scale our data to avoid it. As you can see, after scaling, both features fall within a very similar range.
Common Feature Scaling Techniques:
We’ve seen the concept of feature scaling; now let’s look at some techniques for achieving it. There are two main techniques of feature scaling: Standardization and Normalization. Let’s understand both of them in depth.
Standardization (Z-Score Normalization):
Standardization, also known as Z-score normalization, is a technique that transforms the features of a dataset to have a mean of 0 and a standard deviation of 1.
In this process, we subtract the mean of a feature from each of its values and divide the result by the feature’s standard deviation. We apply this formula to all the values of the feature to scale it.
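Here is a minimal NumPy sketch of that formula, using a made-up array of ages purely for illustration:

import numpy as np

# Made-up 'Age' values, for illustration only
age = np.array([35.0, 42.0, 50.0, 61.0, 79.0])

# z = (x - mean) / standard deviation
age_standardized = (age - age.mean()) / age.std()

print(age_standardized)         # values centered around 0
print(age_standardized.mean())  # ~0
print(age_standardized.std())   # 1.0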
- Advantages of Standardization: The mean of the standardized feature is always 0, and the standard deviation is always 1, making it easier to interpret the data. Algorithms like support vector machines, K-means clustering, and principal component analysis often perform better when features are standardized.
- When to Use Standardization: Use it when the features in the dataset have different units of measurement or significantly different scales, and when the algorithm you are working with relies on distances between data points, as clustering algorithms do.
Normalization:
Normalization is a technique used in feature scaling to transform the features of a dataset into a specific range. The purpose of normalization is to ensure that each feature contributes equally to the computation of distances and similarities in machine learning models. We typically apply normalization when we know the minimum and maximum values of a feature. Below are the main types of normalization; a short code sketch illustrating the first four follows the list.
1. Min-Max Scaling: Here we calculate the minimum and maximum values of the feature X in the dataset and rescale each value of that feature as (x - min) / (max - min), which maps the feature into the range [0, 1].
It ensures that all features are on a similar scale, preventing any single feature from dominating the learning process. Min-max scaling can be used when working with KNN, K-means, artificial neural networks, and gradient descent.
2. Mean Normalization: In this technique, we transform the features so that they have a mean (average) of zero and typically fall within a range of -1 to 1, rescaling each value as (x - mean) / (max - min). This technique is also known as zero-mean normalization.
We use mean normalization when we want to center the data around zero, making it easier to interpret and work with in various analytical and modeling contexts.
3. Max Absolute Scaling: This technique is similar to Min-Max scaling, but instead of using the minimum and maximum, it scales each value of a feature by the feature’s maximum absolute value, i.e. x / max(|x|).
It preserves the sign (positive or negative) of the data while scaling the values, maintaining the relative relationships between the data points. We use this technique when the feature contains many zeros, i.e. sparse data.
4. Robust Scaling: In this technique, we scale a feature using robust statistics, subtracting the median and dividing by the interquartile range (IQR), which makes it resistant to the influence of outliers. Outliers can significantly affect other scaling methods like Min-Max scaling and Z-score normalization, but robust scaling mitigates this issue.
5. Log Transformation: Log transformation is a mathematical operation applied to a set of data to transform it using the natural logarithm. With the help of log transformation, we can handle skewness in the data or stabilize its variance. This technique is often applied to financial data such as stock prices or growth rates. Refer to the code block below to understand the implementation of log transformation in Python.
import numpy as np

# Sample data
data = np.array([1, 10, 100, 1000])
# Applying log transformation
log_data = np.log(data)
print("Original data:", data)
print("Log-transformed data:", log_data)
Implementing Feature Scaling
Up to this point, we have explored various techniques of feature scaling. Now, let’s implement one of these techniques using sklearn. We will focus on standardization and use graphs to reinforce our understanding of feature scaling. We will use the same data that we saw at the beginning of the article.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('social_network_ads.csv') # load the data
# train test split
x_train, x_test, y_train, y_test = train_test_split(df.drop('Purchased', axis=1),
                                                     df['Purchased'],
                                                     test_size=0.3,
                                                     random_state=0)
# create an instance of StandardScaler
scaler = StandardScaler()
# learn the mean and standard deviation from the x_train data
scaler.fit(x_train)
# transform the train and test data
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
First, we load the data and then split it into x_train, x_test, y_train, and y_test. Next, we create an instance of StandardScaler. By calling the ‘fit’ method on the x_train data, the scaler object learns the mean and standard deviation of the training set. During the transformation, the scaler applies the scaling formula using this learned mean and standard deviation, storing the scaled data in x_train_scaled and x_test_scaled. Note that the test data is scaled with the statistics learned from the training data, which keeps the two sets consistent.
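As a quick sanity check, continuing from the code above, you can inspect what the scaler learned and confirm that the scaled training data now has a mean of roughly 0 and a standard deviation of roughly 1 per feature:

# mean and standard deviation learned from x_train during fit()
print(scaler.mean_)
print(scaler.scale_)

# the scaled training data should have mean ~0 and std ~1 for each feature
print(x_train_scaled.mean(axis=0))
print(x_train_scaled.std(axis=0))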
Let’s understand the scatter plot of ‘EstimatedSalary vs. Age’ before and after scaling. At first glance, the scatter plot may appear unchanged, but upon careful observation of the axes of both plots, the difference becomes apparent. Prior to scaling, the range of the x-axis (EstimatedSalary) was approximately 10,000 to 150,000, and the range of the y-axis (Age) was around 20 to 60. After scaling, however, both axes fall within a small range of only a few units centered around zero. That was the main objective of feature scaling, and we have successfully achieved it!
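The plots themselves are straightforward to reproduce with matplotlib; here is a minimal sketch, assuming x_train contains only the ‘Age’ and ‘EstimatedSalary’ columns (in that order):

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 4))

# before scaling: original units
ax1.scatter(x_train['EstimatedSalary'], x_train['Age'])
ax1.set_title('Before scaling')
ax1.set_xlabel('EstimatedSalary')
ax1.set_ylabel('Age')

# after scaling: both axes are now in standard-deviation units
ax2.scatter(x_train_scaled[:, 1], x_train_scaled[:, 0])
ax2.set_title('After scaling')
ax2.set_xlabel('EstimatedSalary (scaled)')
ax2.set_ylabel('Age (scaled)')

plt.tight_layout()
plt.show()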
Now, let’s examine the distribution plot of ‘Age’ before and after scaling. You will observe that the data distribution remains unchanged after scaling. This preservation of the data’s shape and distribution is a critical property of standardization. This is why, in machine learning, standardization is a commonly used approach for solving a variety of problems.
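Similarly, you can compare the ‘Age’ distribution before and after scaling with two histograms, under the same assumptions as the sketch above:

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 4))

# distribution of 'Age' in its original units
ax1.hist(x_train['Age'], bins=20)
ax1.set_title("'Age' before scaling")

# same distribution after standardization: identical shape, different scale on the axis
ax2.hist(x_train_scaled[:, 0], bins=20)
ax2.set_title("'Age' after scaling")

plt.tight_layout()
plt.show()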
Conclusion
In conclusion, feature scaling is a fundamental preprocessing step in machine learning that significantly impacts the performance and stability of our models. Throughout this blog, we’ve explored the importance of feature scaling and various techniques such as Min-Max Scaling, Standardization, Robust Scaling, Log Transformation, and Max Absolute Scaling. Each technique comes with its own advantages, making it suitable for different scenarios based on the specific characteristics of the data. By understanding and implementing feature scaling effectively, we pave the way for better machine learning models that can unlock valuable insights and drive informed decisions in various domains. So, the next time you’re preparing your data for a machine learning project, remember the power of feature scaling and choose the appropriate technique based on your data’s unique characteristics.