![](https://crypto4nerd.com/wp-content/uploads/2023/07/18hnMKIdrBXv_HzwgvzYSOQ-1024x538.png)
Table of contents:
∘ Mathematical Transformers:
∘ Function transformer:
∘ How to find if data is normal?
∘ QQ Plots:
∘ Some more examples:
∘ Log Transform:
∘ When should I use the log transformation?
∘ Reciprocal transform:
∘ Square transformation:
∘ Example:
Mathematical Transformers:
The idea is very simple: you apply a mathematical formula to a column and transform its values into something else.
There are different types of transformations.
- Log transform
- Reciprocal transform
- Power (square-root) transform
- Box-Cox transform
- Yeo-Johnson transform
Now the question is: what happens after applying these transformations?
The raw data you get is usually not normally distributed. A transformation reshapes its distribution, i.e., its PDF (probability density function), so that it is closer to a normal distribution. That is the end goal of all of these transformations.
The reasons for making data more normal are:
- Assumptions of many algorithms: ML algorithms such as linear regression, logistic regression, and other parametric models tend to work best when the features (and, for linear regression, the residuals) are roughly normally distributed. Heavily skewed data can lead to biased estimates and suboptimal performance.
- Statistical inference: Normality simplifies and enhances reliability in statistical tests and confidence intervals, aiding data analysis and parameter inference.
- Model performance: Some models, e.g., k-nearest neighbors and support vector machines, may perform better with normally distributed data.
- Stability of estimates: Normality improves stability and generalization by reducing sensitivity to extreme outliers and skewed data.
- Central Limit Theorem: As the sample size increases, the distribution of the sample mean approaches a normal distribution, which underpins many statistical techniques used in ML.
Function transformer:
Our machine learning library is scikit-learn, and it provides three transformers for this purpose:
- Function transformer (FunctionTransformer, the most used)
- Power transformer (PowerTransformer)
- Quantile transformer (QuantileTransformer)
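As a rough illustration of the function transformer, here is a minimal sketch that wraps NumPy's `log1p` in scikit-learn's `FunctionTransformer`. The fare values are made up for the example, and choosing `log1p` (log of 1 + x) is an assumption made so that zeros would not break the transform.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical right-skewed feature; the values are made up for illustration.
X = np.array([[7.25], [8.05], [13.0], [26.55], [71.28], [512.33]])

# Wrap a plain NumPy function so it behaves like any other scikit-learn transformer.
# log1p computes log(1 + x); expm1 is its exact inverse.
log_transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

X_log = log_transformer.fit_transform(X)            # transformed values
X_back = log_transformer.inverse_transform(X_log)   # back to the original scale

print(X_log.ravel())
print(np.allclose(X, X_back))
```

Because the transformer follows the usual fit/transform interface, it can be dropped into a scikit-learn pipeline like any other preprocessing step.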
How to find if data is normal?
There are three ways to find out whether the data is normal:
1. By plotting a distribution plot using seaborn
sns.histplot(data, kde=True) (sns.distplot() in older seaborn versions)
This gives a visual idea of how close to normal the data is, for example whether it is very skewed.
2. By using pandas' skew function
df['column'].skew()
If the output is close to zero, the distribution is roughly symmetric, as a normal distribution is. If it is clearly positive or negative, the data is right-skewed or left-skewed respectively.
3. By plotting a QQ plot. This is the most common method for checking the distribution of the data, and it is reliable.
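The skewness check can be sketched as follows. The two columns are synthetic samples generated only for this example (one from a normal distribution, one from an exponential distribution), and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: one roughly symmetric column, one right-skewed column.
df = pd.DataFrame({
    "symmetric": rng.normal(loc=50, scale=10, size=5000),
    "right_skewed": rng.exponential(scale=10, size=5000),
})

# DataFrame.skew() returns one skewness value per numeric column.
skewness = df.skew()
print(skewness)
```

A value near zero suggests a symmetric column, while a clearly positive value points to a long right tail. Skewness alone does not prove normality, so it is best combined with a plot.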
QQ Plots:
The image on the left is a PDF (probability density function), and the other is a QQ plot.
The PDF shows that the dataset is normally distributed. For normally distributed data, all the points in the QQ plot fall on or very close to the 45-degree line, as you can see in the image below.
If the data is skewed, the points in the QQ plot deviate from the line; the stronger the skew, the larger the deviation.
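One way to compute the numbers behind a QQ plot is `scipy.stats.probplot`. The sketch below uses two synthetic samples (an assumption for illustration) and compares how well each one follows the straight line, using the correlation coefficient `r` that `probplot` returns along with the fitted line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic samples: one normal, one right-skewed.
normal_sample = rng.normal(size=2000)
skewed_sample = rng.exponential(size=2000)

# probplot returns ((theoretical_quantiles, ordered_values), (slope, intercept, r)).
# r measures how closely the points follow a straight line.
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print("r for normal sample:", r_normal)
print("r for skewed sample:", r_skewed)
```

To actually draw the plot, `probplot` also accepts a `plot` argument, for example `stats.probplot(normal_sample, dist="norm", plot=plt)` with `matplotlib.pyplot` imported as `plt`.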
Some more examples:
Log Transform:
Suppose you have an 'age' column in the data that contains different ages. To apply the log transformation to it, you take the log of each value.
The base of the log is up to you, for example base 10 or base 2. What happens when you take the log? The data becomes closer to normally distributed: not completely normal, but better than before.
When should I use the log transformation?
- The log transformation cannot be applied to negative values, because the log of a negative number is undefined. Zero is not allowed either.
- When you have right-skewed data, you can apply the log transformation. It compresses the long right tail and pulls the data toward the center.
The log transformation converts a multiplicative scale into an additive one: in base 10, the values 10, 100, and 1000 become 1, 2, and 3.
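The effect of the log transformation on right-skewed data can be sketched like this. The `fare` series is synthetic (drawn from a log-normal distribution purely as an assumption for the example), so the log of it is normal by construction; real data will rarely be this clean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic, strictly positive, right-skewed data (hypothetical "fare" column).
fare = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000), name="fare")

# Natural log of every value; only valid because all values are > 0.
fare_log = np.log(fare)

skew_before = fare.skew()
skew_after = fare_log.skew()
print("skew before:", skew_before)
print("skew after: ", skew_after)

# Equal multiplicative steps (x10) become equal additive steps (+1) in base 10.
steps = np.log10(np.array([10.0, 100.0, 1000.0]))
print(steps)
```

If the column can contain zeros, `np.log1p` (log of 1 + x) is a common substitute for `np.log`.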
Reciprocal transform:
In the reciprocal transformation (1/x), big values become small and small values become big. This transformation can only be used on non-zero values.
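A minimal sketch of the reciprocal transformation, again using `FunctionTransformer`. The input values are made up, and all of them are non-zero so that 1/x is defined.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Made-up positive values, sorted from small to large.
X = np.array([[0.5], [1.0], [4.0], [20.0]])

# 1/x is its own inverse, so the same function is used in both directions.
reciprocal = FunctionTransformer(func=np.reciprocal, inverse_func=np.reciprocal)

X_rec = reciprocal.fit_transform(X)
print(X_rec.ravel())
```

The ordering of the values is reversed: the largest input becomes the smallest output and vice versa.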
Square transformation:
The square transformation (x^2) is especially used on left-skewed data.
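To see why squaring can help with left-skewed data, here is a small sketch on a synthetic sample. Drawing the sample from a Beta(5, 1) distribution is an assumption made only to obtain non-negative, left-skewed values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Synthetic left-skewed data in [0, 1]: most values sit near 1, with a tail toward 0.
scores = pd.Series(rng.beta(5, 1, size=5000), name="scores")

# Squaring stretches the large values more than the small ones,
# which pulls a left-skewed distribution toward symmetry.
scores_sq = scores ** 2

skew_before = scores.skew()
skew_after = scores_sq.skew()
print("skew before:", skew_before)
print("skew after: ", skew_after)
```

The skewness stays negative here but moves closer to zero. How much it helps depends on the data, so it is worth checking the result rather than assuming it.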
Example:
I’ve added a Jupyter Notebook showcasing a predictive model’s accuracy on the Titanic dataset. It compares accuracy with and without transformations, using only age, fare, and survival status (survived).
Find the notebook on my GitHub: https://github.com/paresh122/blog_notebooks/tree/main/Function%20transformer
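For readers who want to try the comparison without the notebook, the following is a rough sketch of the same kind of experiment. It is not the notebook's code: the data is synthetic, the `age` and `fare` columns and the target are generated just for illustration, and the choice of logistic regression with 5-fold cross-validation is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the two features (NOT the Titanic data).
X = pd.DataFrame({
    "age": rng.normal(30, 12, n).clip(1, 80),
    "fare": rng.lognormal(3.0, 1.0, n),   # right-skewed
})
# Synthetic binary target that depends on the log of the fare plus noise.
y = (np.log1p(X["fare"]) + rng.normal(0, 1, n) > 3.2).astype(int)

# Model 1: raw features.
baseline = LogisticRegression(max_iter=1000)

# Model 2: log-transform "fare", pass "age" through unchanged.
log_fare = ColumnTransformer(
    [("log", FunctionTransformer(np.log1p), ["fare"])],
    remainder="passthrough",
)
transformed = make_pipeline(log_fare, LogisticRegression(max_iter=1000))

acc_raw = cross_val_score(baseline, X, y, cv=5, scoring="accuracy").mean()
acc_log = cross_val_score(transformed, X, y, cv=5, scoring="accuracy").mean()
print("accuracy without transform:", acc_raw)
print("accuracy with log transform:", acc_log)
```

Whether the transformed model wins depends on the data and the algorithm, so it is worth running both and comparing rather than assuming the transformation always helps.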