Statistical concepts are used to clean, visualize, and analyze data and to draw insights from a dataset. Statistics computed on the whole or a part of a dataset can unveil a lot of information that is useful in making important decisions. This article broadly covers the important statistical concepts required to make sense of data and to understand machine learning algorithms. Statistics is broadly of two types –
- Descriptive Statistics — Descriptive statistics involves summarizing the data in the form of tables, charts, and graphs. It describes the features or characteristics of the dataset and includes the calculation of measures that tell us more about the data. Visualizing data through charts like bar charts, scatter plots, pie charts, histograms, etc. gives clarity about the distribution of the data and the independent and dependent variables, making the data easier to interpret.
- Inferential Statistics — As the name suggests, inferential statistics involves drawing insights about the entire population/dataset from a sample. Several tests are run on the sample data, and the insights from them are used to derive conclusions about the population. Since we do not analyze the entire dataset and the information is based on the analysis of a sample, the results are not 100% accurate and are expressed in terms of probability.
Basic Statistics Concepts
Mean
Mean is the arithmetic average of all the values of a particular variable.
Formula to calculate the mean –
x̄ = (x₁ + x₂ + … + xₙ)/n = (Σxᵢ)/n, where n is the number of observations.
We can use the AVERAGE() function in Excel to calculate the mean.
Median
Median is the middle value of a particular variable after all the values are sorted in ascending order. We can use the MEDIAN() function in Excel to find the median.
Mode
Mode is the most frequently occurring value in a particular variable. The MODE() function in Excel returns the most frequent value.
Note: Mean, Median, and Mode are known as measures of central tendency.
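As a quick illustration, here is a minimal Python sketch (using the built-in statistics module) that computes all three measures on a small hypothetical list of values:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # hypothetical sample values

print(statistics.mean(data))    # arithmetic average -> 5.0
print(statistics.median(data))  # middle value of the sorted data -> 4.0
print(statistics.mode(data))    # most frequent value -> 3
```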
Standard Deviation
Standard Deviation shows how spread out the distribution is, or in other words, how far the data points typically lie from the mean.
Formula to calculate Standard Deviation –
SD for population –
σ = √( Σ(xᵢ - μ)² / N )
Where N is the total number of data points in the population and μ is the population mean.
SD for sample –
s = √( Σ(xᵢ - x̄)² / (n-1) )
Where n is the total number of data points in the sample and x̄ is the mean of the observations.
The sample SD is used to estimate the SD of the population.
Note: The Square of Standard Deviation is called Variance.
The STDEV.P and STDEV.S functions are used in Excel to calculate the population and sample standard deviations respectively.
The VAR.P and VAR.S functions are used in Excel to calculate the population and sample variances respectively.
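For those working outside Excel, an equivalent sketch using Python’s built-in statistics module (the data here is made up):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

print(statistics.pstdev(data))     # population SD (like STDEV.P) -> 2.0
print(statistics.stdev(data))      # sample SD, divides by n-1 (like STDEV.S)
print(statistics.pvariance(data))  # population variance (like VAR.P) -> 4.0
print(statistics.variance(data))   # sample variance (like VAR.S)
```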
Range
Range is the difference between the largest and the smallest value in a dataset.
In Excel, we can use the MAX() and MIN() functions and take their difference to calculate the range.
Interquartile Range
A percentile shows the percentage of data points that lie below a particular value for a variable. For example, value at the 80th percentile shows that 80 percent of data points lie below that value.
Quartiles divide the whole dataset into 4 equal parts. The 1st, 2nd, 3rd, and 4th quartiles are the values at the 25th, 50th, 75th, and 100th percentiles respectively.
The difference between the 3rd and the 1st quartile (Q3 - Q1) is known as the Interquartile Range (IQR). The IQR is important for understanding the spread of the data and for finding outliers.
Generally, values that are less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR are considered outliers.
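A small Python sketch of this outlier rule, assuming NumPy is available (the data is hypothetical):

```python
import numpy as np

data = np.array([3, 5, 7, 8, 9, 11, 13, 15, 40])  # hypothetical data with one extreme value

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the usual outlier fences

outliers = data[(data < lower) | (data > upper)]
print(iqr, outliers)  # 40 falls outside the fences
```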
Confidence Interval
A confidence interval is the range of values within which an estimate is likely to fall. There are different levels of confidence. For example, a 95% confidence interval means we can be 95% confident that the true value lies within the calculated range.
Confidence interval can be calculated as –
CI = x̄ ± z * (s/√n)
Where z is the critical Z-score calculated based on the level of confidence, s is the sample standard deviation, and n is the sample size.
z * (s/√n) is known as the margin of error.
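As a sketch, the same calculation in Python using scipy.stats.norm for the critical z-score (the sample values are made up):

```python
import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7])  # hypothetical sample
confidence = 0.95

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))  # s / sqrt(n)
z = stats.norm.ppf((1 + confidence) / 2)        # critical z-score (~1.96 for 95%)
margin = z * se                                 # margin of error

print(f"{confidence:.0%} CI: {mean - margin:.3f} to {mean + margin:.3f}")
```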
Simpson’s Paradox
Simpson’s Paradox is a phenomenon in which data shows a certain trend when it is split into groups, but the trend reverses or disappears when the groups are combined. For example, a treatment may appear better within every age group of patients, yet appear worse once all the groups are pooled together.
Central Limit Theorem
The central limit theorem states that if you take sufficiently large random samples from a population (with replacement), then the distribution of the sample means will be very close to a normal distribution, regardless of the shape of the population’s distribution.
Generally, if the sample size is 30 or more, the distribution of sample means is assumed to be normal.
If n is the size of each sample, then the standard deviation of the sample means would be –
SD of sample means = 𝝈/√n
Where 𝝈 is the standard deviation of the population.
This is also known as the standard error of the mean.
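The theorem is easy to see in a simulation. Below is a minimal sketch that samples from a clearly non-normal (exponential) population and checks that the spread of the sample means matches 𝝈/√n (all numbers here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # skewed, non-normal population

n = 50  # size of each sample
sample_means = [rng.choice(population, size=n, replace=True).mean()
                for _ in range(2_000)]

# The sample means cluster around the population mean, roughly normally distributed
print(np.mean(sample_means), population.mean())
# Their spread is close to the standard error sigma/sqrt(n)
print(np.std(sample_means), population.std() / np.sqrt(n))
```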
Probability Distributions
A probability distribution shows the relationship between all the different values a variable can take and their probabilities of occurrence. A probability distribution is plotted based on a function, which can be of two types –
1- Probability Density Function (PDF) — A probability density function gives the relative likelihood of a particular value occurring (for a discrete variable, the analogous probability mass function gives the probability of each value).
2- Cumulative Distribution Function (CDF) — A cumulative distribution function gives the probability of occurrence of a particular value or any value less than it.
Different types of probability distributions
1- Bernoulli Distribution — A Bernoulli distribution has 2 possible outcomes. One is success and the other is failure. If p is the probability of success, then 1-p is the probability of failure. The mass function is given by –
P(n) = pⁿ(1-p)¹⁻ⁿ
Where n ϵ {0, 1}
The expected value of the function is –
E(X) = 1*p + 0*(1-p) = p
Variance = p(1-p)
2- Uniform Distribution — A uniform distribution is a distribution in which every value is equally likely. If a and b are the two extreme values of the variable, then the density function is given by –
f(x) = 1/(b-a), for a ≤ x ≤ b
The mean is given by (a+b)/2 and the variance is given by (b-a)²/12
3- Binomial Distribution — A binomial distribution has two possible outcomes, success and failure, and the probability of success is the same for all the trials. This type of distribution is used when we need to find the probability of x successes out of n trials. If p is the probability of success, then the mass function is given by –
P(x) = ⁿCₓ pˣ (1-p)ⁿ⁻ˣ
The mean is given by n*p and the variance is given by n*p*(1-p).
4- Normal Distribution — A normal distribution is a bell-shaped distribution which is symmetric about its mean. It is also known as a Gaussian distribution. In a normal distribution, the mean, median, and mode are all equal. If the mean is μ and the standard deviation is 𝝈, then the density function is given by –
f(x) = (1/(𝝈√(2π))) e^(-(x-μ)²/(2𝝈²))
Empirical Rule (68–95–99.7 Rule)
This rule states that for a normal distribution, 68% of the data lies within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% within 3 standard deviations.
5- Poisson Distribution — Poisson distribution is a distribution that is used to determine the probability of the number of times an event occurs in a given time period. The mass function of this distribution is a discrete function.
If μ is the average number of events per unit time and t is the length of the time interval, then 𝛌 is given by μ*t.
The mass function is given by –
P(x) = (e^(-𝛌) 𝛌ˣ) / x!
Where x is the number of events in the time interval t.
The mean and variance of the distribution are both equal to 𝛌.
6- Exponential Distribution — The exponential distribution is used to determine the probability of the time elapsed between two events. If μ is the mean time between events, then the decay parameter 𝛌 is given by 1/μ.
The density function is given by –
f(x) = 𝛌e^(-𝛌x), for x ≥ 0
Where x is the time elapsed between events.
The mean and standard deviation are both equal to μ.
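All of the distributions above are available in scipy.stats. A brief sketch (the parameter values here are arbitrary):

```python
from scipy import stats

print(stats.bernoulli(p=0.3).pmf(1))         # Bernoulli: P(success) = 0.3
print(stats.uniform(loc=2, scale=8).pdf(5))  # uniform on [2, 10]: 1/(b-a) = 0.125
print(stats.binom(n=10, p=0.5).pmf(4))       # binomial: P(4 successes in 10 trials)
print(stats.norm(loc=0, scale=1).pdf(0))     # normal density at the mean
print(stats.poisson(mu=3).pmf(2))            # Poisson: P(2 events) with lambda = 3
print(stats.expon(scale=2).pdf(1))           # exponential with mean 2 (lambda = 1/2)
```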
Standardization
Variables are often measured on different scales, which makes comparing them difficult. To avoid this, the variables are brought onto the same scale. This is known as standardization.
In standardizing variables, we calculate their z-scores.
A Z-score is calculated by –
Z = (x-μ)/𝝈
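In Python, standardization is a one-liner once the mean and SD are known; a sketch with hypothetical data:

```python
import numpy as np

x = np.array([150.0, 160.0, 170.0, 180.0, 190.0])  # hypothetical heights in cm

z_scores = (x - x.mean()) / x.std()  # use x.std(ddof=1) for the sample SD instead
print(z_scores)  # standardized values: mean 0, standard deviation 1
```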
Chebyshev’s Theorem
This theorem states that at least (1- 1/z²)*100 percent of the observations in any data set will be within z standard deviation of the mean, where z is any number greater than 1.
Skewness
Skewness refers to the asymmetry of a dataset’s distribution. If a bell curve is not symmetric but is shifted to the right or left, it is said to be skewed. A normal distribution has 0 skewness.
Datasets having a positive skewness are right skewed and those having negative skewness are left skewed.
The mean is greater than median for a right skewed dataset and the median is greater than the mean for a left skewed dataset.
Kurtosis
Kurtosis is a measure that is used to determine how much of the data lies in the tails of the bell curve. Datasets having high kurtosis have more data in the tails, i.e. more extreme values (outliers), and datasets having low kurtosis have less data in the tails, i.e. fewer extreme values.
Types of kurtosis in a dataset –
- Mesokurtic — A mesokurtic distribution has a kurtosis of 3 and this distribution is similar to a perfect normal distribution.
- Leptokurtic — A leptokurtic distribution has a kurtosis greater than 3. Thus, it has more data in its tails, i.e. longer, heavier tails.
- Platykurtic — A platykurtic distribution has a kurtosis lesser than 3. Thus, it has less data in its tails, i.e. shorter, lighter tails.
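Both measures can be computed with scipy.stats. Note that scipy’s kurtosis() returns excess kurtosis (kurtosis minus 3) by default; pass fisher=False to get the convention used above. A sketch on simulated normal data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(size=10_000)  # simulated draws from a normal distribution

print(stats.skew(data))                    # ~0 for a symmetric distribution
print(stats.kurtosis(data, fisher=False))  # ~3 for a normal (mesokurtic) distribution
print(stats.kurtosis(data))                # excess kurtosis, ~0 for a normal
```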
Covariance and Correlation
Covariance is a measure that is used to determine the direction of the relationship between two variables. The value of covariance can range from -∞ to +∞. A negative value shows that as the value of one variable increases, the other one decreases. A positive value shows that the value of one variable increases with the other. It can be calculated by the following formula –
Sₓᵧ = Σ(xᵢ - x̄)(yᵢ - ȳ) / (n-1)
Where x̄ and ȳ are the mean values of x and y variables and n is the number of observations.
Sₓᵧ> 0: Positively linearly related
Sₓᵧ= 0: Not linearly related
Sₓᵧ< 0: Negatively linearly related
Covariance is not very reliable for interpreting the magnitude of the relationship, as its value depends on the scale of the variables. Thus, correlation, which scales the covariance by the standard deviations of the two variables, is a more valid measure of the strength and direction of the relationship –
rₓᵧ = Sₓᵧ / (sₓ sᵧ)
Correlation rₓᵧ closer to +1: Strong positive linear relationship
Correlation rₓᵧ closer to 0: Weak linear relationship
Correlation rₓᵧ closer to -1: Strong negative linear relationship
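A small sketch computing both measures on made-up data, using NumPy for the covariance and scipy.stats for the Pearson correlation:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # hypothetical, roughly linear in x

cov_xy = np.cov(x, y)[0, 1]        # sample covariance (divides by n-1)
r, p_value = stats.pearsonr(x, y)  # correlation coefficient and its p-value

print(cov_xy, r)  # positive covariance, r close to +1
```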
Spearman Rank Correlation
Spearman Rank Correlation is used to determine the correlation between two ranked variables. Its value ranges between -1 and +1. It is calculated as –
rₛ = 1 - (6 Σdᵢ²) / (n(n²-1))
Where dᵢ is the difference in the ranks of the two variables for each observation and n is the number of observations.
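scipy.stats also implements this directly; a sketch on hypothetical paired values:

```python
from scipy import stats

x = [86, 97, 99, 100, 101, 103, 106, 110, 112, 113]
y = [2, 20, 28, 27, 50, 29, 7, 17, 6, 12]  # hypothetical paired observations

rho, p_value = stats.spearmanr(x, y)  # ranks the data internally
print(rho)  # rank correlation, between -1 and +1
```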
Coefficient of Determination (R²)
The R² value is used to determine the percentage of variation in one variable that can be explained by the other variable(s). It is used to test the quality of a model: it shows how well the model predicts the outcome and is sometimes referred to as the goodness of fit. The value of R² is between 0 and 1.
Confusion Matrix
A confusion matrix compares the values predicted by a model against the actual values. It is mainly used to evaluate the performance of classification models.
Terminologies of a confusion matrix –
- True Positive — The model predicted positive and it’s actually true.
- True Negative — The model predicted negative and it’s actually true.
- False Positive (Type 1 error) — The model predicted positive but it’s actually false.
- False Negative (Type 2 error) — The model predicted negative but it’s actually false.
Sensitivity (True Positive Rate) = TP/(TP + FN)
Specificity (True Negative Rate) = TN/(TN + FP)
False Positive Rate = FP/(FP + TN) = 1 - Specificity
Precision = TP/(TP + FP)
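These quantities are straightforward to compute with scikit-learn; a sketch with made-up labels and predictions:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # hypothetical actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]  # hypothetical model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # true positive rate
fpr = fp / (fp + tn)          # false positive rate
precision = tp / (tp + fp)

print(sensitivity, fpr, precision)
```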
Receiver Operating Characteristic (ROC) Curve
This is a curve of the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity). It provides a simple and effective way to summarize the information from all the different confusion matrices that can be created by varying the classification threshold of the model.
The Area Under the Curve (AUC) gives a way to compare different models: the model having the larger AUC should be chosen. For example, if one model’s ROC curve lies above another’s and so encloses a larger area, it should be preferred over the other.
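scikit-learn can trace the curve and compute the AUC from predicted scores; a minimal sketch with made-up data:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                     # hypothetical labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]  # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one point per threshold
print(roc_auc_score(y_true, y_score))              # AUC: larger is better
```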
Hypothesis Testing
Hypothesis testing is an analysis in which an assumption about a population is put to the test using sample data, and a conclusion is drawn about whether the assumption holds.
The assumption made is called the null hypothesis, denoted by H₀, and the opposite of the null hypothesis is the alternate hypothesis, Hₐ. As a convention, we take the null hypothesis to be the statement that nothing changes in the population, or in other words, that the event being tested will not occur.
A one-tailed test is a directional test in which the sample is tested for being either greater or lesser than a specific value, whereas in a two-tailed test, the sample is tested for both. A one-tailed test has a single region of values (the rejection region), and if the test statistic falls in this region, the null hypothesis is rejected. A two-tailed test has two such regions.
A P-value is a metric that is used to decide whether to reject the null hypothesis. Generally, a value of 0.05 (the alpha value) is the threshold for rejecting the null hypothesis: the null hypothesis is rejected if the P-value is less than 0.05.
Types of Hypothesis Tests
1- Z-Test — A Z-test is used when the population standard deviation is known, or when it is unknown but the sample size is sufficiently large (n ≥ 30).
For a single sample Z-test, first the null and alternate hypotheses are formed and a z-statistic is calculated as follows –
z = (x̄ - μ) / (𝜎/√n)
Where x̄ is the sample mean, μ is the population mean under the null hypothesis, 𝜎 is the population standard deviation, and n is the sample size.
Now, for a significance (α) value, a critical z-score is calculated. The values beyond this z-score form the region of rejection. Thus, if the z-statistic falls in the rejection region, we reject the null hypothesis.
A two-sample Z-test is used to compare the means of two different samples. It is likewise used when the population SDs of both samples are known or the sample sizes are sufficiently large.
The z-statistic is calculated as –
z = ((x̄₁ - x̄₂) - (μ₁ - μ₂)) / √(𝜎₁²/n₁ + 𝜎₂²/n₂)
Where x̄₁ and x̄₂ are the sample means, μ₁ and μ₂ are the population means, 𝜎₁ and 𝜎₂ are the population standard deviations, and n₁ and n₂ are the sample sizes of the two samples.
Similar to the one-sample test, if the z-statistic lies in the rejection region, then the null hypothesis is rejected and the alternate hypothesis is accepted.
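scipy does not ship a one-sample z-test, but it is easy to sketch by hand (the sample and hypothesized values are made up):

```python
import numpy as np
from scipy import stats

# Hypothetical: H0 says mu = 100; the population SD is known to be 15
sample = np.random.default_rng(2).normal(loc=103, scale=15, size=50)
mu0, sigma = 100, 15

z = (sample.mean() - mu0) / (sigma / np.sqrt(len(sample)))
p_value = 2 * stats.norm.sf(abs(z))  # two-tailed p-value

print(z, p_value)  # reject H0 if p_value < 0.05
```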
2- T-Test — A T-test is similar to a Z-test but is used if the population standard deviation is unknown and the sample size is small (n<30).
For a single sample T-test, first the null and alternate hypotheses are formed and a t-statistic is calculated as follows –
t = (x̄ - μ) / (S/√n)
Where x̄ is the sample mean, μ is the population mean, S is the sample standard deviation, and n is the sample size.
Similar to the Z-test, if the t-statistic falls in the rejection region, we reject the null hypothesis.
For a two-sample T-test, the t-statistic is calculated as –
t = ((x̄₁ - x̄₂) - (μ₁ - μ₂)) / √(s₁²/n₁ + s₂²/n₂)
Where x̄₁ and x̄₂ are the sample means, μ₁ and μ₂ are the population means, s₁ and s₂ are the sample standard deviations, and n₁ and n₂ are the sample sizes of the two samples.
If the t-statistic lies in the rejection region, then the null hypothesis is rejected and the alternate hypothesis is accepted.
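Both versions are built into scipy.stats; a sketch on hypothetical samples:

```python
from scipy import stats

# One-sample: H0 says the population mean is 5.0
sample = [5.1, 4.8, 5.6, 5.2, 4.9, 5.4, 5.0, 5.3]
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(t_stat, p_value)

# Two-sample: H0 says the two groups have equal means
group_a = [12.1, 11.8, 12.4, 12.0, 11.9]
group_b = [12.8, 13.1, 12.9, 13.4, 12.7]
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)  # reject H0 if p_value < 0.05
```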
Test for proportions — In the case of testing for proportions, where we know the population proportion as well as the sample proportion, we use the following formula to calculate the test statistic –
z = (p̂ - p₀) / √(p₀(1-p₀)/n)
Where p̂ is the sample proportion, p₀ is the population proportion, and n is the sample size.
For a two-sample test for proportions, we use –
z = (p̂₁ - p̂₂) / √(p̂₀(1-p̂₀)(1/n₁ + 1/n₂))
Where p̂₁ and p̂₂ are the proportions of the 2 samples and p̂₀ is the pooled proportion of the two samples combined.
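A hand-rolled sketch of the one-sample version (the counts are hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical: 62 successes in 100 trials, H0 says p = 0.5
successes, n, p0 = 62, 100, 0.5

p_hat = successes / n
z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
p_value = 2 * stats.norm.sf(abs(z))  # two-tailed

print(z, p_value)  # z = 2.4, p ~ 0.016 -> reject H0 at alpha = 0.05
```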
3- Chi-Squared Test — A Chi-Squared test is used to measure the difference between observed and expected values. It is used for categorical data, to check whether the actual data differs from what is expected. There are two types of Chi-Squared tests –
- Chi-Squared test of Independence — This is used to determine if two categorical variables are independent of each other. It is generally used for nominal data. The null hypothesis, by convention, assumes that the two variables are independent. For this test, we need to prepare a contingency table and find the chi-squared statistic.
For example, if there are two variables, one having categories A, B, C, D and the other having categories α, β, γ, and we are required to find out whether these variables are related, we will perform a chi-squared test of independence.
Contingency Table
In the contingency table, each cell contains an observed count, and the value in brackets is the expected count, calculated as (nᵢ*nⱼ)/N, where nᵢ and nⱼ are the totals of the cell’s row and column respectively and N is the grand total.
A chi-squared statistic is calculated as –
χ² = Σ (Oᵢ - Eᵢ)² / Eᵢ
Where Oᵢ are the observed counts and Eᵢ are the expected counts.
Now, a critical chi-squared value is found from the chi-squared table based on the significance value (alpha value) and the degrees of freedom, which is calculated as (r-1)*(c-1), where r and c are the number of rows and columns in the table respectively.
If the chi-squared statistic is greater than the critical chi-squared value, then the null hypothesis is rejected, and vice versa.
- Chi-Squared test for Goodness of Fit — This is similar to the test of independence. It is used to check a statistical model’s goodness of fit: how well the distribution of the expected values (obtained from the model) fits the distribution of the observed values. If the goodness of fit is high, the expected values are close to the observed values, and vice versa. The null hypothesis is that the distribution of the expected values is the same as the distribution of the observed values. This test is generally used with a single categorical variable.
The method for this test is the same as for the test of independence. The chi-squared statistic is calculated in the same way and the critical chi-squared value is obtained from the table. For this test, the degrees of freedom is (n-1), where n is the number of categories in the variable.
Now, similar to the above test, if the chi-squared statistic is greater than the critical chi-squared value, then the null hypothesis is rejected, and vice versa.
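Both variants are available in scipy.stats; a sketch with made-up counts:

```python
import numpy as np
from scipy import stats

# Test of independence on a hypothetical 2x3 contingency table of observed counts
observed = np.array([[30, 14, 16],
                     [20, 26, 24]])
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(chi2, p_value, dof)

# Goodness of fit: observed counts vs. expected counts for one categorical variable
chi2, p_value = stats.chisquare(f_obs=[18, 22, 20, 40], f_exp=[25, 25, 25, 25])
print(chi2, p_value)
```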
4- Analysis of Variance (ANOVA) Test — The ANOVA test is used to test whether there is a difference between the means of 2 or more groups. It is basically an extension of the T-test. The ANOVA test is of two types –
- One-way ANOVA test — A one-way ANOVA test is used to test if there is any difference between the means of more than two groups of a single independent variable. The null hypothesis is that there is no significant difference between the means of the different groups.
H₀: μ₁ = μ₂ = μ₃ = ….μₖ
Where k is the number of groups
The alternate hypothesis is that at least the means of two groups differ from each other.
Let’s suppose that we have scores for three groups A, B, and C, with 7 observations in each group.
After stating the hypotheses, the degrees of freedom are calculated.
df(between) = a-1 = 2
df(within) = N-a = 18
df(total) = N-1 = 20
Where N is the total number of observations (here 21), n is the number of observations in each group (here 7), and a is the number of groups or levels (here 3).
To find a critical value, df(between) and df(within) are used.
Now, an F-table that has df(within) and df(between) is used to find the critical value. If the test statistic is greater than this critical value, then the null hypothesis is rejected. For this data, the critical value from the table (at α = 0.05) is found to be 3.554.
To calculate the test statistic, we first calculate the sums of squares between and within the groups:
SS(between) = Σ Tᵢ²/n - T²/N
SS(total) = Y² - T²/N
SS(within) = SS(total) - SS(between)
Where Tᵢ is the total of each group, T is the grand total, and Y² is the sum of the squares of all the individual values.
So,
A (total): 57
B (total): 47
C (total): 21
T = 57 + 47 + 21 = 125
SS(between) = (57² + 47² + 21²)/7 - 125²/21 = 98.67
Y² = 9² + 8² + 7² + 8² + ….. + 2² = 853
SS(total) = 853 - 125²/21 = 108.95
SS(within) = 108.95 - 98.67 = 10.29
Calculating the test statistic:
F = MS(between)/MS(within)
Where MS = Mean Square = SS/df
MS(between) = 98.67/2 = 49.33 and MS(within) = 10.29/18 = 0.57, so F = 49.33/0.57 ≈ 86.3
This F is the test statistic. Since F is greater than the critical value (3.554), we reject the null hypothesis. Thus, at least one of the means of A, B, and C differs from the others.
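The same test is one call in scipy.stats. The values below are hypothetical but chosen to be consistent with the group totals and sums of squares in the worked example, so the resulting F is about the same:

```python
from scipy import stats

# Hypothetical scores for three groups, 7 observations each
a = [9, 8, 7, 8, 7, 9, 9]  # total 57
b = [7, 6, 7, 7, 7, 7, 6]  # total 47
c = [4, 3, 2, 3, 4, 3, 2]  # total 21

f_stat, p_value = stats.f_oneway(a, b, c)
print(f_stat, p_value)  # F ~ 86, far above the critical value 3.554 -> reject H0
```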
- Two-way ANOVA test — A two-way ANOVA test is used to test if there is any difference between the means of the groups of two independent variables (factors). This test checks the effect of each factor on the group means, and also whether the two factors interact with each other. So, accordingly, three pairs of null and alternate hypotheses are formed.
Here, we form the hypotheses that the means of A, B, and C are equal, that the means of α and β are equal, and that the two independent variables do not affect each other (no interaction).
Now, the degrees of freedom are calculated:
df(column) = a-1 = 2-1 = 1
df(row) = b-1 = 3-1 = 2
df(column x row) = (a-1)(b-1) = 2
df(error) = N-ab = 36-(2)(3) = 30
df(total) = N-1 = 35
Where a is the number of levels of the column variable (here 2), b is the number of levels of the row variable (here 3), and N is the total number of observations (here 36).
Now, three decision rules are stated for the three different hypotheses, and the critical values are found from the F-table:
Row (df(row), df(error)) = (2, 30) = 3.32
Column (df(column), df(error)) = (1, 30) = 4.17
Row x Column (df(column x row), df(error)) = (2, 30) = 3.32
Now, similar to the one-way test, we calculate the test statistic for each pair of hypotheses.
But before that, we need to find the sums of squares. With grand total T = 93 + 84 = 177, N = 36 observations, and n = 6 observations per cell:
SS(column) = Σ aᵢ²/(b*n) - T²/N
a₁ = 3+4+5+…+7 = 93
a₂ = 1+2+1+1…+9 = 84
SS(column) = (93² + 84²)/18 - 177²/36 = 2.25
SS(row) = Σ bⱼ²/(a*n) - T²/N
b₁ = 3+4+5+…+2 = 30
b₂ = 5+4+3+….+4 = 53
b₃ = 7+8+7+….+9 = 94
SS(row) = (30² + 53² + 94²)/12 - 177²/36 = 175.17
SS(column x row) = Σ (aᵢbⱼ)²/n - T²/N - SS(column) - SS(row)
a₁b₁ = 3+4+5+3+4+3 = 22
a₁b₂ = 5+4+3+5+5+5 = 27
And so on…
SS(column x row) = 17.16
SS(total) = ΣY² - T²/N = 210.75
Where aᵢ are the column totals, bⱼ are the row totals, aᵢbⱼ are the cell totals, and ΣY² is the sum of the squares of all the individual values.
Calculating the test statistics:
SS(error) is calculated by subtracting all the other sums of squares from the total:
SS(error) = 210.75 - 2.25 - 175.17 - 17.16 = 16.17
MS and F are calculated in the same way as in the one-way test, with MS(error) = SS(error)/df(error) as the denominator.
Now, since the F value for the column is not greater than the critical value, the null hypothesis is not rejected. Thus, there is not enough evidence to conclude that the mean of α differs from the mean of β.
The F value for the row is greater than the critical value, so the null hypothesis is rejected, showing that the means of A, B, and C are not all equal.
The F value for the row x column interaction is also greater than the critical value, so the null hypothesis is rejected, showing that the two variables interact with each other.
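In practice, a two-way ANOVA is usually run with statsmodels rather than by hand. A sketch on simulated data with the same 2 x 3 layout (the factor names and values here are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# 6 observations per cell: 2 levels of "column" x 3 levels of "row" = 36 rows
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "column": np.repeat(["alpha", "beta"], 18),
    "row": np.tile(np.repeat(["A", "B", "C"], 6), 2),
})
df["score"] = rng.normal(loc=5, scale=1, size=36) + (df["row"] == "C") * 3

model = ols("score ~ C(column) * C(row)", data=df).fit()  # * includes the interaction
print(sm.stats.anova_lm(model, typ=2))  # SS, df, F, and p-value for each effect
```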
In conclusion, these four types of hypothesis tests are the ones mainly used in statistics to test assumptions and to adjust models accordingly, so as to obtain a robust model and better results.
This article covered almost all the basic concepts in statistics that are required as a prerequisite to getting started with machine learning.
That’s all folks!
Cheers!