Explained in simple terms
A major tech company has been successfully recruiting college graduates to fill its data scientist positions since 2000. It recruited graduates with math or science majors. Applicants were chosen through a screening test, and candidates who passed it were called for three back-to-back interviews at the company's headquarters. Offers were given to candidates who passed all three interviews. The process looked like this.
This year, the recruitment team forecasted that they wouldn't be able to find enough recruits from 2027 onwards. The company was secretly planning a major investment to expand its analytics and data science capabilities. The company leadership acted quickly on that forecast and set up a committee to investigate and propose remedial steps.
After two excruciating weeks of reviews and meetings, the committee presented its report to the company leadership. A portion of the report got leaked to the press, which faithfully reported it.
…. major technology company based in the Bay Area is on the verge of pulling back plans for a major expansion. A team set up to oversee the plan recommended delaying it indefinitely. A source who didn't want to be identified told this reporter that recession fears….
What was in the actual report?
Was that news report true?
To counter the false news report, the company sent out an email communication to its employees. It said:
Dear employees,
Thanks for your hard work………………………………………………
We are planning to make some huge advancements in the area of data analytics and AI, building on top of our already established………..
The summary of the email was that the company plans to hire as many as 500 graduate data scientists from top-tier universities starting in 2027. The company had been running on-campus recruitment drives since the 2000s, and only at select top-tier universities. At the start in 2000, only 200 graduates applied for the position. Applications grew consistently until they reached around 2,000 in 2020. After 2020, applications dropped but stayed at roughly 1,000. Graduates are highly motivated and enthusiastic about this position. The screening test and interviews eliminate a large percentage of the applicants. The company is proud to have the best of the best data scientists in the world, but it is concerned about how to increase hiring without compromising on quality.
Though the email didn’t go beyond that, the committee gave out some facts for those who were interested.
What did the committee say?
The committee gave out some interesting facts and recommendations.
What were the facts?
- Only around 25% of applicants pass the screening test.
- There are 100 questions in the screening test. The minimum passing score is 70%. With level 10 being the most difficult, the questions are distributed as follows:
– 5 with difficulty level 9
– 20 with difficulty level 8
– 25 with difficulty level 7
– 25 with difficulty level 5
– 20 with difficulty level 4
– 5 with difficulty level 2
- The current trend shows only a small portion of the screened candidates pass the 3 interviews to get an offer.
- Only 2 out of every 100 hires majored in chemistry. (Why is that relevant??)
- The company needs a recruitment pool of 16,000+ graduates to achieve a targeted hiring level of 500 per year in the near future IF the current recruitment strategy continues. (Why such a huge pool??)
The company would have to spend huge amounts of time and money to vet and recruit from 16,000 candidates. Even if the company recruits from additional universities, that is not sustainable. So the recruitment criteria for the screening test and interviews need to change.
Committee recommendations
- Since only 25% pass the screening test, the committee concluded that the classification of difficulty levels might not be in alignment with how the test takers perceive difficulty. They didn't recommend reclassifying the questions, as they didn't have scientific backing for that. Instead, they asked the test preparers to tilt the test towards the lower difficulty levels to avoid losing potentially hirable candidates. The interviewers will get an opportunity to talk to more candidates if more candidates pass the test. The minimum passing score shouldn't change from 70%. The new test composition must be:
– 5 with difficulty level 8
– 20 with difficulty level 7
– 30 with difficulty level 5
– 20 with difficulty level 4
– 15 with difficulty level 3
– 10 with difficulty level 1
- Some candidates who performed very well in 2 interviews failed to impress the interviewer in the third one. This resulted in eliminating many good candidates at the end of the interviewing process. Even some interviewers who had vouched for certain candidates were upset. In future, candidates who win 2 of the 3 interviews should be given an offer.
- The company needs to maintain a pool of 8 to 10 data scientists with a chemistry background to perform some specific tasks related to the new expansion. The committee recommended hiring them from the general public, as there was only a slim chance of filling those roles through campus recruitment. It is risky to leave those positions open.
These were all data-driven facts and action plans. A group of data analysts and data scientists had worked round the clock for the committee to arrive at these conclusions. Let us look at the work behind it and learn some concepts as we read along!
How did the committee arrive at those recommendations?
First, let us start with the reasoning behind the 16,000 applicants IF the current hiring strategy continued. Then we can look at the recommendations to change the hiring strategy, which included changing the test format, relaxing the offer criterion to winning 2 of the 3 interviews, and finally the plan to hire chemistry majors.
Why should 16,000 apply to fill 500 positions IF the current strategy continues?
Since only around 25% pass the screening test, roughly 4,000 of the 16,000 candidates will remain after the screening. Based on the patterns from previous years, only 12.5% of those interviewed passed all 3 interviews, which gives 4,000 × 0.125 = 500.
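A quick back-of-the-envelope sketch of that funnel in Python (the 25% and 12.5% pass rates are the historical figures quoted above; the variable names are just illustrative):

```python
# Work backwards from the hiring target to the required applicant pool,
# using the historical pass rates quoted by the committee.
screening_pass_rate = 0.25    # ~25% pass the screening test
interview_pass_rate = 0.125   # ~12.5% of interviewees win all 3 interviews
target_hires = 500

required_interviewees = target_hires / interview_pass_rate          # 4,000
required_applicants = required_interviewees / screening_pass_rate   # 16,000

print(f"Interviewees needed: {required_interviewees:,.0f}")
print(f"Applicants needed:   {required_applicants:,.0f}")
```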
Why were test format changes recommended?
The team that created the current screening test, with all good intentions, was keen to keep it balanced in terms of difficulty levels. Every year they spent weeks preparing questions, classifying their difficulty levels, and then picking and choosing the ones for the screening test. The model they followed ensured that neither extremely difficult nor extremely easy questions dominated, and that most questions came from the middle of the difficulty range.
So what was the model? After drafting the questionnaire each year, they would plot the difficulty level versus the number of selected questions at each level. They kept modifying the questionnaire until they got a smooth, symmetric graph distributed evenly around the mean difficulty level of 5.
This type of distribution, where values close to the mean occur more frequently than values far away from the mean, is called a normal distribution. It is also known as the Gaussian distribution.
Over the next few years, they will change the test per the committee's recommendation to keep the mean closer to the lower difficulty levels. It will no longer be a normal distribution. See the proposed future distribution.
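If you want to see the two shapes side by side, here is a minimal plotting sketch (assuming matplotlib is available; the question counts are taken straight from the committee's lists above):

```python
import matplotlib.pyplot as plt

# Question counts per difficulty level, from the committee's lists above.
current  = {9: 5, 8: 20, 7: 25, 5: 25, 4: 20, 2: 5}   # roughly bell-shaped around the middle
proposed = {8: 5, 7: 20, 5: 30, 4: 20, 3: 15, 1: 10}  # tilted towards lower difficulty levels

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, (title, dist) in zip(axes, [("Current test", current), ("Proposed test", proposed)]):
    levels = sorted(dist)
    ax.bar(levels, [dist[lvl] for lvl in levels])
    ax.set_title(title)
    ax.set_xlabel("Difficulty level")
axes[0].set_ylabel("Number of questions")
plt.tight_layout()
plt.show()
```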
Why were interview format changes recommended?
The intent of the interviews was to eliminate most of the candidates. Why? Because the company wanted the best of the best. The statistical probability of a candidate winning all three interviews is just around 1/8 (12.5%). In other words, if 4,000 candidates appear for interviews, only 500 will get an offer. There are 8 possible outcomes from the 3 interviews, and only 1 outcome (winning all 3) results in an offer.
If the company changes the offer criteria from winning all three to winning a minimum of two, the probability of a candidate getting an offer leaps from 12.5% to 50%. How is that possible? It can happen in 3 + 1 = 4 out of 8 outcomes, as listed below and checked in the short sketch after the list.
- A: 3 Wins, 0 Fails = 1 way to get an offer
- B: 2 Wins, 1 Fail = 3 ways to get an offer
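Here is that counting made concrete with a small enumeration sketch (purely illustrative, assuming each of the 8 win/fail outcomes is equally likely):

```python
from itertools import product

# Enumerate all 2**3 = 8 equally likely win/fail outcomes of 3 interviews.
outcomes = list(product(["Win", "Fail"], repeat=3))

all_three    = sum(1 for o in outcomes if o.count("Win") == 3)   # 1 outcome
at_least_two = sum(1 for o in outcomes if o.count("Win") >= 2)   # 1 + 3 = 4 outcomes

print(f"P(win all 3)      = {all_three}/{len(outcomes)} = {all_three / len(outcomes):.1%}")
print(f"P(win at least 2) = {at_least_two}/{len(outcomes)} = {at_least_two / len(outcomes):.1%}")
```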
While keeping the requirement at 2 wins for an offer, the company could also increase the total number of interviews. This increases the chances of an offer. If the number of interviews is increased to 5, what are the chances of a candidate getting an offer? The data team provided the committee with a copy of Pascal's triangle for this.
There are 32 possible outcomes from 5 interviews. An offer is given if a candidate wins a minimum of 2 interviews, which translates to:
Chances of winning all 5 interviews = 1/32 (3.12%)
Chances of winning 4 interviews = 5/32 (15.62%)
Chances of winning 3 interviews = 10/32 (31.25%)
Chances of winning 2 interviews = 10/32 (31.25%)
Therefore the chances of getting an offer = (1 + 5 + 10 + 10) / 32 = 26/32 = 81.25%. So the chances of getting an offer go up from 50% to 81.25%.
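The same counting can be done with binomial coefficients, which are exactly the entries in row 5 of Pascal's triangle; a minimal sketch using Python's built-in math.comb:

```python
from math import comb

n = 5            # number of interviews
total = 2 ** n   # 32 equally likely win/fail outcomes

# comb(n, k) = number of ways to win exactly k of the n interviews,
# i.e. the entries of row 5 of Pascal's triangle: 1 5 10 10 5 1.
for k in range(n, 1, -1):
    print(f"Ways to win exactly {k}: {comb(n, k):>2}  ({comb(n, k) / total:.2%})")

offer_ways = sum(comb(n, k) for k in range(2, n + 1))   # at least 2 wins
print(f"P(offer) = {offer_ways}/{total} = {offer_ways / total:.2%}")   # 26/32 = 81.25%
```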
Plotting the number of wins versus its probability gives you a binomial distribution. It is "bi"nomial because each trial has exactly two outcomes: the event happening (winning) or not happening (not winning).
Consider the event "winning all 5 interviews". The probability of this event is 3.12%. The probability of this event not happening is 100 − 3.12 = 96.88%. That 96.88% (ignoring rounding errors) is exactly equal to the probability of winning 4 interviews (15.62%) + the probability of winning 3 interviews (31.25%) + the probability of winning 2 interviews (31.25%) + the probability of winning 1 interview (15.62%) + the probability of winning no interviews (3.12%).
The green bars show the area of the graph where an offer is made (2 wins, 3 wins, 4 wins, or all 5 wins). If you look closely at the dotted curve, you will notice a resemblance to a normal distribution curve. As the number of events (interviews, in this case) increases, the curve slowly turns into a symmetric, bell-shaped one.
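To see that bell shape emerge yourself, here is a quick plotting sketch (assuming scipy and matplotlib, and an equal 50% chance of winning each interview) that compares the win-count probabilities as the number of interviews grows:

```python
import matplotlib.pyplot as plt
from scipy.stats import binom

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
for ax, n in zip(axes, [5, 15, 40]):
    wins = list(range(n + 1))
    pmf = binom.pmf(wins, n, 0.5)                            # P(exactly k wins out of n)
    colors = ["green" if k >= 2 else "grey" for k in wins]   # green = offer region (>= 2 wins)
    ax.bar(wins, pmf, color=colors)
    ax.set_title(f"{n} interviews")
    ax.set_xlabel("Number of wins")
axes[0].set_ylabel("Probability")
plt.tight_layout()
plt.show()
```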
Note: In this case, we assumed that the chance of winning each interview is the same. In real life that is rarely the case. The interviewer at interview 1 could be failing more candidates than the other two interviewers. Say the chance of winning interview 1 is only 30%. If so, that probability should be factored into the calculation.
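As a rough illustration of that point, the sketch below recomputes the 3-interview, at-least-2-wins offer probability with unequal win chances (the 30% for interview 1 comes from the note above; the 50% for the other two interviews is an assumption):

```python
from itertools import product

# Hypothetical win probabilities per interview (interview 1 is harder).
p_win = [0.30, 0.50, 0.50]

offer_prob = 0.0
for outcome in product([True, False], repeat=3):   # every win/fail combination
    prob = 1.0
    for won, p in zip(outcome, p_win):
        prob *= p if won else (1 - p)
    if sum(outcome) >= 2:                          # an offer needs at least 2 wins
        offer_prob += prob

print(f"P(offer with unequal interviews) = {offer_prob:.0%}")   # 40%, down from 50%
```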
What is the probability of hiring 8–10 chemistry majors?
Statistics from past years show that for every 100 hires in a year, 2 were chemistry majors on average. In some years no chemistry majors were hired; in other years up to 6 were hired per 100. Note: For 500 hires, that works out to around 10 (2 per 100 × 5) chemistry majors hired. But there is no guarantee that this will happen.
There is a mathematical pattern in the above observation. Let us analyze this. Say,
- “Hiring a chemistry major in every 100 hires” is an event
- The time interval is one year
- The mean rate (average number) of “Hiring a chemistry major in every 100 hires” per time interval is 2 (lambda)
- The number of “chemistry majors hired in every 100 hires” in a specific year varies (k)
Quoting Wikipedia:
A distribution that gives the probability of a given number of events occurring in a fixed interval of time, when these events occur with a known constant mean rate and independently of the time since the last event, is called a Poisson distribution.
Plugging the historical mean rate of λ = 2 into the Poisson model, there is only about a 27% chance of hiring exactly 2 chemistry majors per 100 hires in a year. The same 27% applies to hiring exactly 1, the chance of hiring 3 is about 18%, and the chance of hiring 4 is around 9%. Beyond 6, the chances are close to nil. The following Poisson distribution plot summarizes it.
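Those percentages can be reproduced with the Poisson probability mass function for λ = 2; a minimal sketch:

```python
from math import exp, factorial

lam = 2  # average number of chemistry majors per 100 hires per year

def poisson_pmf(k, lam):
    """P(exactly k events in the interval) for a Poisson(lam) distribution."""
    return (lam ** k) * exp(-lam) / factorial(k)

for k in range(7):
    print(f"P(exactly {k} chemistry majors per 100 hires) = {poisson_pmf(k, lam):.1%}")
```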
This finding is what led the committee to recommend that the chemistry majors be hired from the general public.
Conclusion
The committee's data-driven approach became a highly regarded piece of work in the organization, and the data science team behind it was rewarded as well. More and more work began to pile up on the desk of the data science team. Now the challenge became: beyond the 500 target, how many more do they need to hire?
If you enjoyed this content, please give it a like and follow me on Medium! Your support helps create more valuable content for you. Thank you for your support!