GroupBy in Pandas DataFrame: A Comprehensive Guide with Examples and Practice Problems | by Rany ElHousieny

Pandas, a popular data manipulation library in Python, provides the GroupBy feature to efficiently group and analyze data in a DataFrame. GroupBy allows us to split our data into groups based on one or more criteria, apply calculations or transformations on these groups, and then combine the results. This article will explore various aspects of GroupBy in Pandas DataFrame, discussing its syntax, functionality, and providing detailed examples with corresponding outputs. Additionally, we’ll include some practice problems to solidify our understanding of this powerful feature.
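
The split, apply, and combine steps described above can be sketched by hand and compared with a single GroupBy call. This is a minimal illustration, not the library's internal implementation; the `toy` table and its `team` and `points` columns are hypothetical names chosen for the example.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration.
toy = pd.DataFrame({"team": ["a", "b", "a", "b"], "points": [1, 2, 3, 4]})

# Split: iterate over (key, sub-DataFrame) pairs.
# Apply: sum the 'points' column of each piece.
# Combine: collect the per-group results into one Series.
manual = pd.Series(
    {key: group["points"].sum() for key, group in toy.groupby("team")}
)

# The same three steps expressed as one GroupBy expression.
grouped = toy.groupby("team")["points"].sum()

print(manual)
print(grouped)
```

Both Series hold one total per team; the GroupBy version is shorter and usually faster than looping over the groups.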

The syntax for using the GroupBy feature in Pandas is as follows:

df.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False)

by: Specifies what to group by: one or more column labels, a mapping, or a function.
axis: Determines whether the grouping is performed along rows (axis=0) or columns (axis=1). Deprecated since pandas 2.1.
level: Enables grouping by one or more levels of the DataFrame’s index when it is a MultiIndex.
as_index: Determines whether the group keys become the index of the result (True, the default) or remain ordinary columns (False).
sort: Defines whether to sort the result by the group keys. With sort=False, groups appear in order of first appearance.
group_keys: When calling apply, indicates whether to add the group keys to the index of the result.
squeeze: Reduces the dimensionality of the return type if possible. Deprecated in pandas 1.1 and removed in pandas 2.0.
observed: Applies only when grouping by categorical columns. If True, only categories that actually appear in the data are shown; if False, all categories are shown, including unobserved ones.
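
The effect of as_index and sort is easier to see on a small example. The sketch below uses a hypothetical two-column table of students and scores; the variable names `indexed`, `flat`, and `unsorted_keys` are illustrative only.

```python
import pandas as pd

# Hypothetical sample data for this illustration.
df = pd.DataFrame({
    "Student": ["John", "Alice", "Bob", "John", "Bob", "Alice"],
    "Score": [80, 90, 70, 75, 95, 85],
})

# Defaults: the group keys become the (sorted) index of the result.
indexed = df.groupby("Student")["Score"].mean()

# as_index=False keeps 'Student' as an ordinary column of a DataFrame.
flat = df.groupby("Student", as_index=False)["Score"].mean()

# sort=False keeps the groups in order of first appearance.
unsorted_keys = df.groupby("Student", sort=False)["Score"].mean()

print(indexed)
print(flat)
print(unsorted_keys)
```

Using as_index=False is often convenient when the result will be merged or exported, since the keys stay available as regular columns.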

Before you start with GroupBy, you should be familiar with aggregation, that is, reducing a set of values to a single summary value such as a sum, mean, minimum, maximum, or count.

To demonstrate the practical usage of GroupBy, let’s consider a sample DataFrame containing information about students and their test scores:

import pandas as pd

data = {'Student': ['John', 'Alice', 'Bob', 'John', 'Bob', 'Alice'],
        'Subject': ['Math', 'Science', 'Science', 'Math', 'Science', 'Math'],
        'Score': [80, 90, 70, 75, 95, 85]}
df = pd.DataFrame(data)

The DataFrame df consists of three columns: ‘Student,’ ‘Subject,’ and ‘Score.’

2.1 GroupBy based on a Single Column:

We can group the data by a single column, such as ‘Student.’ To calculate the average score per student, we can use the following code:

student_group = df.groupby('Student')
average_score_per_student = student_group['Score'].mean()
print(average_score_per_student)

Output:

Student
Alice    87.5
Bob      82.5
John     77.5
Name: Score, dtype: float64

The output displays the average score for each student.

2.2 GroupBy based on Multiple Columns:

We can also group the data by multiple columns to obtain more specific insights. Let’s group by both ‘Student’ and ‘Subject’ to calculate the average score per student, per subject:

student_subject_group = df.groupby(['Student', 'Subject'])
average_score_per_student_subject = student_subject_group['Score'].mean()
print(average_score_per_student_subject)

Output:

Student  Subject
Alice    Math       85.0
         Science    90.0
Bob      Science    82.5
John     Math       77.5
Name: Score, dtype: float64

The output illustrates the average score for each student in every subject.
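
Grouping by two columns produces a result indexed by both keys. As a minimal sketch, the code below rebuilds the sample DataFrame, looks up a single (Student, Subject) group, and flattens the result back into ordinary columns; the names `means`, `bob_science`, and `flat` are chosen for illustration.

```python
import pandas as pd

data = {'Student': ['John', 'Alice', 'Bob', 'John', 'Bob', 'Alice'],
        'Subject': ['Math', 'Science', 'Science', 'Math', 'Science', 'Math'],
        'Score': [80, 90, 70, 75, 95, 85]}
df = pd.DataFrame(data)

means = df.groupby(['Student', 'Subject'])['Score'].mean()

# Look up one group by its (Student, Subject) key.
# Bob has two Science scores (70 and 95), so this is their mean.
bob_science = means.loc[('Bob', 'Science')]

# reset_index() turns the two index levels back into regular columns.
flat = means.reset_index()

print(bob_science)
print(flat)
```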

Using the previous DataFrame, solve the following problems:

Problem 1:

Group the data by the ‘Subject’ column and calculate the maximum score for each subject.

Solution:
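
One possible solution, sketched under the assumption that the sample DataFrame is rebuilt as df (the variable name `max_score_per_subject` is chosen for illustration):

```python
import pandas as pd

data = {'Student': ['John', 'Alice', 'Bob', 'John', 'Bob', 'Alice'],
        'Subject': ['Math', 'Science', 'Science', 'Math', 'Science', 'Math'],
        'Score': [80, 90, 70, 75, 95, 85]}
df = pd.DataFrame(data)

# Group the rows by subject, then take the largest score in each group.
max_score_per_subject = df.groupby('Subject')['Score'].max()
print(max_score_per_subject)
```

For this data, the highest Math score is 85 and the highest Science score is 95.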

Problem 2:

Group the data by the ‘Subject’ column and compute the minimum score for each subject using the ‘agg’ function.
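
A possible solution, again assuming the sample DataFrame is rebuilt as df; `min_score_per_subject` is an illustrative name. Passing the string 'min' to agg selects the built-in minimum aggregation.

```python
import pandas as pd

data = {'Student': ['John', 'Alice', 'Bob', 'John', 'Bob', 'Alice'],
        'Subject': ['Math', 'Science', 'Science', 'Math', 'Science', 'Math'],
        'Score': [80, 90, 70, 75, 95, 85]}
df = pd.DataFrame(data)

# agg accepts the name of an aggregation as a string.
min_score_per_subject = df.groupby('Subject')['Score'].agg('min')
print(min_score_per_subject)
```

For this data, the lowest Math score is 75 and the lowest Science score is 70.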

Problem 3:

Group the data by the ‘Subject’ and ‘Student’ columns, and determine the number of scores greater than 80 for each student in each subject.
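
A possible solution, assuming the sample DataFrame is rebuilt as df. The approach shown here, a custom function passed to agg, is only one of several options; filtering the rows first and then counting would also work, but it would omit groups with no qualifying scores. The name `count_above_80` is illustrative.

```python
import pandas as pd

data = {'Student': ['John', 'Alice', 'Bob', 'John', 'Bob', 'Alice'],
        'Subject': ['Math', 'Science', 'Science', 'Math', 'Science', 'Math'],
        'Score': [80, 90, 70, 75, 95, 85]}
df = pd.DataFrame(data)

# For each (Subject, Student) group, count the scores strictly above 80.
# (scores > 80) is a boolean Series; summing it counts the True values.
count_above_80 = (
    df.groupby(['Subject', 'Student'])['Score']
      .agg(lambda scores: int((scores > 80).sum()))
)
print(count_above_80)
```

Because the comparison is strict, John's score of exactly 80 is not counted, so his Math group has a count of 0.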

GroupBy in Pandas DataFrame is a powerful feature that facilitates data grouping and analysis based on one or more criteria. By leveraging GroupBy, you can efficiently summarize, transform, and gain insights from your data. This article provided a detailed explanation of GroupBy’s syntax, along with practical examples and corresponding outputs. Practice problems were also included to reinforce the concepts covered. Harness the power of GroupBy to unlock deeper insights from your data using Pandas.
