![](https://crypto4nerd.com/wp-content/uploads/2023/02/1xJWDbm8axZb-ZRXCil6OeA-1024x855.png)
Over the past few years, WhatsApp has become one of the most popular messaging platforms used by programming groups worldwide. With the rise of remote work and the need for effective collaboration, these groups use WhatsApp to stay connected, share ideas, and solve problems.
In this article, we’ll take a deep dive into the WhatsApp chats of our programming group from July 2020 to February 2023. Our aim is to conduct a comprehensive analysis of our conversations, identify trends, and draw insights into how we communicate and collaborate as a team.
You can join our group from here.
We’ll examine the frequency of our messages, the topics we discuss, the patterns in our communication, and more. So, let’s get started and discover what our WhatsApp chats can reveal about our programming group.
Get the csv datasets of our chats from here.
import pandas as pd
df=pd.read_csv("book1.csv")
Pandas is a Python library used for data manipulation and analysis. This line of code imports the Pandas library and assigns it the alias “pd”.
This line of code reads the group chat data from a CSV file named “book1.csv” and stores it in a Pandas DataFrame named “df”.
df=df[['Date', 'Time', 'Phone', 'messages']]
This line of code selects only the “Date”, “Time”, “Phone”, and “messages” columns from the DataFrame and stores them in the same DataFrame “df”.
df=df.drop(["Phone"],axis=1)
This line of code drops the “Phone” column from the DataFrame “df” since it is not required for analysis.
df.info()
This line of code provides basic information about the DataFrame “df”, including the number of rows, columns, and data type of each column.
df['month'] = df['Date'].str.extract(r'-(\d+)-')
df['year'] = df['Date'].str.split('-').str[-1]
These lines of code extract the “month” and “year” from the “Date” column of the DataFrame “df” using string manipulation.
df=df.drop(["Time"], axis=1)
This line of code drops the “Time” column from the DataFrame “df” since it is not required for analysis.
df["month"] = df["month"].fillna(0).astype(int)
df["year"] = pd.to_numeric(df["year"], errors='coerce').fillna(0).astype(int)
These lines of code convert the “month” and “year” columns from string to integer data type and replace any missing or invalid values with 0.
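The same month/year extraction can also be done by parsing the full date with pd.to_datetime, which handles bad rows in one step. This is a minimal sketch on hypothetical data, assuming the "Date" column holds day-month-year strings like "12-07-2020":

```python
import pandas as pd

# Hypothetical sample resembling the "Date" column of a chat export
df = pd.DataFrame({"Date": ["12-07-2020", "03-01-2021", "bad date"]})

# errors='coerce' turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(df["Date"], format="%d-%m-%Y", errors="coerce")

# Extract month/year as integers, using 0 for rows that failed to parse
df["month"] = parsed.dt.month.fillna(0).astype(int)
df["year"] = parsed.dt.year.fillna(0).astype(int)
print(df)
```

This avoids the separate regex extraction, string splitting, and repeated type conversions above.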
df.groupby(["year"]).count()
This line of code groups the DataFrame “df” by “year” and counts the number of messages for each year.
import matplotlib.pyplot as plt
This line of code imports the Matplotlib library, which is used for data visualization.
df['date'] = pd.to_datetime(df['month'].astype(str) + df['year'].astype(str), format='%m%Y', errors='coerce')
df = df[(df['year'] != 1900)& (df['year'] != 1901)& (df['year'] != 1905)& (df['year'] != 1930)]
df.dropna(inplace=True)
These lines of code combine the “month” and “year” columns into a new “date” column, drop any rows with missing or invalid values, and convert the “date” column to datetime format.
df['date'] = pd.to_datetime(df['date'])
This code converts the ‘date’ column of the DataFrame to the datetime format using the pandas.to_datetime() function. This is necessary to be able to group the data by month later in the code.
msg_count = df.groupby(df['date'].dt.to_period('M'))['messages'].count()
This code groups the DataFrame by month using the groupby() method, and then counts the number of messages in each group using the count() method. The dt.to_period(‘M’) method is used to convert the datetime format of the ‘date’ column to monthly periods. The resulting msg_count variable is a pandas Series object.
msg_count.plot(kind='line', xlabel='Month', ylabel='Message Volume', title='Message Volume by Month')
plt.show()
This code creates a line chart of the message volume by month using the plot() method of the msg_count Series object. The kind parameter is set to ‘line’ to create a line chart, and the xlabel, ylabel, and title parameters are used to label the chart. Finally, plt.show() is called to display the chart.
msg_count.plot(kind='bar', xlabel='Month', ylabel='Message Volume', title='Message Volume by Month')
plt.show()
This code creates a bar chart of the message volume by month using the plot() method of the msg_count Series object. The kind parameter is set to ‘bar’ to create a bar chart, and the xlabel, ylabel, and title parameters are used to label the chart. Finally, plt.show() is called to display the chart.
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
def get_sentiment_score(message):
    return sia.polarity_scores(message)['compound']
df['sentiment_score'] = df['messages'].apply(get_sentiment_score)
This code downloads the VADER lexicon from the nltk package, which is used for sentiment analysis. It defines a function called get_sentiment_score() which takes a message as input, computes its sentiment score using the SentimentIntensityAnalyzer object from the nltk.sentiment.vader module, and returns the compound score. Finally, it applies this function to each message in the ‘messages’ column of the DataFrame using the apply() method and adds the resulting scores as a new column called ‘sentiment_score’.
words = ["solve", "class 12th", "computer science", "class 11th", "sumita arora", "preeti arora", "CBSE", "python", "data science", "internships", "important questions", "board exams", "career"]

def count_words(message):
    return sum(message.lower().count(word.lower()) for word in words)
df["word_counts"] = df["messages"].apply(count_words)
grouped = df.groupby(["year", "month"])["word_counts"].sum()
words_series = df['messages'].str.split(expand=True).stack()
word_counts = words_series.value_counts()
common_words = word_counts[word_counts > 100].index.tolist()
year = 2021
df_year = df[df["year"] == year].copy()
word_list = ["solve", "class 12th", "computer science", "class 11th", "sumita arora", "preeti arora", "CBSE", "python", "data science", "internships", "important questions", "board exams", "career"]
for word in word_list:
    for month in range(1, 13):
        df_year.loc[df_year["month"] == month, word] = df_year[df_year["month"] == month]["messages"].str.count(word)
df_year[word_list].sum().plot(kind='bar')
plt.title(f"Frequency of Words in {year}")
plt.xlabel("Words")
plt.ylabel("Frequency")
plt.show()
This code performs data analysis on a dataset that contains messages.
The first part of the code defines a list of words to search for in the messages column of the DataFrame. It then defines a function count_words
that takes a message as input and returns the number of occurrences of the words in the words
list in the message.
The code then adds a new column to the DataFrame called word_counts
using the apply()
method, which applies the count_words
function to each message in the DataFrame. It then groups the DataFrame by year and month and calculates the sum of the word_counts
for each group. The resulting grouped data is stored in the grouped
variable and printed.
In the next section, the code creates a series of all words in the messages
column by splitting each message into individual words and stacking them into a single column. It then counts the frequency of each word using the value_counts()
method and creates a list of words that occur more than 100 times called common_words
.
The final section of the code sets the year to analyze, filters the DataFrame to include only rows for that year, and creates a list of words to analyze. It then loops through each word in the list and each month of the year, creating a new column for each word that counts the number of times it appears in the messages
column for that month. Finally, it creates a bar chart showing the frequency of each word for each month of the given year. The resulting plot is displayed using the plt.show()
function.
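The per-month loop above can also be written without mutating the filtered frame, using a vectorized str.count per keyword. This is a sketch on a tiny made-up frame (column names follow the article; the sample messages are invented):

```python
import pandas as pd

# Toy stand-in for the chat DataFrame
df = pd.DataFrame({
    "month": [1, 1, 2],
    "messages": ["please solve this python doubt",
                 "python python everywhere",
                 "board exams soon"],
})

word_list = ["solve", "python", "board exams"]

# One column of counts per keyword; str.count is case-sensitive here
counts = pd.DataFrame({w: df["messages"].str.count(w) for w in word_list})
totals = counts.sum()
print(totals)
```

Plotting totals with totals.plot(kind='bar') gives the same kind of chart as the loop-based version.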
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# clean and preprocess the text data
stop_words = set(stopwords.words('english'))
df['clean_messages'] = df['messages'].apply(lambda x: ' '.join([word.lower() for word in x.split() if word.lower() not in stop_words]))
# create the document-term matrix using TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.9, min_df=2, use_idf=True)
tfidf_matrix = tfidf_vectorizer.fit_transform(df['clean_messages'])
# extract topics using LDA
lda_model = LatentDirichletAllocation(n_components=5, random_state=42)
lda_model.fit(tfidf_matrix)
topics = lda_model.transform(tfidf_matrix)
df['topic'] = topics.argmax(axis=1)
# print the top words for each topic
for topic_idx, topic in enumerate(lda_model.components_):
    print(f'Top words for topic {topic_idx}:')
    print([tfidf_vectorizer.get_feature_names_out()[i] for i in topic.argsort()[:-6:-1]])
This code cleans and preprocesses the text data using NLTK and performs topic modeling using LDA. Specifically, it performs the following steps:
- Importing necessary libraries including pandas, nltk, and sklearn.
- Cleaning the text data by removing stop words and converting all words to lowercase.
- Creating a document-term matrix using TfidfVectorizer.
- Extracting topics using Latent Dirichlet Allocation (LDA) algorithm.
- Assigning topics to each message and storing them in the df dataframe.
- Printing the top words for each topic.
df['date'] = pd.to_datetime(df['date'])
df.groupby([df['date'].dt.year, df['date'].dt.month, 'topic'])['messages'].count().unstack().plot(kind='bar', stacked=True)
This code plots the frequency of each topic over time. Specifically, it performs the following steps:
- Converting the date column in the df dataframe to a datetime format.
- Grouping the data by year, month, and topic and counting the number of messages for each group.
- Unstacking the dataframe to create a stacked bar plot where each topic is represented by a different color.
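The groupby/unstack step can be illustrated on a toy frame; each cell of the unstacked table is the message count for one (year, month, topic) combination. The data here is invented, and the year/month Series are renamed only so the index levels have distinct labels:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-05", "2021-01-20", "2021-02-10"]),
    "topic": [0, 1, 0],
    "messages": ["hi", "hello", "hey"],
})

# Rows become (year, month) pairs, columns become topics
table = (df.groupby([df["date"].dt.year.rename("year"),
                     df["date"].dt.month.rename("month"), "topic"])
           ["messages"].count().unstack())
print(table)
```

Calling table.plot(kind='bar', stacked=True) on this produces the stacked bar chart described above.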
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Convert the date column to a datetime object
df['date'] = pd.to_datetime(df['date'])
# Group the data by month and count the number of messages in each group
message_count = df.groupby(pd.Grouper(key='date', freq='M')).size()
# Visualize the message volume by month using a line plot
message_count.plot(kind='line', figsize=(10, 6))
plt.title('Message Volume by Month')
plt.xlabel('Month')
plt.ylabel('Message Count')
plt.show()
This code performs data analysis and visualization on a dataset that includes a “date” column and a “messages” column. Here is a breakdown of the code:
- Importing the necessary libraries: NumPy, scikit-learn’s LinearRegression, train_test_split, and r2_score functions.
- Convert the “date” column in the dataframe df to a datetime object using pd.to_datetime() method.
- Group the data by month and count the number of messages in each group using the groupby() method with pd.Grouper() method to group the data by the month frequency.
- Visualize the message volume by month using a line plot with the plot() method.
- Set the plot’s title, x and y axis labels, and figure size using the appropriate methods.
- Display the plot using the show() method.
The resulting plot displays the message count on the y-axis and months on the x-axis, providing a visual representation of the message volume over time. This plot allows the viewer to quickly identify any patterns or trends in the data.
df['date'] = pd.to_datetime(df['date'])
message_count = df.groupby([df['date'].dt.year, df['date'].dt.month])['messages'].count()
message_df = pd.DataFrame({'year': message_count.index.get_level_values(0),
'month': message_count.index.get_level_values(1),
'message_count': message_count.values})
print(message_df)
X = message_df[['year', 'month']]
y = message_df['message_count']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
- df['date'] = pd.to_datetime(df['date']) – converts the “date” column of the DataFrame df to a datetime object.
- message_count = df.groupby([df['date'].dt.year, df['date'].dt.month])['messages'].count() – groups the DataFrame by year and month and counts the number of messages for each group. The result is stored in a variable called message_count.
- message_df = pd.DataFrame({'year': message_count.index.get_level_values(0), 'month': message_count.index.get_level_values(1), 'message_count': message_count.values}) – creates a new DataFrame called message_df from the message_count data, with three columns: year, month, and message_count.
- X = message_df[['year', 'month']] – creates a new DataFrame called X containing the year and month columns from message_df.
- y = message_df['message_count'] – creates a new Series called y containing the message_count column from message_df.
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) – splits the data into training and test sets using the train_test_split function from scikit-learn. The training set contains 80% of the data, and the test set contains 20%.
- model = LinearRegression() – creates a new LinearRegression object called model.
- model.fit(X_train, y_train) – fits the model to the training data using the fit method.
- y_pred = model.predict(X_test) – uses the fitted model to make predictions on the test data using the predict method. The predictions are stored in a variable called y_pred.
r2 = r2_score(y_test, y_pred)
print('R2 score:', r2)
future_data = pd.DataFrame({'year': [2024, 2024, 2025], 'month': [1, 2, 3]})
future_predictions = model.predict(future_data)
print('Future predictions:', future_predictions)
- The r2_score function from the sklearn.metrics module calculates the R2 score, or coefficient of determination, a statistical measure of how well the regression model fits the actual data.
- Here, r2_score is applied to the test set (y_test and y_pred) to calculate the R2 score, and the result is printed.
- The pd.DataFrame() method creates a new dataframe called future_data containing three future dates (January 2024, February 2024, and March 2025).
- The model.predict() method is applied to future_data to predict the message count for those future dates, and the result is stored in the future_predictions variable.
- Finally, the future_predictions value is printed to the console.
The output of this code shows an R2 score of only 49%, which indicates that the model fits the data only moderately well. Also, since the dataset is too small to support accurate extrapolation, the future_predictions values come out negative.
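One reason the extrapolation misbehaves is that year and month enter the model as two separate linear features. A common fix (not from the original article, just a sketch on synthetic counts) is to collapse them into a single monotonically increasing "months since start" feature:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly message counts with a mild upward trend
rng = np.random.default_rng(42)
months = np.arange(24)                      # months since the start of the chat
counts = 50 + 5 * months + rng.normal(0, 10, size=24)

model = LinearRegression()
model.fit(months.reshape(-1, 1), counts)    # one feature: elapsed months

# Predict three future months on the same scale
future = np.array([[24], [25], [26]])
pred = model.predict(future)
print(pred)
```

With a single time index the fitted trend extends smoothly past the training range instead of producing negative counts from mismatched year/month coefficients.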
Download the complete code from here
Thank you for reading!!
Subscribe to my other newsletters (they’re FREE):
Data Science
Growth Mindset
To know about me more click here