![](https://crypto4nerd.com/wp-content/uploads/2023/02/1xJWDbm8axZb-ZRXCil6OeA-1024x855.png)
Over the past few years, WhatsApp has become one of the most popular messaging platforms used by programming groups worldwide. With the rise of remote work and the need for effective collaboration, these groups use WhatsApp to stay connected, share ideas, and solve problems.
In this article, we’ll take a deep dive into the WhatsApp chats of our programming group from July 2020 to February 2023. Our aim is to conduct a comprehensive analysis of our conversations, identify trends, and draw insights into how we communicate and collaborate as a team.
You can join our group from here.
We’ll examine the frequency of our messages, the topics we discuss, the patterns in our communication, and more. So, let’s get started and discover what our WhatsApp chats can reveal about our programming group.
Get the csv datasets of our chats from here.
import pandas as pd
df=pd.read_csv("book1.csv")
Pandas is a Python library used for data manipulation and analysis. This line of code imports the Pandas library and assigns it the alias “pd”.
This line of code reads the group chat data from a CSV file named “book1.csv” and stores it in a Pandas DataFrame named “df”.
df=df[['Date', 'Time', 'Phone', 'messages']]
This line of code selects only the “Date”, “Time”, “Phone”, and “messages” columns from the DataFrame and stores them in the same DataFrame “df”.
df=df.drop(["Phone"],axis=1)
This line of code drops the “Phone” column from the DataFrame “df” since it is not required for analysis.
df.info()
This line of code provides basic information about the DataFrame “df”, including the number of rows, columns, and data type of each column.
df['month'] = df['Date'].str.extract(r'-(\d+)-')
df['year'] = df['Date'].str.split('-').str[-1]
These lines of code extract the “month” and “year” from the “Date” column of the DataFrame “df” using string manipulation.
df=df.drop(["Time"], axis=1)
This line of code drops the “Time” column from the DataFrame “df” since it is not required for analysis.
df["month"] = df["month"].fillna(0).astype(int)
df["year"] = pd.to_numeric(df["year"], errors='coerce').fillna(0).astype(int)
These lines of code convert the “month” and “year” columns from string to integer data type and replace any missing or invalid values with 0.
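The same month/year extraction can also be done by parsing the full date with pd.to_datetime, which handles bad rows in one step. This is a minimal sketch on hypothetical data, assuming the "Date" column holds day-month-year strings like "12-07-2020":

```python
import pandas as pd

# Hypothetical sample resembling the "Date" column of a chat export
df = pd.DataFrame({"Date": ["12-07-2020", "03-01-2021", "bad date"]})

# errors='coerce' turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(df["Date"], format="%d-%m-%Y", errors="coerce")

# Extract month/year as integers, using 0 for rows that failed to parse
df["month"] = parsed.dt.month.fillna(0).astype(int)
df["year"] = parsed.dt.year.fillna(0).astype(int)
print(df)
```

This avoids the separate regex extraction, string splitting, and repeated type conversions above.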
df.groupby(["year"]).count()
This line of code groups the DataFrame “df” by “year” and counts the number of messages for each year.
import matplotlib.pyplot as plt
This line of code imports the Matplotlib library, which is used for data visualization.
df['date'] = pd.to_datetime(df['month'].astype(str) + df['year'].astype(str), format='%m%Y', errors='coerce')
df = df[(df['year'] != 1900)& (df['year'] != 1901)& (df['year'] != 1905)& (df['year'] != 1930)]
df.dropna(inplace=True)
These lines of code combine the “month” and “year” columns into a new “date” column, drop any rows with missing or invalid values, and convert the “date” column to datetime format.
df['date'] = pd.to_datetime(df['date'])
This code converts the ‘date’ column of the DataFrame to the datetime format using the pandas.to_datetime() function. This is necessary to be able to group the data by month later in the code.
msg_count = df.groupby(df['date'].dt.to_period('M'))['messages'].count()
This code groups the DataFrame by month using the groupby() method, and then counts the number of messages in each group using the count() method. The dt.to_period(‘M’) method is used to convert the datetime format of the ‘date’ column to monthly periods. The resulting msg_count variable is a pandas Series object.
msg_count.plot(kind='line', xlabel='Month', ylabel='Message Volume', title='Message Volume by Month')
plt.show()
This code creates a line chart of the message volume by month using the plot() method of the msg_count Series object. The kind parameter is set to ‘line’ to create a line chart, and the xlabel, ylabel, and title parameters are used to label the chart. Finally, plt.show() is called to display the chart.
msg_count.plot(kind='bar', xlabel='Month', ylabel='Message Volume', title='Message Volume by Month')
plt.show()
This code creates a bar chart of the message volume by month using the plot() method of the msg_count Series object. The kind parameter is set to ‘bar’ to create a bar chart, and the xlabel, ylabel, and title parameters are used to label the chart. Finally, plt.show() is called to display the chart.
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
def get_sentiment_score(message):
    return sia.polarity_scores(message)['compound']
df['sentiment_score'] = df['messages'].apply(get_sentiment_score)
This code downloads the VADER lexicon from the nltk package, which is used for sentiment analysis. It defines a function called get_sentiment_score() which takes a message as input, computes its sentiment score using the SentimentIntensityAnalyzer object from the nltk.sentiment.vader module, and returns the compound score. Finally, it applies this function to each message in the ‘messages’ column of the DataFrame using the apply() method and adds the resulting scores as a new column called ‘sentiment_score’.
words = ["solve", "class 12th", "computer science", "class 11th", "sumita arora", "preeti arora", "CBSE", "python", "data science", "internships", "important questions", "board exams", "career"]

def count_words(message):
    return sum(message.lower().count(word.lower()) for word in words)
df["word_counts"] = df["messages"].apply(count_words)
grouped = df.groupby(["year", "month"])["word_counts"].sum()
words_series = df['messages'].str.split(expand=True).stack()
word_counts = words_series.value_counts()
common_words = word_counts[word_counts > 100].index.tolist()
year = 2021
df_year = df[df["year"] == year].copy()
word_list = ["solve", "class 12th", "computer science", "class 11th", "sumita arora", "preeti arora", "CBSE", "python", "data science", "internships", "important questions", "board exams", "career"]
for word in word_list:
    for month in range(1, 13):
        df_year.loc[df_year["month"] == month, word] = df_year[df_year["month"] == month]["messages"].str.count(word)
df_year[word_list].sum().plot(kind='bar')
plt.title(f"Frequency of Words in {year}")
plt.xlabel("Words")
plt.ylabel("Frequency")
plt.show()
This code performs data analysis on a dataset that contains messages.
The first part of the code defines a list of words to search for in the messages column of the DataFrame. It then defines a function count_words
that takes a message as input and returns the number of occurrences of the words in the words
list in the message.
The code then adds a new column to the DataFrame called word_counts
using the apply()
method, which applies the count_words
function to each message in the DataFrame. It then groups the DataFrame by year and month and calculates the sum of the word_counts
for each group. The resulting grouped data is stored in the grouped
variable and printed.
In the next section, the code creates a series of all words in the messages
column by splitting each message into individual words and stacking them into a single column. It then counts the frequency of each word using the value_counts()
method and creates a list of words that occur more than 100 times called common_words
.
The final section of the code sets the year to analyze, filters the DataFrame to include only rows for that year, and creates a list of words to analyze. It then loops through each word in the list and each month of the year, creating a new column for each word that counts the number of times it appears in the messages
column for that month. Finally, it creates a bar chart showing the frequency of each word for each month of the given year. The resulting plot is displayed using the plt.show()
function.
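The per-month loop above can also be written without mutating the filtered frame, using a vectorized str.count per keyword. This is a sketch on a tiny made-up frame (column names follow the article; the sample messages are invented):

```python
import pandas as pd

# Toy stand-in for the chat DataFrame
df = pd.DataFrame({
    "month": [1, 1, 2],
    "messages": ["please solve this python doubt",
                 "python python everywhere",
                 "board exams soon"],
})

word_list = ["solve", "python", "board exams"]

# One column of counts per keyword; str.count is case-sensitive here
counts = pd.DataFrame({w: df["messages"].str.count(w) for w in word_list})
totals = counts.sum()
print(totals)
```

Plotting totals with totals.plot(kind='bar') gives the same kind of chart as the loop-based version.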
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# clean and preprocess the text data
stop_words = set(stopwords.words('english'))
df['clean_messages'] = df['messages'].apply(lambda x: ' '.join([word.lower() for word in x.split() if word.lower() not in stop_words]))
# create the document-term matrix using TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.9, min_df=2, use_idf=True)
tfidf_matrix = tfidf_vectorizer.fit_transform(df['clean_messages'])
# extract topics using LDA
lda_model = LatentDirichletAllocation(n_components=5, random_state=42)
lda_model.fit(tfidf_matrix)
topics = lda_model.transform(tfidf_matrix)
df['topic'] = topics.argmax(axis=1)
# print the top words for each topic
for topic_idx, topic in enumerate(lda_model.components_):
    print(f'Top words for topic {topic_idx}:')
    print([tfidf_vectorizer.get_feature_names_out()[i] for i in topic.argsort()[:-6:-1]])
This code cleans and preprocesses the text data using NLTK and performs topic modeling using LDA. Specifically, it performs the following steps:
- Importing necessary libraries including pandas, nltk, and sklearn.
- Cleaning the text data by removing stop words and converting all words to lowercase.
- Creating a document-term matrix using TfidfVectorizer.
- Extracting topics using Latent Dirichlet Allocation (LDA) algorithm.
- Assigning topics to each message and storing them in the df dataframe.
- Printing the top words for each topic.
df['date'] = pd.to_datetime(df['date'])
df.groupby([df['date'].dt.year, df['date'].dt.month, 'topic'])['messages'].count().unstack().plot(kind='bar', stacked=True)
This code plots the frequency of each topic over time. Specifically, it performs the following steps:
- Converting the date column in the df dataframe to a datetime format.
- Grouping the data by year, month, and topic and counting the number of messages for each group.
- Unstacking the dataframe to create a stacked bar plot where each topic is represented by a different color.
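The groupby/unstack step can be illustrated on a toy frame; each cell of the unstacked table is the message count for one (year, month, topic) combination. The data here is invented, and the year/month Series are renamed only so the index levels have distinct labels:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-05", "2021-01-20", "2021-02-10"]),
    "topic": [0, 1, 0],
    "messages": ["hi", "hello", "hey"],
})

# Rows become (year, month) pairs, columns become topics
table = (df.groupby([df["date"].dt.year.rename("year"),
                     df["date"].dt.month.rename("month"), "topic"])
           ["messages"].count().unstack())
print(table)
```

Calling table.plot(kind='bar', stacked=True) on this produces the stacked bar chart described above.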
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Convert the date column to a datetime object
df['date'] = pd.to_datetime(df['date'])
# Group the data by month and count the number of messages in each group
message_count = df.groupby(pd.Grouper(key='date', freq='M')).size()
# Visualize the message volume by month using a line plot
message_count.plot(kind='line', figsize=(10, 6))
plt.title('Message Volume by Month')
plt.xlabel('Month')
plt.ylabel('Message Count')
plt.show()
This code performs data analysis and visualization on a dataset that includes a “date” column and a “messages” column. Here is a breakdown of the code:
- Importing the necessary libraries: NumPy, scikit-learn’s LinearRegression, train_test_split, and r2_score functions.
- Convert the “date” column in the dataframe df to a datetime object using pd.to_datetime() method.
- Group the data by month and count the number of messages in each group using the groupby() method with pd.Grouper() method to group the data by the month frequency.
- Visualize the message volume by month using a line plot with the plot() method.
- Set the plot’s title, x and y axis labels, and figure size using the appropriate methods.
- Display the plot using the show() method.
The resulting plot displays the message count on the y-axis and months on the x-axis, providing a visual representation of the message volume over time. This plot allows the viewer to quickly identify any patterns or trends in the data.
df['date'] = pd.to_datetime(df['date'])
message_count = df.groupby([df['date'].dt.year, df['date'].dt.month])['messages'].count()
message_df = pd.DataFrame({'year': message_count.index.get_level_values(0),
'month': message_count.index.get_level_values(1),
'message_count': message_count.values})
print(message_df)
X = message_df[['year', 'month']]
y = message_df['message_count']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
- df['date'] = pd.to_datetime(df['date']) – converts the “date” column of the DataFrame df to a datetime object.
- message_count = df.groupby([df['date'].dt.year, df['date'].dt.month])['messages'].count() – groups the DataFrame by year and month and counts the number of messages for each group. The result is stored in a variable called message_count.
- message_df = pd.DataFrame({'year': message_count.index.get_level_values(0), 'month': message_count.index.get_level_values(1), 'message_count': message_count.values}) – creates a new DataFrame called message_df from the message_count data, with three columns: year, month, and message_count.
- X = message_df[['year', 'month']] – creates a new DataFrame called X containing the year and month columns from message_df.
- y = message_df['message_count'] – creates a new Series called y containing the message_count column from message_df.
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) – splits the data into training and test sets using the train_test_split function from scikit-learn. The training set contains 80% of the data, and the test set contains 20%.
- model = LinearRegression() – creates a new LinearRegression object called model.
- model.fit(X_train, y_train) – fits the model to the training data using the fit method.
- y_pred = model.predict(X_test) – uses the fitted model to make predictions on the test data using the predict method. The predictions are stored in a variable called y_pred.
r2 = r2_score(y_test, y_pred)
print('R2 score:', r2)
future_data = pd.DataFrame({'year': [2024, 2024, 2025], 'month': [1, 2, 3]})
future_predictions = model.predict(future_data)
print('Future predictions:', future_predictions)
- The r2_score function from the sklearn.metrics module calculates the R2 score, or coefficient of determination, a statistical measure of how well the regression model fits the actual data.
- Here, r2_score is applied to the test set (y_test and y_pred) to calculate the R2 score, and the result is printed.
- The pd.DataFrame() method creates a new dataframe called future_data containing three future dates (January 2024, February 2024, and March 2025).
- The model.predict() method is applied to future_data to predict the message count for those future dates, and the result is stored in the future_predictions variable.
- Finally, the future_predictions value is printed to the console.
The output of this code shows an R2 score of only 49%, which indicates that the model fits the data only moderately well. Also, since the dataset is too small to support accurate extrapolation, the future_predictions values come out negative.
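One reason the extrapolation misbehaves is that year and month enter the model as two separate linear features. A common fix (not from the original article, just a sketch on synthetic counts) is to collapse them into a single monotonically increasing "months since start" feature:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly message counts with a mild upward trend
rng = np.random.default_rng(42)
months = np.arange(24)                      # months since the start of the chat
counts = 50 + 5 * months + rng.normal(0, 10, size=24)

model = LinearRegression()
model.fit(months.reshape(-1, 1), counts)    # one feature: elapsed months

# Predict three future months on the same scale
future = np.array([[24], [25], [26]])
pred = model.predict(future)
print(pred)
```

With a single time index the fitted trend extends smoothly past the training range instead of producing negative counts from mismatched year/month coefficients.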
Download the complete code from here
Thank you for reading!!
Subscribe to my other newsletters (they’re FREE):
Data Science
Growth Mindset
To know about me more click here