
Do you play Mobile Legends Bang Bang on a smartphone?
In my country, Indonesia, it is one of the most popular mobile games. In this article, I run an exploratory data analysis and a sentiment analysis on Mobile Legends Bang Bang reviews from the Google Play Store, specifically those from Indonesia.
For full code in Python: You can visit here.
Table of contents:
- Retrieving Data from the Google Play Store
- Exploratory Data Analysis
- Preprocessing
- Modeling
- References
To get the reviews from the Google Play Store, I used the google-play-scraper library. I took the 10,000 newest reviews from Indonesia, written in Indonesian (Bahasa Indonesia).
from google_play_scraper import Sort, reviews
import pandas as pd

def get_reviews(app_id, number_of_reviews=200):
    # reviews() returns the list of reviews and a continuation token
    review_, _ = reviews(
        app_id,
        lang='id',  # Indonesian language
        country='id',  # Indonesia
        sort=Sort.NEWEST,  # take the newest reviews
        count=number_of_reviews
    )
    # Keep only the review text and the star rating
    review_ = pd.DataFrame(review_)
    review_ = review_[['content', 'score']]
    return review_

mobilelegends_reviews = get_reviews(app_id='com.mobile.legends', number_of_reviews=10000)
mobilelegends_reviews.sample(5)
The data contains the 10,000 newest reviews on the Google Play Store, each with content (the review text) and score (the user's rating).
For the exploratory analysis I will show two things: a count plot from Seaborn to see the distribution of the scores, and word clouds.
Score Distribution
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(6,5))
sns.countplot(data=mobilelegends_reviews, x='score')
A score of 5 is the most common among recent reviews, followed by scores of 1, 4, 3, and 2.
Wordcloud
I will make a separate word cloud for positive, neutral, and negative ratings, grouped as follows:
- positive rating = score of 4 and 5
- neutral = score of 3
- negative rating = score of 1 and 2
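# Assumption: `data` is a copy of the scraped reviews, with the rating in 'score'
data = mobilelegends_reviews.copy()
# Map the score to a label: 2 = positive, 1 = neutral, 0 = negative
data['sentiment'] = data['score'].apply(lambda score: 2 if score > 3 else (1 if score == 3 else 0))
# Split into positive, negative and neutral reviews
positive = data[data['sentiment'] == 2]
negative = data[data['sentiment'] == 0]
neutral = data[data['sentiment'] == 1]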
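A word cloud sizes each word by how often it appears, so the counts behind each cloud can also be inspected directly. Here is a minimal sketch using only the standard library. The review strings and the `word_frequencies` helper are made up for illustration and are not part of the original notebook.

```python
import re
from collections import Counter

# Made-up reviews for illustration only; not taken from the scraped dataset.
sample_reviews = [
    "game bagus dan seru",
    "seru banget game ini",
    "bagus tapi jaringan sering lag",
]

def word_frequencies(texts):
    """Count lowercase alphanumeric tokens across a list of strings."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts

freqs = word_frequencies(sample_reviews)
# The most frequent words are the ones a word cloud would draw largest.
print(freqs.most_common(3))
```

Running the same helper on the content column of each sentiment group would give the counts behind the three clouds.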
Positive Rating Wordcloud
This word cloud is built from the reviews with scores of 4 and 5, which mostly say the game is good, cool, and exciting.
Negative Rating Wordcloud
This word cloud is built from the reviews with scores of 1 and 2, which mostly talk about the ‘dark system’, the team, and losing. The word ‘good’ appears even among these low scores. ‘Dark system’ is a term players use when they keep getting a bad team and lose, or keep facing a strong opposing team (experienced or pro players) and lose.
Neutral Rating Wordcloud
This word cloud is built from the reviews with a score of 3, which mostly mention good, network, rank, and team. The ‘dark system’ shows up here too. ‘Network’ probably means that the game needs a stable Wi-Fi or cellular connection, and that it lags or crashes without one. In Indonesia, good Wi-Fi or cellular coverage is mostly found in cities, while in villages it is often weaker, and this is a multiplayer mobile game with heavy graphics.
To build a model from the reviews, the first step is to vectorize every review in the dataset, which means converting the collection of text documents into a matrix of token counts.
To do this, we can use the CountVectorizer class from the scikit-learn library. Then we define the feature matrix X and the target y to fit the machine learning models.
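from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import CountVectorizer

# Tokenizer that keeps only runs of a-z, A-Z, 0-9
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
# Convert the collection of reviews to a matrix of token counts
# (stop_words='english' removes English stop words only)
cv = CountVectorizer(stop_words='english', ngram_range=(1, 1), tokenizer=token.tokenize)
X = cv.fit_transform(data['content'])
y = data['sentiment']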
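To make the idea of a token-count matrix concrete, here is a small pure-Python sketch of the same transformation. It is a simplified illustration, not scikit-learn's implementation, and the documents and helper names (`tokenize`, `count_matrix`) are made up.

```python
import re

# Made-up documents for illustration; not from the review dataset.
docs = ["game bagus", "game lag lag", "bagus sekali"]

def tokenize(text):
    """Split lowercase text into alphanumeric tokens."""
    return re.findall(r"[a-zA-Z0-9]+", text.lower())

# Vocabulary: every distinct token, sorted, mapped to a column index.
all_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}

def count_matrix(documents, vocab):
    """Return one row per document, one column per vocabulary token."""
    rows = []
    for doc in documents:
        row = [0] * len(vocab)
        for tok in tokenize(doc):
            if tok in vocab:  # tokens outside the vocabulary are ignored
                row[vocab[tok]] += 1
        rows.append(row)
    return rows

matrix = count_matrix(docs, vocabulary)
print(vocabulary)
print(matrix)
```

Each row is one review and each column is one word, which is the shape the classifiers receive as X. The real vectorizer stores this as a sparse matrix because most entries are zero.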
Because the classes of the target variable are imbalanced, I will use an oversampling method from the imbalanced-learn (imblearn) library called SMOTE. It generates synthetic samples for the minority classes (in our case the neutral and negative ratings). After that, we split the data into training and test sets as usual.
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Instantiate the SMOTE object
smote = SMOTE()
# Perform oversampling
X_oversampled, y_oversampled = smote.fit_resample(X, y)
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_oversampled,
    y_oversampled,
    test_size=0.15,
    random_state=17,
    stratify=y_oversampled
)
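The core idea behind SMOTE is that a synthetic sample is placed at a random position on the line between a minority-class point and one of its nearest same-class neighbors. Below is a minimal sketch of that single interpolation step. It is a simplified illustration rather than imblearn's implementation, and the vectors and the `synthetic_sample` helper are made up.

```python
import random

def synthetic_sample(point, neighbor, rng):
    """Return a point on the segment between `point` and `neighbor`."""
    gap = rng.random()  # random factor in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(17)
minority_point = [1.0, 0.0, 2.0]     # made-up token counts of a minority review
nearest_neighbor = [3.0, 0.0, 1.0]   # assumed to be a same-class nearest neighbor
new_point = synthetic_sample(minority_point, nearest_neighbor, rng)
print(new_point)
```

Because each synthetic point is built from real ones, resampling before the train/test split can put close relatives of training reviews into the test set. Applying the oversampler to the training portion only is a common way to avoid that.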
For this case, I will use three machine learning models. Because the task is to classify the sentiment, all three are classification models: Logistic Regression, Random Forest, and Extreme Gradient Boosting (XGBoost):
Logistic Regression
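from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Print per-class precision, recall and F1 (true labels go first)
print(classification_report(y_test, y_pred))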
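scikit-learn's `classification_report` takes the true labels as its first argument and the predictions as its second. Swapping them exchanges precision and recall for every class, while accuracy stays the same. A small sketch with made-up labels and a hypothetical `precision_recall` helper shows the effect:

```python
def precision_recall(y_true, y_pred, label):
    """Compute precision and recall for one class from two label lists."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Made-up labels: 0 = negative, 1 = neutral, 2 = positive
y_true = [2, 2, 2, 0, 1, 0]
y_pred = [2, 2, 0, 0, 2, 2]

correct_order = precision_recall(y_true, y_pred, label=2)
swapped_order = precision_recall(y_pred, y_true, label=2)
print(correct_order)
print(swapped_order)
```

The two results contain the same pair of numbers in opposite positions, so the argument order matters whenever precision and recall are compared between models.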
Random Forest
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Print per-class precision, recall and F1 (true labels go first)
print("Random Forest Model")
print(classification_report(y_test, y_pred))
Extreme Gradient Boosting (XGBoost)
To use XGBoost's native training API, the data first needs to be converted to a DMatrix, the data structure used by the XGBoost library.
import xgboost as xgb

# Create DMatrix for training and testing data
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Set the parameters for XGBoost
params = {
    'objective': 'multi:softmax',  # Objective function for multi-class classification
    'num_class': 3,  # Number of classes in the dataset
    'eval_metric': 'merror',  # Evaluation metric (multi-class classification error rate)
    'eta': 0.4,  # Learning rate
    'max_depth': 6,  # Maximum depth of a tree
    'subsample': 0.8,  # Subsample ratio of the training instances
    'colsample_bytree': 0.8,  # Subsample ratio of features when constructing each tree
    'seed': 42  # Random seed for reproducibility
}

# Train the XGBoost model
num_rounds = 100  # Number of boosting rounds
model = xgb.train(params, dtrain, num_rounds)

# Make predictions on the testing data (class indices returned as floats)
preds = model.predict(dtest)
pred_labels = [int(pred) for pred in preds]
print("Extreme Gradient Boosting Model")
print(classification_report(y_test, pred_labels))
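With the 'multi:softmax' objective, XGBoost returns the predicted class directly, whereas 'multi:softprob' returns one probability per class. The relationship between the two can be sketched in plain Python: per-class scores are turned into probabilities with softmax, and the class with the highest probability is the prediction. This is an illustration of the idea with made-up scores, not XGBoost's internal code.

```python
import math

def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = [0.2, -1.0, 1.5]  # made-up scores for classes 0, 1, 2
probs = softmax(raw_scores)
# The predicted class is the index of the largest probability.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```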
Recap
As the table above shows, the XGBoost model outperforms the other models on all metrics, so we can keep and save the XGBoost model.
Let’s try something fun: I will input some new reviews and use the model to predict their sentiment.
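review = [
    "Dapat tim yang tidak bisa main, kalah terus",  # Got a team that can't play, keep losing
    "Pilihan hero yang banyak dengan skil yang unik, game keren",  # Many heroes with unique skills, cool game
    "Game nya susah dimainkan, perlu latihan beberapa kali baru hapal",  # Hard to play, takes practice to learn
    "Dari game ini udah beberapa kali ikut turnamen dan menang dapat hadiah menarik",  # Joined tournaments and won nice prizes
    "Keluar hero baru dengan skill yang keren, selalu update"  # New heroes with cool skills, always updated
]

# Vectorize with the fitted CountVectorizer, then wrap in a DMatrix for prediction
transf_rev = xgb.DMatrix(cv.transform(pd.Series(review)))
pd.DataFrame({'review': review,
              'pred_sentiment': list(model.predict(transf_rev))})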
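The predictions come back as floats such as 0.0 or 2.0, so mapping them to names makes the result table easier to read. A minimal sketch, assuming the label encoding used earlier (0 = negative, 1 = neutral, 2 = positive). The `SENTIMENT_NAMES` mapping, the `to_names` helper, and the example predictions are made up for illustration.

```python
# Label encoding assumed from the sentiment column defined earlier.
SENTIMENT_NAMES = {0: "negative", 1: "neutral", 2: "positive"}

def to_names(predictions):
    """Map numeric class predictions (possibly floats) to sentiment names."""
    return [SENTIMENT_NAMES[int(p)] for p in predictions]

example_predictions = [0.0, 2.0, 1.0, 2.0]  # made-up model output
names = to_names(example_predictions)
print(names)
```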
Well, the model works quite well. The reviews with a positive tone get a predicted sentiment of 2 (positive), and those with a negative tone get a predicted sentiment of 0 (negative). A model like this can be used to gauge users' experience of the game from their reviews.