![](https://crypto4nerd.com/wp-content/uploads/2024/02/1ovOKczM-NZ6c8qtHpCPxdA-1024x768.jpeg)
Predicting what champion someone is playing based on their match data
After countless hours on Kaggle, hours upon hours stuck in tutorial hell, and a few too many hours watching Ken Jee and Tina Huang, I finally decided to put my skills to use and start my first Data Science project. (Link to project repo)
I spent a bit of time thinking about what I should make my first project about.
Should it be a regression or classification problem?
What type of skills do I want to showcase and what do I want to learn?
What should the project even be about???
Then the answer came to me like a revelation: League of Legends.
Much of my teens had been spent playing this game. I had almost too much subject expertise, and it just so happened that Riot Games offers a number of APIs that give access to tons of data. My idea was to predict what champion someone was playing by looking at their match data.
I knew this was a pretty big endeavor, but I was determined to get it started.
The first few days were daunting, but very exciting nonetheless. I had the choice of either getting a pre-made dataset or assembling my own… I of course chose to create my own…
The biggest obstacle I faced was the rate limit on Riot’s API: you can make a maximum of 100 requests every 2 minutes. If I wanted to assemble a dataset big enough to make any predictions from, it was going to take time. I also had to deal with plenty of other problems when using the API, such as key errors, corrupt matches, and random things just not working. I guess this part of the project really drove home the idea that 80% of a data scientist’s time is spent collecting and cleaning data. Not so sexy if I do say so myself…
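For illustration, here is a minimal sketch of one way to stay under a limit like 100 requests per 2 minutes: a sliding-window limiter that sleeps when the window is full. The class name `SlidingWindowLimiter` and the injectable `clock` and `sleep` parameters are assumptions made for this example, not code from the project repo, and the actual HTTP call is left out on purpose.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `period`-second window.

    Hypothetical helper for illustration; the defaults mirror the
    100 requests / 2 minutes limit mentioned above.
    """

    def __init__(self, max_calls=100, period=120.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock      # injectable so the limiter can be tested
        self.sleep = sleep
        self.calls = deque()    # timestamps of calls still inside the window

    def wait(self):
        """Block until another call is allowed, then record it."""
        while True:
            now = self.clock()
            # Forget calls that have already left the window.
            while self.calls and now - self.calls[0] >= self.period:
                self.calls.popleft()
            if len(self.calls) < self.max_calls:
                self.calls.append(now)
                return
            # Window is full: sleep until the oldest call expires.
            self.sleep(self.period - (now - self.calls[0]))
```

Calling `limiter.wait()` right before every API request would keep a long-running collection loop inside the limit without hand-tuned `time.sleep` calls scattered through the code.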
After hours and hours of reading through documentation and dozens of YouTube videos (shoutout iTero gaming and Beora), I finally managed to finish my data pipeline, assemble my dataset, and get ready to begin the actual project. The finished dataset ended up being 72,000+ games and a total of 700,000+ entries, which I stored in a PostgreSQL database.
Thanks to a couple of decisions I made early on when assembling the dataset (I manually defined which fields to pull into a DataFrame from the JSON the API returned, and simply skipped any match that raised an error), I didn’t have to deal with many null values. My data cleaning was limited to dropping remakes: games that ended early because a player was missing from either team.
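The idea of explicitly picking fields, skipping broken matches, and dropping remakes can be sketched roughly as follows. This is an illustrative example under assumptions, not the project's pipeline: the function names are hypothetical, and the JSON keys (`metadata.matchId`, `info.participants`, `championName`, `gameEndedInEarlySurrender`) are assumed to match the match payload and should be checked against Riot's documentation.

```python
def participant_rows(match):
    """Flatten one match dict into one row per player.

    Raises KeyError/TypeError if an expected field is missing or malformed.
    The key names are assumptions about the API payload.
    """
    match_id = match["metadata"]["matchId"]
    rows = []
    for p in match["info"]["participants"]:
        rows.append({
            "match_id": match_id,
            "champion": p["championName"],
            "kills": p["kills"],
            "deaths": p["deaths"],
            "assists": p["assists"],
            "remake": p["gameEndedInEarlySurrender"],
        })
    return rows


def build_rows(matches):
    """Return (rows, skipped): clean player rows and the count of bad matches."""
    rows, skipped = [], 0
    for match in matches:
        try:
            new_rows = participant_rows(match)
        except (KeyError, TypeError):
            skipped += 1          # corrupt or incomplete match: skip it whole
            continue
        if any(r["remake"] for r in new_rows):
            continue              # remake: the game ended early, so drop it
        rows.extend(new_rows)
    return rows, skipped
```

Because every row is built from an explicit list of keys, a match either yields complete rows or is skipped entirely, which is why few null values survive. The resulting list of dicts can be passed straight to `pandas.DataFrame(rows)`.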
I then performed exploratory data analysis and found several defining features of champions that would be crucial for building a model to predict which champion is being played.