![](https://crypto4nerd.com/wp-content/uploads/2023/12/13hk9Vv_bRBkSaSFH9UotLg.png)
By Elisavet Palogiannidi, Senior Machine Learning Engineer, Light & Wonder iGaming
Picture this:
As a member of a data science team, you are asked to develop a Machine Learning (ML) model to address your company’s needs. You open a Jupyter notebook and start writing functions to analyze and visualize data as preliminary steps before developing the ML model. When asked whether the model is ready for deployment, you present your Jupyter notebook and assert, ‘Yes.’ But is this true?
In reality, creating production-ready ML models entails far more than writing functions in Jupyter notebooks. Software skills and effective collaboration can have a significant impact on the value and maintainability of the models (production-ready ML models are, in fact, software applications!). Data science teams building ML products should establish a workbench with tools and processes that are followed and respected by all team members. Accelerating daily activities and effectively adopting ML and collaboration best practices should be the be-all and end-all of such teams.
How can software skills save you time?
Starting a new project often involves reusing code that has already been implemented in previous projects. This may include setting up loggers, creating runner wrappers, generating documentation and utilizing other auxiliary functions. Building these functionalities from scratch consumes a significant amount of time and effort, and it can introduce code inconsistencies across projects. Fortunately, all these challenges can be mitigated by establishing a workbench for rapidly initiating new projects. In a nutshell, a template project encompasses common functionalities, including environment specifications, that can be inherited by any subsequent project.
The structure of the template project holds significant importance. A well-defined project structure helps newcomers and other developers understand the project without needing to delve into extensive documentation. It also eliminates the need to read and understand 100% of the code to find specific functionality. Well-organized codebases facilitate collaboration and the reproduction of analysis results and model predictions. In the following figure, a proposed template project structure is shown.
Directories are used to semantically organize the codebase into meaningful groups. Building upon this project structure, more directories can be added to include models, log files, (sample) data, semantically related pieces of code and more. The remaining files that reside in the root directory are related to various configuration instructions, developer documentation files (README.md, CHANGELOG.md, CONTRIBUTING.md) and environment management files (Pipfile, .env_file_template).
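As an illustration, a template project along these lines might be laid out as follows (the directory and file names are indicative, not prescriptive; teams should adapt them to their needs):

```text
ml-template-project/
├── src/                  # semantically grouped source code
├── tests/                # unit tests mirroring the src/ layout
├── configs/              # YAML configuration files
├── docs/                 # generated documentation (e.g. Sphinx)
├── data/                 # (sample) data
├── logs/                 # log files
├── models/               # serialized models
├── Makefile              # make commands for testing and quality checks
├── Pipfile               # pipenv environment specification
├── .env_file_template    # template for environment variables
├── README.md
├── CHANGELOG.md
└── CONTRIBUTING.md
```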
Additionally, using the right set of tools can save you time during development. Linters (which analyze code and flag errors), code formatters (which automatically refactor code to comply with specific Python conventions) and type checkers (which check the correctness of variables based on annotations) can be your “coding buddies”, helping you write correct, readable, and consistent code. Integrating them into pre-commit hooks acts as your “coding police”, helping you catch typos or errors that may have escaped your notice and reducing back-and-forth commits. Moreover, unit tests provide a testing mechanism that can uncover bugs that might otherwise be difficult to detect, ensuring that the code remains stable in specific environments. These tools can be set up and executed through make commands that are organized in Makefiles.
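As a rough sketch of what such a Makefile could contain (the targets and tool choices are illustrative, drawn from the stack described later in this article):

```makefile
.PHONY: format lint typecheck test quality

format:      ## auto-format the codebase with Black
	black src tests

lint:        ## flag style violations and suspicious constructs
	flake8 src tests

typecheck:   ## verify type annotations with MyPy
	mypy src

test:        ## run the unit test suite
	pytest tests

quality: format lint typecheck test   ## run every check with one command
```

Wrapping the tools behind make targets means every developer runs the same commands with the same flags, which keeps local checks consistent with what the pre-commit hooks enforce.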
How can software skills save your models?
It is a common phenomenon for data scientists to focus on algorithms, hyperparameter tuning, performance metrics and similar aspects, while treating the adoption of good software engineering practices and stricter coding standards as secondary.
At the end of the day, what matters is deploying machine learning models and offering them as business products for as long as possible. Writing simple, uncomplicated code increases the likelihood of the model becoming a maintainable and extensible solution for the business; otherwise, the models are destined to be phased out sooner or later. Since code and models are maintained by humans, project design and structure should cater to their benefit. As Martin Fowler aptly put it: “Any fool can write code that a computer can understand. Good programmers write code that humans can understand.”
The key lies in automating the project initiation process, code correctness and testing mechanisms, which allows more time to focus on developing ML functionality. A comprehensive discussion of additional strategies for writing high-quality code, such as striving for clarity and conciseness, adhering to naming conventions, avoiding monolithic scripts, and logically and semantically grouping functionality, is a topic beyond the scope of this article.
How to apply all these to your work?
First and foremost, select your technology stack. It is important to choose state-of-the-art tools with robust technical support that can scale effectively. Next, encapsulate the selected tools within a code infrastructure that aligns with your needs. Figure 2 illustrates an infrastructure that covers the stages of model development, culminating in code deployment to a cloud repository. The backbone of this infrastructure is the template project, the structure of which is depicted in Figure 1. However, it’s worth noting that a modern AI tech stack comprises a more extensive range of technologies and a more complex infrastructure, addressing various aspects of the model lifecycle, including data management, model management, and model deployment. An in-depth exploration of these aspects is beyond the scope of this article.
The shown infrastructure is encapsulated within PyCharm, a cross-platform IDE with a rich set of features and tools that accelerate development. PyCharm seamlessly integrates all development stages, including code writing, debugging, testing and even version control, within a single environment. This infrastructure can be divided into three main components:
Code Development
- Written in Python, using pipenv for dependency management. Input parameters are passed to the project through YAML configuration files.
Code Testing
- Orchestrated by a Makefile and includes:
- Unit tests to ensure code stability and reliability.
- Code quality checks.
Version Control
- A branching model is followed, with the master branch serving as the central point.
- New functionality is added to feature branches.
- Pre-commit and pre-push mechanisms are employed to guarantee code quality and correctness.
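To make the unit-testing component concrete, here is a minimal sketch of the kind of test pytest would discover and run. `normalize` is a hypothetical preprocessing helper used only for illustration, not part of any real project:

```python
def normalize(values):
    """Scale a list of numbers into the [0, 1] range (min-max scaling)."""
    lo, hi = min(values), max(values)
    if lo == hi:  # constant input: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# pytest discovers functions prefixed with `test_` automatically.
def test_normalize_bounds():
    result = normalize([2.0, 4.0, 6.0])
    assert min(result) == 0.0
    assert max(result) == 1.0


def test_normalize_constant_input():
    assert normalize([5.0, 5.0]) == [0.0, 0.0]
```

Running `make test` (or `pytest tests` directly) would execute tests like these on every change, and the pre-push hook can refuse to push code that breaks them.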
At the heart of the chosen technology stack lie several key components, including Python, PyCharm, Pipenv, Bash, Git, Flake8, Black, MyPy, Pytest, and Sphinx. This infrastructure can be further enhanced by incorporating additional steps and technologies relevant to the models’ lifecycle or the deployment of the application.
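For instance, the pre-commit mechanism mentioned above is commonly configured through the pre-commit framework. A minimal `.pre-commit-config.yaml` wiring in some of these tools might look like the following sketch (the `rev` pins are illustrative and should be updated to current releases):

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 23.12.1          # pin a specific Black release
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/flake8
    rev: 6.1.0
    hooks:
      - id: flake8
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.8.0
    hooks:
      - id: mypy
```

With this file in the repository root, `pre-commit install` registers the hooks so that every commit is formatted, linted and type-checked before it enters history.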
Key takeaways
- Machine learning applications are fundamentally software, so software skills matter.
- Stay up-to-date with state-of-the-art technologies for your models and products.
- Automate, automate, automate!
The opinions expressed in this blog post are strictly those of the author in their personal capacity. They do not purport to reflect the opinions or views of Light & Wonder or of its employees.
WE’RE HIRING
Find out more about life at Light & Wonder iGaming and our career opportunities: https://igaming.lnw.com/careers/