![](https://crypto4nerd.com/wp-content/uploads/2023/07/1KYuUTNhslO7fQwGQqqYszg-1024x1045.png)
In real-world scenarios, Machine Learning (ML) models are rarely standalone components. They are typically part of a larger system that involves preprocessing, business logic, and post-processing. This article explores two common approaches to architecting such systems in production.
You can think of the final deployable system as follows:

𝗜𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗥𝗲𝘀𝘂𝗹𝘁𝘀 = Preprocessing + Business Logic + ML Model + Post-Processing + Business Logic
In many cases, it is likely to extend to:
𝗜𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗥𝗲𝘀𝘂𝗹𝘁𝘀 = Preprocessing + Business Logic + ML Model + Post-Processing + Business Logic + Additional ML Model + …
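The formula above can be sketched as a simple function composition: each stage is a plain function, and the deployable system is the chain of all of them. All names and transformations below are illustrative stand-ins, not a real framework or model.

```python
from functools import reduce

def preprocess(features):
    # Illustrative: scale raw feature values down by 100
    return [f / 100.0 for f in features]

def business_logic_pre(features):
    # Illustrative business rule: clamp values into [0, 1] before the model
    return [min(max(f, 0.0), 1.0) for f in features]

def ml_model(features):
    # Stand-in for a real model call: a simple mean of the features
    return sum(features) / len(features)

def postprocess(score):
    # Illustrative post-processing: round the raw score
    return round(score, 3)

def business_logic_post(score):
    # Illustrative business rule applied after inference
    return {"score": score, "approved": score > 0.5}

# The "Inference Results" pipeline: each stage feeds the next
PIPELINE = [preprocess, business_logic_pre, ml_model, postprocess, business_logic_post]

def infer(raw_features):
    return reduce(lambda value, stage: stage(value), PIPELINE, raw_features)

result = infer([60, 80, 120])
```

Extending the system with an additional ML model, as in the second formula, amounts to appending more stages to `PIPELINE`.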
In real-world production environments, you will find two common ways this system is architected:
A: Single Service Deployment
The most straightforward approach to package the deployable system is to combine all additional processing logic with the ML model and deploy it as a single service.
Here's how it would work for a request-response type of deployment:
- The backend service calls the ML service exposed via gRPC.
- The ML service retrieves features from a feature store. Preprocessing and additional business logic are applied to these features.
- The processed features are then fed into the ML model.
- The inference results are subjected to additional post-processing and business logic.
- The final results are returned to the backend service and can be used in the product application.
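The steps above can be sketched as a single service object that owns feature retrieval, pre/post-processing, and the model call. The feature store, model, and transport are in-memory stand-ins with made-up names; in production the `predict` method would be wired into a gRPC servicer.

```python
class FakeFeatureStore:
    """In-memory stand-in for a real feature store."""
    def __init__(self, table):
        self._table = table

    def get_features(self, entity_id):
        return self._table[entity_id]

class SingleMLService:
    """Everything lives in one deployable unit: the single-service approach."""
    def __init__(self, feature_store, model):
        self.feature_store = feature_store
        self.model = model

    def predict(self, entity_id):
        # Step: retrieve features from the feature store
        raw = self.feature_store.get_features(entity_id)
        # Step: preprocessing + business logic (illustrative scaling + clamping)
        feats = [min(max(f / 100.0, 0.0), 1.0) for f in raw]
        # Step: model inference
        score = self.model(feats)
        # Step: post-processing + business logic on the inference result
        return {"entity_id": entity_id, "score": round(score, 3)}

store = FakeFeatureStore({"user-42": [60, 80, 120]})
service = SingleMLService(store, model=lambda feats: sum(feats) / len(feats))
response = service.predict("user-42")
```

The appeal is simplicity: one codebase, one deployment. The cost is that any change to the business rules requires redeploying the whole ML service.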
B: Business Logic Decoupled from the ML Model
Another approach involves introducing a separate service that sits between the backend product service and the service exposing the ML model. This architecture is often referred to as "The Blender" because it facilitates blending: combining multiple ML model inference results to produce a more powerful statistical prediction.
Here's the breakdown of the diagram:
1: The backend service calls the service containing the business logic rules ("The Blender"), which is exposed via gRPC.
2: The service containing business logic rules calls the ML service exposed via gRPC.
3–5: Same as steps 2–4 in example A.
6: The Blender receives the results and applies business rules to the inference results.
7: The final results are returned to the backend service and can be utilized within the product application.
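The decoupled flow can be sketched as two separate service objects: the ML service owns feature retrieval and inference, while the Blender owns the business rules and delegates to it. Both are in-process stand-ins with illustrative names; in production each would sit behind its own gRPC endpoint.

```python
class MLService:
    """Owns feature retrieval, preprocessing, and the model call (steps 3-5)."""
    def __init__(self, feature_table):
        self.feature_table = feature_table

    def predict(self, entity_id):
        # Illustrative preprocessing + model: scale features, then average
        feats = [f / 100.0 for f in self.feature_table[entity_id]]
        return sum(feats) / len(feats)

class BlenderService:
    """Owns business rules only; inference is delegated to the ML service."""
    def __init__(self, ml_service, threshold):
        self.ml_service = ml_service
        self.threshold = threshold

    def handle(self, entity_id):
        score = self.ml_service.predict(entity_id)   # steps 2-5
        approved = score >= self.threshold           # step 6: business rules
        # step 7: final result returned to the backend service
        return {"entity_id": entity_id, "score": round(score, 3), "approved": approved}

ml = MLService({"user-42": [60, 80, 100]})
blender = BlenderService(ml, threshold=0.5)
response = blender.handle("user-42")
```

With this split, swapping in a different model version (or adding a second `MLService` to blend against) only requires changing what the Blender calls, which is exactly why it suits teams testing many model/business-logic combinations in production.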
This architecture becomes valuable when multiple individuals work on the pipeline and various combinations of model versions and business logic need to be tested in production. In a future long-form newsletter issue, I'll delve deeper into this topic, so stay tuned!