SageMaker (SM) Pipelines was released at re:Invent at the end of 2020. Essentially, it’s a workflow orchestrator for SageMaker Jobs.
If you’ve worked with other workflow orchestrators such as Apache Airflow or AWS Step Functions, the general concept should be fairly familiar to you. While SageMaker Pipelines is considered a workflow orchestrator, the key difference is that it is aimed at a fairly narrow use case — namely orchestrating workflows that aim to perform training or inference using a Machine Learning model.
Do you want to dive deeper into SageMaker Pipelines? Check out the official AWS documentation and have a look at one of my previous blog posts, in which I give a longer intro to SM Pipelines, including some tips on improving their robustness and efficiency.
SageMaker Pipelines certainly isn’t the most mature workflow orchestrator. It has rough edges: the docs are fragmented, the APIs are at times inconsistent, and that’s not all, unfortunately.
The primary reason why organisations use it heavily is not that it’s a best-in-class tool, but its excellent integration with the rest of the SageMaker ecosystem and beyond (Lambda, SQS, …).
And that’s fair! Also, the SageMaker Pipelines team is working hard to continuously improve their product. From a user perspective, the pace at which new features, integrations and usability improvements are released is impressive.
SageMaker Pipelines is what it is and it’s getting better fast. Yes, there’re good parts. Yes, there’re painful elements. One specific pain point is particularly annoying.
What is it and how to fix it? This is what we’re going to look into.
Imagine the following: you’re building a training (SageMaker) pipeline. Since you’re training your model on a couple of hundred gigabytes of data, your data loading, preprocessing and feature engineering steps are all Spark jobs, each running on an elastically provisioned, beefy cluster. Thanks to `PySparkProcessor` and `ProcessingStep`, all it really takes is writing the actual code that should run in each step and declaring the sequence in which the steps should be executed in your SageMaker Pipeline. After the heavy lifting has been done in a few Spark jobs, you define an `Estimator` and pass it to a `TrainingStep`. Training is followed by a `ConditionStep`, which evaluates whether the achieved validation score is good enough for training to be followed by a `RegisterModel` step, which checks the resulting trained model artifact into the SageMaker Model Registry. Finally, your SageMaker Pipeline runs a `LambdaStep` to let a downstream service know that your new model has been checked in and is ready to be consumed.
Sounds like quite a standard pipeline, doesn’t it? It is, indeed, and SageMaker Pipelines makes it quite easy to accomplish this without worrying about any underlying compute (and more). Sounds great!
So where’s the pain?
The pain really shows during development and debugging. Each step takes time to launch and then minutes (if you scoped down your data for development — please, do that!) or even hours to run… only for the pipeline to fail after multiple steps have run successfully, potentially wasting tens of minutes or hours for quite annoying reasons, such as…
- … you had a typo in the KMS Key ID you set as part of the `Estimator` to encrypt your trained model artifact.
- … you forgot to set the correct `NetworkConfig` in your second Spark job, so your nodes can’t communicate.
- … the Lambda function you intended to trigger in a `LambdaStep` was renamed a few days ago.
- … a Docker image on ECR that you referenced in your third `PySparkProcessor` had been removed due to security vulnerabilities.
And there’re many, many more reasons why your SageMaker Pipeline was already set up for failure even before you started it. No linter can help you catch these bugs. Discovering a failure after tens of minutes or hours of runtime although you could’ve seen and fixed it without even triggering the pipeline is bad. It’s frustrating, a waste of time and resources. It’s simply a pain.
Let’s take the perspective of an ML Platform team. Such a team provides an internal developer platform to speed up data-science-powered development teams. When building ML platforms, such a team would usually look into the typical user journeys and aim to define processes and build tooling that streamline and automate how models are built, deployed and served, to speed up the work of development teams and reduce their cognitive load.
Narrowing our view down to SageMaker Pipelines again, an ML Platform team could, for example, decide to build a centralized build and deployment workflow that takes the burden of deploying SageMaker Pipelines away from development teams. Building such a centralized workflow requires making opinionated choices about how build and deployment should be done. These choices may also influence how development teams are required to build their SageMaker Pipelines. It’s a contract: “Our CI/CD pipelines will do the deployment for you IF you adhere to these 5 things in your SM Pipeline…”. For example, this could entail including a certain pipeline `Parameter` in your pipeline definition or using a specific set of S3 buckets as artifact stores.
So where’s the pain?
Documenting such a contract very, very well is indispensable. The development teams — the users of your platform — must clearly understand what they need to do in order to use your product without running into issues. In addition to good documentation, some training and initial hand-holding/enabling might be wise as well.
As much as we’d want to believe that this would be enough, be assured, it is not. Teams will not follow the contract, or will misinterpret it, wonder why the deployment failed, and come back to the ML Platform team unhappy with the product you’ve provided them. And they are right to be. It’s a pain!
What both described scenarios have in common is that the SageMaker Pipeline object does not match our expectations.
More recently, “local mode” was introduced for SageMaker Pipelines. To some extent, this helps you do a kind of dry run of your pipeline locally before upserting it. Although it’s a great improvement, it’s not a silver bullet:
- Running an entire pipeline locally is still time consuming, even if data is scoped down.
- If you’re working with sensitive data (e.g. PII), chances are you do not have access to any data locally. Perhaps not even in the cloud without deploying to a controlled environment. That means lots of mock-data creation to get a full pipeline run through.
- Interaction with other cloud services from your pipeline is not possible or difficult to test when running in local mode.
- Local mode only supports a limited number of step types. Got a `LambdaStep` in your pipeline? Nope, can’t do (at the moment).
In essence, although local mode helps a bit, there isn’t really a way to find out that your SageMaker Pipeline does not match your expectations without deploying and/or running that pipeline on the managed SageMaker Pipelines service. Consequently, mistakes remain expensive.
Better validate your expectations before you deploy or run your SageMaker Pipeline!
How can we do this completely offline, i.e. without `upsert`ing or `start`ing the SM Pipeline?
Validating your SageMaker Pipeline from scratch
A SageMaker Pipeline is a Python object.
This object holds everything that defines your pipeline. Let’s look a bit more closely at our very simple example pipeline object. The top-level attributes are as follows:
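The attribute listing from the original post isn’t reproduced here, but the idea can be sketched with a stand-in object (building a real `Pipeline` requires the sagemaker SDK and an AWS session; the attribute names below mirror the SDK’s top-level Pipeline attributes, and the exact set depends on your SDK version):

```python
from types import SimpleNamespace

# Stand-in for a sagemaker.workflow.pipeline.Pipeline object; in practice,
# sm_pipeline would be your actual Pipeline instance.
sm_pipeline = SimpleNamespace(
    name="my-training-pipeline",
    parameters=[],                     # pipeline Parameters, e.g. ParameterString objects
    pipeline_experiment_config=None,
    steps=[],                          # the Step objects making up the DAG
    sagemaker_session=None,
)

# Listing the top-level attributes is a plain vars() call:
print(sorted(vars(sm_pipeline).keys()))
```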
Each of these attributes is either an iterable, a primitive or another object defined by the sagemaker Python SDK. We can dig further and further down this tree, realizing that a fully fledged SageMaker Pipeline is quite a complex, nested object. However, once you understand how to navigate it and where to find what, inspecting that object allows us to peek into what’s going to happen before it happens, i.e. before the pipeline is actually upserted or executed.
For example, let’s have a closer look at the first step of our pipeline. Seeing that this step is of `StepTypeEnum.PROCESSING` and knowing a bit about how the sagemaker SDK is structured, we realize we can have a closer look at the `Processor` that is used in this step.
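The screenshots from the original post aren’t reproduced here, but the navigation pattern can be sketched with stand-in objects (attribute names such as `steps`, `step_type`, `processor`, `image_uri` and `output_kms_key` follow the sagemaker SDK; the values are made up):

```python
from types import SimpleNamespace

# Stand-ins mirroring the shape of the real sagemaker objects; in practice,
# sm_pipeline would be your actual Pipeline instance.
processor = SimpleNamespace(
    image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-spark-image:latest",
    output_kms_key="alias/my-kms-key",
    network_config=None,
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
)
first_step = SimpleNamespace(
    name="spark-preprocessing",
    step_type="Processing",
    processor=processor,
)
sm_pipeline = SimpleNamespace(steps=[first_step])

# Drill down: pipeline -> step -> processor -> attribute of interest
print(sm_pipeline.steps[0].step_type)            # Processing
print(sm_pipeline.steps[0].processor.image_uri)  # the ECR image the step will pull
```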
Now there’s some interesting stuff: `image_uri`, `arguments`, `output_kms_key`, `volume_kms_key`, `network_config`, `role` and more. All of these are attributes you’d normally only realize were set incorrectly once the step that contains this specific processor has started to run in your SageMaker Pipeline. Too bad if it’s the third step, your pipeline is already two hours in, and it says “Can’t find image `{image_uri}` on ECR!”.
In order to avoid this, we can validate these properties against their expected values after having created an instance of the SageMaker Pipeline:
- Define your expected value
- Know where to find the property you want to validate in the SageMaker Pipeline object.
- Fetch the observed value from there.
- Compare expected to observed value.
Done!
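Put together, such a from-scratch check can look like the sketch below. The helper function and the stand-in pipeline object are made up for illustration; with a real `Pipeline` you’d navigate the same attributes:

```python
from types import SimpleNamespace

def validate_processor_attribute(sm_pipeline, step_name: str, attribute: str, expected):
    """Compare an expected value against the one observed on a step's processor."""
    step = next(s for s in sm_pipeline.steps if s.name == step_name)
    observed = getattr(step.processor, attribute)
    return {
        "attribute": attribute,
        "expected": expected,
        "observed": observed,
        "ok": observed == expected,
    }

# Stand-in pipeline; in practice this is your sagemaker Pipeline instance.
sm_pipeline = SimpleNamespace(
    steps=[
        SimpleNamespace(
            name="spark-preprocessing",
            processor=SimpleNamespace(output_kms_key="alias/my-kms-key"),
        )
    ]
)

result = validate_processor_attribute(
    sm_pipeline, "spark-preprocessing", "output_kms_key", expected="alias/my-kms-key"
)
print(result["ok"])  # True
```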
You’re probably not going to do this often if you have to do it from scratch, though. It’s quite tedious, requires thorough knowledge of the SageMaker Pipeline object, and the not-quite-streamlined API will make your life harder and force you to regularly dig into the sagemaker Python SDK.
Using sagemaker-rightline to validate your Pipeline
Since I’ve been working with SageMaker Pipelines for a while, I’ve repeatedly observed people (including myself) running into the problems laid out in this article. At some point, I started looking into how to validate a SM Pipeline object offline, starting pretty much as described above. This grew from a few evenings of exploration into a Python package: sagemaker-rightline — helping to get your SageMaker Pipeline right since 2023!
It’s open source, hosted on GitHub and available via PyPI:
```
pip install sagemaker-rightline
```
> Please note that it’s in alpha state and is being fairly actively developed.
> Some Step Types are not yet supported.
> Contributions are more than welcome!
What is it?
In essence, it’s a framework that…
- makes it very easy to create a collection of Validations you want to run against your SageMaker Pipeline object, resulting in a report that you can act upon.
- makes it quite straightforward to develop new Validations, abstracting away a fair bit of the overhead you’d have when validating your pipeline object from scratch.
How can I use it?
Source code: https://github.com/stiebels/sagemaker-rightline
PyPI: `pip install sagemaker-rightline`
First, let’s import the `Configuration` class:

```python
from sagemaker_rightline.model import Configuration
```
It’s the entry point for the user. The `Configuration` class accepts two arguments:

- A SageMaker Pipeline object that should be validated: `sagemaker.workflow.pipeline.Pipeline`
- A `List` of `Validation` objects, which specify what should be validated.
The `Validation` abstract base class defines a contract for specific Validations, such as `StepImagesExistOnEcr`, `StepKmsKeyId` or `PipelineParameters` (see the README.md for the full list of currently implemented Validations). Depending on the specific Validation, they might require further input.
Some are happy without any, such as `StepImagesExistOnEcr`, which simply finds all `image_uri` values in your steps and checks that they indeed exist on ECR (you can pass a configured boto3 client if it’s cross-account). Others, such as `StepKmsKeyId`, require you to specify an expected value, a rule (e.g. `Equals` or `Contains`, as is or inverted) as well as, optionally, a specific `step_name` if you want to run the validation only against a specific step instead of all steps.
```python
from sagemaker.workflow.parameters import ParameterString

from sagemaker_rightline.rules import Contains, Equals
from sagemaker_rightline.validations import (
    PipelineParameters,
    StepImagesExistOnEcr,
    StepKmsKeyId,
)

validations = [
    StepImagesExistOnEcr(),
    PipelineParameters(
        parameters_expected=[
            ParameterString(
                name="parameter-1",
                default_value="some-value",
            ),
        ],
        rule=Contains(),
    ),
    StepKmsKeyId(
        kms_key_id_expected="some/kms-key-alias",
        step_name="sm_training_step_sklearn",  # optional: if not set, will check all steps
        rule=Equals(),
    ),
]
```
Once we’ve decided on the validations we want to run against our SageMaker Pipeline and assuming our pipeline object is `sm_pipeline`, we can instantiate `Configuration`:
```python
config = Configuration(
    validations=validations,
    sagemaker_pipeline=sm_pipeline,
)
```
The `Configuration` class exposes a single public method: `Configuration.run`. Calling `run` runs the passed `List` of `Validation` objects against the `Pipeline` and returns a `Report` object, which holds a `List` of `ValidationResult` objects. Optionally, by setting `return_df` to `True`, a `pandas.DataFrame` is returned instead of a `Report` object.
```python
report = config.run(return_df=True)
report
```
Based on the report, you can handle failed validation results as you see fit. Optionally, you can also set `fail_fast` to `True` and have the configuration run raise a `ValidationFailedError` as soon as any validation is unsuccessful.
Check out the examples in the GitHub repository for an end-to-end runnable notebook or script.
Happy validating!