![](https://crypto4nerd.com/wp-content/uploads/2023/07/1q9iH4lAvpXxWIeQE-16U-Q-1024x727.png)
This article is for you if you are starting your journey toward this certification and are looking for tips on how to prepare for the exam and ace it.
Let me start by sharing this official certification page, where you can find all the details you will need for your preparation: the number of questions, format, duration, recommended course, etc. For a quick overview: the exam is 90 minutes long with 45 multiple-choice questions. I took the online version, which was fairly easy to navigate. You will need to install proctoring software that requires admin rights, so I would suggest taking the exam from your personal laptop. You can go back and forth between questions, and you have the option to mark and review any question at any point before you submit. Approximately, the questions were distributed as follows: 30% code (MLflow, Feature Store, Delta tables), 50% Databricks concepts (MLflow, Feature Store, Repos, pandas, AutoML, etc.), 10% the Databricks UI, and 10% Delta tables.
The main intention behind writing this blog is to share my personal experience in terms of what topics I feel are really important and you should definitely have a read before you take your attempt. These topics are mostly covered in the course videos but there are definitely some which need more attention.
- Inference on Databricks
- Decision Trees: maxBins, why no One Hot Encoding?
- InternalFrame
- MapInPandas
- Pandas API on Spark
- Code Questions: filter, evaluators, transformer/estimators, dbutils.data.summarize
- Repos: clone, pull, push, branch, etc
- Versioning of data and models: what happens when you register a new model and when you register a pre-registered model
- Databricks imputers: especially for categorical features (StringIndexer and OHE)
- AutoML: where to find the best model and the EDA notebook
- Scikit-learn: single node model and what things can/cannot work with it
- Feature Store: create_table, score_batch
- Autolog: nested runs
- MLflow: search_runs, default evaluation metric for regression/classification
- Hyperopt: the algorithms it uses and how to use it with Databricks
- SparkTrials: tradeoff between parallelism and max_evals
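To build intuition for the SparkTrials point above: with parallelism p, SparkTrials runs up to p trials at once, so a budget of max_evals evaluations finishes in roughly ceil(max_evals / p) sequential rounds, but adaptive algorithms like TPE then see fewer completed results before proposing new points (at parallelism equal to max_evals, the search is essentially random). A minimal pure-Python sketch of that round count (the helper name is my own, not a Hyperopt API):

```python
import math

def sequential_rounds(max_evals: int, parallelism: int) -> int:
    """Approximate number of sequential batches of trials when
    running max_evals evaluations with the given parallelism."""
    return math.ceil(max_evals / parallelism)

# Same budget of 32 evaluations:
# parallelism=1  -> 32 rounds (slow, but fully adaptive search)
# parallelism=8  ->  4 rounds (faster, but proposals use less feedback)
# parallelism=32 ->  1 round  (fastest, but no adaptivity at all)
```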
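On the imputation point above: the course handles categorical features with StringIndexer plus OneHotEncoder in Spark ML, but the underlying idea is easy to see on a single node. A hedged pandas sketch (column names and data are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", None, "blue", "red"],  # categorical with a missing value
    "price": [1.0, 2.0, None, 4.0],         # numeric with a missing value
})

# Impute: most-frequent value for the categorical column, mean for the numeric one
df["color"] = df["color"].fillna(df["color"].mode()[0])
df["price"] = df["price"].fillna(df["price"].mean())

# One-hot encode the categorical column (Spark ML would chain
# StringIndexer -> OneHotEncoder for the same effect)
encoded = pd.get_dummies(df, columns=["color"])
```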
Some general pointers for the exam:
All of the questions I got in my exam were multiple-choice with a single correct answer, and I did not find any official statement mentioning questions with multiple correct answers. One thing I cannot emphasize enough is to stay focused on the “Scalable Machine Learning” course for this exam. Literally all the functions, the little notes in the notebooks, and the comments mentioned by the instructor are everything you need to pass.
Another important thing to keep in mind is to study the UI too. I know it sounds a bit lame, but there were definitely around 2–3 questions based on the UI. These questions ask things like: on which page in the UI can you find a model's version, or from which page can you see the models and their features together.
There will be code questions, so I would suggest going through the different methods available on the MLflow and Feature Store clients. In these questions you might have to:
- assess a given piece of code for correctness
- fill in some blank spaces in the code
- suggest corrective measures for a given task
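As a taste of the level involved, some of these code questions reduce to knowing exactly what an evaluator computes. For example, RMSE, the default metric of Spark ML's RegressionEvaluator, is just the square root of the mean squared error. A hedged pure-Python sketch with made-up predictions and labels:

```python
import math

preds = [2.5, 0.0, 2.1]    # hypothetical model predictions
labels = [3.0, -0.5, 2.0]  # hypothetical true labels

# RMSE: square root of the mean of squared prediction errors
mse = sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(preds)
rmse = math.sqrt(mse)
```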
If you’ve made it this far in the article, then you definitely deserve a
Bonus point: The exam guide suggests taking just the Scalable Machine Learning course for this certification, but I would highly recommend also going through the “Experimentation” section of the Machine Learning in Production (v2) course. This will definitely give you an edge in the exam.
All the best!!!