![](https://crypto4nerd.com/wp-content/uploads/2023/07/1q9iH4lAvpXxWIeQE-16U-Q-1024x727.png)
This article is for you if you are starting your journey toward this certification and are looking for tips on how to prepare for the exam and ace it.
Let me start by sharing this official certification page, where you can find all the details you will need for your preparation: the number of questions, format, duration, recommended course, etc. For a quick overview: the exam is 90 minutes long with 45 multiple-choice questions. I took the online version, which was fairly easy to navigate. You will need to install proctoring software that requires admin rights, so I would suggest taking the exam from your personal laptop. You can go back and forth between questions, and you have the option to mark and review any question at any point before you submit. Approximately, the questions were distributed as follows: 30% code (MLflow, Feature Store, Delta tables), 50% Databricks concepts (MLflow, Feature Store, Repos, pandas, AutoML, etc.), 10% the Databricks UI, and 10% Delta tables.
The main intention behind writing this blog is to share my personal experience in terms of what topics I feel are really important and you should definitely have a read before you take your attempt. These topics are mostly covered in the course videos but there are definitely some which need more attention.
- Inference on Databricks
- Decision Trees: maxBins, why no One Hot Encoding?
- InternalFrame
- MapInPandas
- Pandas API on Spark
- Code Questions: filter, evaluators, transformer/estimators, dbutils.data.summarize
- Repos: clone, pull, push, branch, etc
- Versioning of data and models: what happens when you register a new model and when you register a pre-registered model
- Databricks imputers: especially for categorical features (StringIndexer and OHE)
- AutoML: where to find the best model and the EDA notebook
- Scikit-learn: single node model and what things can/cannot work with it
- Feature Store: create_table, score_batch
- Autolog: nested runs
- MLflow: search_runs, default evaluation metric for regression/classification
- Hyperopt: the algorithms it uses and how to use it with Databricks
- SparkTrials: tradeoff between parallelism and max_evals
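To build intuition for the SparkTrials point above: with parallelism p, SparkTrials runs up to p trials at once, so a budget of max_evals evaluations finishes in roughly ceil(max_evals / p) sequential rounds, but adaptive algorithms like TPE then see fewer completed results before proposing new points (at parallelism equal to max_evals, the search is essentially random). A minimal pure-Python sketch of that round count (the helper name is my own, not a Hyperopt API):

```python
import math

def sequential_rounds(max_evals: int, parallelism: int) -> int:
    """Approximate number of sequential batches of trials when
    running max_evals evaluations with the given parallelism."""
    return math.ceil(max_evals / parallelism)

# Same budget of 32 evaluations:
# parallelism=1  -> 32 rounds (slow, but fully adaptive search)
# parallelism=8  ->  4 rounds (faster, but proposals use less feedback)
# parallelism=32 ->  1 round  (fastest, but no adaptivity at all)
```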
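On the imputation point above: the course handles categorical features with StringIndexer plus OneHotEncoder in Spark ML, but the underlying idea is easy to see on a single node. A hedged pandas sketch (column names and data are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", None, "blue", "red"],  # categorical with a missing value
    "price": [1.0, 2.0, None, 4.0],         # numeric with a missing value
})

# Impute: most-frequent value for the categorical column, mean for the numeric one
df["color"] = df["color"].fillna(df["color"].mode()[0])
df["price"] = df["price"].fillna(df["price"].mean())

# One-hot encode the categorical column (Spark ML would chain
# StringIndexer -> OneHotEncoder for the same effect)
encoded = pd.get_dummies(df, columns=["color"])
```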
Some general pointers for the exam:
All of the questions I got in my exam were multiple-choice with a single correct answer, and I did not find any official statement mentioning questions with multiple correct answers. One thing I cannot emphasize enough is to stay focused on the “Scalable Machine Learning” course for this exam. Literally all the functions, the little notes in the notebooks, and the comments mentioned by the instructor are everything you need to pass.
Another important thing to keep in mind is to study the UI too. I know it sounds a bit lame, but there were definitely around 2–3 questions based on the UI. These questions ask things like: on which page in the UI can you find a model's version, or from which page can you see the models and their features together.
There will be code questions, so I would suggest going through the different methods available on the MLflow and Feature Store clients. In these questions you might have to:
- assess a given piece of code for correctness
- fill in some blank spaces in the code
- suggest corrective measures for a given task
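As a taste of the level involved, some of these code questions reduce to knowing exactly what an evaluator computes. For example, RMSE, the default metric of Spark ML's RegressionEvaluator, is just the square root of the mean squared error. A hedged pure-Python sketch with made-up predictions and labels:

```python
import math

preds = [2.5, 0.0, 2.1]    # hypothetical model predictions
labels = [3.0, -0.5, 2.0]  # hypothetical true labels

# RMSE: square root of the mean of squared prediction errors
mse = sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(preds)
rmse = math.sqrt(mse)
```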
If you’ve made it this far in the article, then you definitely deserve a
Bonus point: The exam guide suggests taking just the Scalable Machine Learning course for this certification, but I would highly recommend also going through the “Experimentation” section of the Machine Learning in Production (v2) course. This will definitely give you an edge in the exam.
All the best!!!