
Machine learning (ML) is the process of creating systems that can learn from data and make predictions or decisions. ML can help organizations solve various problems, such as customer churn, fraud detection, sentiment analysis, image recognition, and more. However, building ML models can be challenging and time-consuming, especially for non-experts or beginners: it involves many steps, such as data preparation, feature engineering, algorithm selection, hyperparameter tuning, model evaluation, and deployment.
To simplify and accelerate the ML process, Dataiku offers a powerful and easy-to-use tool called AutoML. AutoML stands for automated machine learning, which is the process of automating some or all of the steps involved in building ML models. Dataiku AutoML allows users to create high-quality ML models with minimal intervention, while also giving them full control and transparency over the model design and performance.
In this blog post, we will explore how Dataiku AutoML works, with sample code, and why it is the next step for ML practitioners and enthusiasts.
How Dataiku AutoML Works
Dataiku AutoML works in two modes: quick prototypes and expert mode. Quick prototypes are pre-configured ML tasks that allow users to create models in a few clicks. Expert mode gives users more flexibility and customization options for their models.
Quick Prototypes
Quick prototypes are ideal for users who want to get started with ML quickly and easily. Dataiku provides several quick prototype templates for common ML tasks, such as:
- Prediction: Predict a binary or numerical outcome based on input features.
- Clustering: Group similar data points together based on input features (see the sketch after this list).
- Time series forecasting: Predict future values of a time series based on past values.
- Causal ML: Estimate the causal effect of a treatment variable on an outcome variable.
- Computer vision: Classify images based on their content.
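For example, a clustering quick prototype can be created programmatically as well as through the UI. The following is a minimal sketch assuming the dataikuapi methods create_clustering_ml_task and the "KMEANS" guess policy; exact names may differ across DSS versions:
# Inside a DSS Python environment
import dataiku

# Connect to the instance and the current project
client = dataiku.api_client()
project = client.get_default_project()

# Create a clustering task; unlike prediction, no target variable is needed
clustering_task = project.create_clustering_ml_task(
    input_dataset="churn_data",
    ml_backend_type="PY_MEMORY",  # in-memory Python backend
    guess_policy="KMEANS",        # pre-configure with k-means defaults
)
clustering_task.wait_guess_complete()
clustering_task.start_train()
clustering_task.wait_train_complete()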
To create a quick prototype model, users only need to select the dataset they want to use, choose the target variable (for supervised tasks such as prediction), and pick one of the quick prototype templates. Dataiku will then automatically perform the following steps:
- Analyze the dataset and select appropriate feature handling, algorithms, and hyperparameters.
- Perform feature engineering, such as handling missing values, encoding categorical variables, scaling numerical variables, generating interactions, and reducing dimensionality.
- Train and evaluate several models using cross-validation or hold-out validation.
- Compare and rank the models based on various metrics, such as accuracy, precision, recall, F1-score, AUC-ROC, etc.
- Display the results and diagnostics of the best model.
Users can also modify any of the settings or steps of the quick prototype model to fine-tune it or experiment with different options; a sketch of such adjustments follows the sample output below.
Here is a sketch of how a quick prototype model for predicting customer churn might be created with Python code; the method names below follow the Dataiku public API client (dataikuapi) and may vary across DSS versions:
# Inside a DSS Python environment
import dataiku

# Connect to the Dataiku instance and the current project
client = dataiku.api_client()
project = client.get_default_project()

# Create a prediction ML task on the churn dataset,
# with "churn" as the target variable
mltask = project.create_prediction_ml_task(
    input_dataset="churn_data",
    target_variable="churn",
    ml_backend_type="PY_MEMORY",  # in-memory Python backend
    guess_policy="DEFAULT",       # let Dataiku pre-configure the task
)

# Wait while Dataiku analyzes the dataset and configures the task
mltask.wait_guess_complete()

# Train the candidate models and wait for completion
mltask.start_train()
mltask.wait_train_complete()

# Print the performance metrics of each trained model
for model_id in mltask.get_trained_models_ids():
    details = mltask.get_trained_model_details(model_id)
    print(model_id, details.get_performance_metrics())
The output will look something like this, with one line per trained model (values are illustrative):
logistic_regression {'accuracy': 0.861, 'auc': 0.927, 'f1': 0.612, 'precision': 0.765, 'recall': 0.509}
decision_tree {'accuracy': 0.844, ...}
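As noted above, the generated task can be adjusted and retrained. The sketch below assumes the dataikuapi settings methods reject_feature, get_split_params, and set_kfold_validation, and uses a hypothetical customer_id column; check the API reference for your DSS version:
# Continuing from the mltask object created above
settings = mltask.get_settings()

# Exclude an identifier column from the input features
settings.reject_feature("customer_id")  # hypothetical column name

# Switch from hold-out validation to 5-fold cross-validation
settings.get_split_params().set_kfold_validation(n_folds=5)

settings.save()

# Retrain with the new configuration
mltask.start_train()
mltask.wait_train_complete()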
Expert Mode
Expert mode is ideal for users who want more control and customization over their models. Dataiku allows users to choose from a wide range of algorithms and frameworks for different types of ML tasks (a backend-selection sketch follows the list), such as:
- Scikit-learn: A popular Python library for ML that offers various algorithms for classification, regression, clustering, dimensionality reduction, etc.
- XGBoost: A fast and scalable library for gradient boosting that can handle large and complex data.
- TensorFlow: A powerful framework for deep learning that can create and train neural networks for various applications, such as computer vision, natural language processing, etc.
- PyTorch: Another powerful framework for deep learning that offers more flexibility and dynamism for creating and training neural networks.
- MLlib: A distributed framework for ML that runs on Apache Spark and can handle large-scale data processing and analysis.
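In the Python API, the choice of engine maps in part to the ml_backend_type argument of the task constructor: "PY_MEMORY" covers the in-memory Python engines such as scikit-learn and XGBoost, while "MLLIB" runs training on Apache Spark. A minimal sketch, assuming these backend values:
# Create a Spark MLlib-backed prediction task instead of an in-memory one
import dataiku

client = dataiku.api_client()
project = client.get_default_project()

mllib_task = project.create_prediction_ml_task(
    input_dataset="churn_data",
    target_variable="churn",
    ml_backend_type="MLLIB",  # train with Spark MLlib
)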
To create an expert mode model, users select the dataset they want to use, choose the target variable they want to predict, and pick the algorithms or frameworks they want to use. Dataiku will then perform the following steps:
- Analyze the dataset and suggest appropriate feature handling and hyperparameters.
- Perform feature engineering, such as handling missing values, encoding categorical variables, scaling numerical variables, generating interactions, and reducing dimensionality.
- Train and evaluate the model using cross-validation or hold-out validation.
- Display the results and diagnostics of the model.
Users can also modify any of the settings or steps of the expert mode model to fine-tune it or experiment with different options.
Here is a sketch of how an expert mode model for predicting customer churn might be configured with XGBoost using Python code; again, the method names follow the dataikuapi client, and the exact hyperparameter structure depends on the DSS version:
# Inside a DSS Python environment
import dataiku

client = dataiku.api_client()
project = client.get_default_project()

# Create the prediction task on the churn dataset
mltask = project.create_prediction_ml_task(
    input_dataset="churn_data",
    target_variable="churn",
    ml_backend_type="PY_MEMORY",
)
mltask.wait_guess_complete()

# Restrict the task to XGBoost only
settings = mltask.get_settings()
settings.disable_all_algorithms()
settings.set_algorithm_enabled("XGBOOST_CLASSIFICATION", True)

# Inspect and adjust the XGBoost hyperparameters; the exact structure
# of this object varies across DSS versions, so print it before editing
xgb = settings.get_algorithm_settings("XGBOOST_CLASSIFICATION")
print(xgb)
xgb["max_depth"] = [6]             # candidate tree depths
xgb["learning_rate"] = [0.1]       # candidate learning rates
xgb["n_estimators"] = [100]        # number of boosting rounds
xgb["subsample"] = [0.8]           # row sampling ratio
xgb["colsample_bytree"] = [0.8]    # column sampling ratio
settings.save()

# Train the model and print its performance metrics
mltask.start_train()
mltask.wait_train_complete()
for model_id in mltask.get_trained_models_ids():
    details = mltask.get_trained_model_details(model_id)
    print(model_id, details.get_performance_metrics())
The output will look something like this (values are illustrative):
xgboost_classification {'accuracy': 0.869, 'auc': 0.933, 'f1': 0.648, 'precision': 0.791, 'recall': 0.548}
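Once the model looks good, it can be deployed to the project's flow for scoring and retraining. A minimal sketch, assuming the DSSMLTask.deploy_to_flow method and these argument names (verify the signature for your DSS version):
# Deploy the trained model to the flow as a saved model
ids = mltask.get_trained_models_ids()
mltask.deploy_to_flow(
    ids[0],                      # id of the model to deploy
    model_name="churn_xgboost",  # hypothetical saved model name
    train_dataset="churn_data",  # dataset the model was trained on
)
From there, the saved model can feed score recipes or be retrained as new data arrives.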