BigQuery ML is a fun and easy way to get into the world of machine learning. It’s a tool that lets you build and deploy models directly within BigQuery, the popular Google Cloud data warehouse. You don’t need to know any fancy programming languages like Python or R, just the SQL skills you already have. With BigQuery ML, you can create, train and use models to make predictions all in one place. It’s a great option for anyone looking to add some extra intelligence to their big data without the added complexity.
As described earlier, BigQuery ML is a machine-learning tool that lets you build and run models directly within BigQuery, so you never have to move your data out of the warehouse. Pretty cool, right?
Now, let’s dive into how it actually works. Essentially, BigQuery ML uses SQL-based modeling: instead of writing code in a language like Python or R, you define and build your models in SQL. For example, you can use SQL to create a model that predicts customer churn or forecasts sales.
Once you’ve defined your model using SQL, BigQuery ML takes care of the heavy lifting in terms of training and evaluating the model. This includes things like splitting your data into training and test sets, applying machine learning algorithms, and evaluating the model’s performance.
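For example, once a model has been trained you can inspect those evaluation metrics yourself with the `ML.EVALUATE` function. A minimal sketch, assuming the `google-cloud-bigquery` client library is installed and using a placeholder model name:

```python
def build_evaluate_query(model_path: str) -> str:
    """Build an ML.EVALUATE query for an existing BigQuery ML model."""
    return f"SELECT * FROM ML.EVALUATE(MODEL `{model_path}`)"

# With credentials configured, submitting the query returns one row of
# metrics (e.g. r2_score for linear regression, log_loss for logistic):
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   for row in client.query(build_evaluate_query("my_dataset.my_model")):
#       print(dict(row))
print(build_evaluate_query("my_dataset.my_model"))
```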
One of the great things about BigQuery ML is that it supports a variety of model types, including linear and logistic regression, k-means clustering, and deep neural networks. This means you can use the tool for a wide range of use cases, from simple regressions to more complex network models.
Once your model is trained and evaluated, you can then use it to make predictions on new data. This is done by running a SQL query that includes the model and the new data you want to make predictions on. The model then generates predictions, which can be used to make decisions or take action.
With BigQuery ML, you can easily create and train ML models without having to worry about managing complex infrastructure or dealing with complicated programming languages.
The first step in creating a model with BigQuery ML is to select the dataset that you want to use for training. You can do this by simply selecting the appropriate table from your BigQuery dataset. Once you’ve selected the dataset, you can then choose the type of model that you want to create. BigQuery ML supports a variety of models including linear regression, logistic regression, and deep neural networks.
Once you’ve chosen the type of model, you can start training it. Training runs as a regular BigQuery job, so for typical tabular datasets you don’t have to wait around for hours or days for your model to be trained. You can monitor the job’s progress in real time, and the job completes when training is done.
Below, I provide a simple Python example that uses the BigQuery client library to orchestrate training inside BigQuery. There is no separate BigQuery ML SDK: models are defined with SQL `CREATE MODEL` statements, which the standard client submits like any other query. The code creates a linear regression model using the `boston_housing` dataset:
# Import the BigQuery client library
from google.cloud import bigquery

# Connect to BigQuery
client = bigquery.Client()

# Define the dataset and table that you want to use for training
dataset_id = "boston_housing"
table_id = "training_data"

# BigQuery ML models are defined in SQL. The label column (MEDV) is
# named in input_label_cols; every other selected column is a feature.
create_model_sql = f"""
CREATE OR REPLACE MODEL `{dataset_id}.boston_housing_model`
OPTIONS(model_type='linear_reg', input_label_cols=['MEDV']) AS
SELECT CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT, MEDV
FROM `{dataset_id}.{table_id}`
"""

# Running the statement both creates and trains the model
client.query(create_model_sql).result()
This code connects to BigQuery, uses the `training_data` table in the `boston_housing` dataset as the data source, and selects the feature columns together with the `MEDV` label column. The `CREATE MODEL` statement then creates and trains the linear regression model in a single step. Note that in order to use this code you need credentials with access to your BigQuery project, and the `boston_housing` dataset must already exist in BigQuery.
This is just a simple example of how you can use BigQuery ML; once you have predictions, you can also export them to a new table or use the results for further analysis and visualization.
The first step in making predictions with BigQuery ML is to select the trained model that you want to use, which you can do by picking the appropriate model from your BigQuery dataset. Then you provide the data you want predictions for, either inline in the query or by selecting a table from your BigQuery dataset.
Once you’ve provided the data, you can run the prediction. Predictions run as an ordinary BigQuery query, so results typically come back in seconds or minutes rather than hours, and you can monitor the query’s progress in real time.
Below, I continue the earlier example and show how it’s done. Once you’ve trained your model, you can make predictions on new data. Here’s how you can use the `boston_housing_model` to make predictions on a table called `test_data`:
# Build the prediction query: ML.PREDICT takes the model and a
# subquery over the table to score. The output column is named
# predicted_<label>, here predicted_MEDV.
predict_sql = f"""
SELECT predicted_MEDV
FROM ML.PREDICT(MODEL `{dataset_id}.boston_housing_model`,
  (SELECT * FROM `{dataset_id}.test_data`))
"""

# Run the query and print the predictions
for row in client.query(predict_sql).result():
    print("Predicted median value: {:.2f}".format(row["predicted_MEDV"]))
This code scores each row of the `test_data` table with the `boston_housing_model` that we created earlier. The `ML.PREDICT` function takes the model to use and a subquery over the input table; BigQuery names the output column after the label, so the predictions appear as `predicted_MEDV`. The query is then executed and the results are printed, showing the predicted median value for each row in the test data table.
One remaining question is how to change the model parameters. Below, I use a logistic regression model to explain how.
# Import the BigQuery client library
from google.cloud import bigquery

# Connect to BigQuery
client = bigquery.Client()

# Define the dataset and table that you want to use for training
dataset_id = "my_dataset"
table_id = "training_data"

# Model parameters live in the OPTIONS clause of CREATE MODEL
create_model_sql = f"""
CREATE OR REPLACE MODEL `{dataset_id}.logistic_regression_model`
OPTIONS(model_type='logistic_reg', input_label_cols=['customer_churn']) AS
SELECT age, income, education, customer_churn
FROM `{dataset_id}.{table_id}`
"""

# Running the statement creates and trains the model
client.query(create_model_sql).result()
This code is similar to the previous example, but it uses a logistic regression model (`model_type='logistic_reg'`) instead of linear regression, along with different column names and a different dataset name.
To change the parameters of the logistic regression model, you add options to the `OPTIONS` clause of the `CREATE MODEL` statement. For example, you can set a constant learning rate by extending the clause like this:
OPTIONS(model_type='logistic_reg', input_label_cols=['customer_churn'],
        learn_rate_strategy='constant', learn_rate=0.1)
This sets the learning rate of the model to 0.1 (in BigQuery ML, a constant `learn_rate` also requires `learn_rate_strategy='constant'`). You can also change other options, such as the maximum number of iterations or the regularization strength. To see the full list of options, check the BigQuery ML `CREATE MODEL` documentation. However, I’ll give you a snippet of it with some examples.
Maximum number of training iterations:
max_iterations=20
Regularization: you can change the amount of L1 or L2 regularization applied to the model with the options l1_regularization
or l2_reg
(for L1 use l1_reg) followed by the regularization value.
l2_reg=0.001
Learning-rate strategy: you can change how the learning rate is chosen during training by passing learn_rate_strategy
with either 'line_search' (the default) or 'constant'.
learn_rate_strategy='line_search'
Early stopping: you can control whether training stops once the relative loss improvement falls below a threshold with early_stop
and min_rel_progress
.
early_stop=TRUE, min_rel_progress=0.01
Data split: you can change how BigQuery holds out evaluation data by passing data_split_method
(for example 'auto_split', 'random', or 'no_split') and, for a random split, data_split_eval_fraction
.
data_split_method='random', data_split_eval_fraction=0.2
It’s worth noting that changing the parameters of the model may affect the accuracy and performance of the model, so you should experiment with different values and see what works best for your data and use case.
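One way to run such an experiment is to generate one `CREATE MODEL` statement per candidate value and then compare the resulting models with `ML.EVALUATE`. A sketch, where the dataset, table, and column names are placeholders:

```python
def build_training_query(model_path: str, table_path: str,
                         label: str, l2_reg: float) -> str:
    """Build a CREATE MODEL statement for one l2_reg candidate."""
    suffix = str(l2_reg).replace(".", "_")
    return (
        f"CREATE OR REPLACE MODEL `{model_path}_l2_{suffix}` "
        f"OPTIONS(model_type='logistic_reg', input_label_cols=['{label}'], "
        f"l2_reg={l2_reg}) AS SELECT * FROM `{table_path}`"
    )

# One candidate model per regularization value; each statement would be
# submitted with client.query(...).result(), then compared via ML.EVALUATE.
queries = [build_training_query("my_dataset.churn_model",
                                "my_dataset.training_data",
                                "customer_churn", v)
           for v in (0.0, 0.001, 0.01)]
print(queries[2])
```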
The examples I’ve shown so far are driven from Python. I also want to show the underlying SQL structure of BigQuery ML, that is, how you would train and predict if you are using the BigQuery UI or a BigQuery orchestration tool like dbt.
Here’s an example of how you can train a logistic regression model and use it to make predictions:
- Training: To train a logistic regression model, you use the `CREATE MODEL` statement in BigQuery. Here’s an example that trains a logistic regression model called `my_model` using the `my_table` table as the training data. To change the parameters of the model, you add options to the `OPTIONS` clause.
CREATE MODEL `my_project.my_dataset.my_model`
OPTIONS(model_type='logistic_reg', input_label_cols=['customer_churn'],
        learn_rate_strategy='constant', learn_rate=0.1, max_iterations=20)
AS
SELECT
age,
income,
education,
customer_churn
FROM my_project.my_dataset.my_table;
2. Predicting: After training the model, you can use it to make predictions on new data. Here’s how you can use `my_model` to make predictions on a table called `test_data`:
SELECT
age,
income,
education,
predicted_customer_churn
FROM
ML.PREDICT(MODEL my_project.my_dataset.my_model,
(SELECT
age,
income,
education
FROM my_project.my_dataset.test_data))
In this example, the `ML.PREDICT` function is used to make predictions on the `test_data` table. It takes two arguments: the model that you want to use, and the table (or subquery) that you want to make predictions on. The outer `SELECT` retrieves the prediction results; for a logistic regression the predicted label appears as `predicted_customer_churn`.
BigQuery materializes the training results as a model. When you materialize a model, BigQuery stores it as a first-class object in your dataset, much like a table, containing the learned parameters, training statistics, and other metadata, so you can use it later to make predictions.
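You can query that stored information directly. For instance, `ML.TRAINING_INFO` returns per-iteration training statistics and `ML.WEIGHTS` returns the learned coefficients of a linear or logistic regression model; the model name below is a placeholder:

```python
def build_training_info_query(model_path: str) -> str:
    """Per-iteration training statistics (loss, learning rate, duration)."""
    return f"SELECT * FROM ML.TRAINING_INFO(MODEL `{model_path}`)"

def build_weights_query(model_path: str) -> str:
    """Learned coefficients of a (generalized) linear model."""
    return f"SELECT * FROM ML.WEIGHTS(MODEL `{model_path}`)"

# With the google-cloud-bigquery client and credentials configured:
#   for row in client.query(build_training_info_query("my_dataset.my_model")):
#       print(row["iteration"], row["loss"])
print(build_weights_query("my_dataset.my_model"))
```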
BigQuery ML is a powerful tool that can be used for a variety of different use cases. Some of the most popular use cases for BigQuery ML include:
- Predictive modeling: BigQuery ML can be used to create predictive models that can be used to predict future events or trends. For example, you can use BigQuery ML to create a model that predicts sales or customer behavior.
- Natural Language Processing: BigQuery ML can be used to process natural language text. For example, you can use BigQuery ML to classify text into categories like sentiment, topic, or language.
- Recommender Systems: BigQuery ML can be used to create recommender systems that can suggest products, content, or other items to users based on their preferences.
- Anomaly Detection: BigQuery ML can be used to detect anomalies in data. For example, you can use BigQuery ML to detect unusual patterns or outliers in financial or sensor data.
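As a sketch of the anomaly-detection case, one common pattern is to fit a k-means model and then flag rows far from every centroid with the `ML.DETECT_ANOMALIES` function. The dataset, table, and model names below are placeholders, and the contamination rate is just an illustrative value:

```python
def build_anomaly_queries(model_path: str, table_path: str) -> tuple:
    """Build the training and detection statements for k-means anomaly detection."""
    train = (
        f"CREATE OR REPLACE MODEL `{model_path}` "
        f"OPTIONS(model_type='kmeans', num_clusters=4) AS "
        f"SELECT * FROM `{table_path}`"
    )
    # contamination is the expected fraction of anomalous rows
    detect = (
        f"SELECT * FROM ML.DETECT_ANOMALIES(MODEL `{model_path}`, "
        f"STRUCT(0.02 AS contamination), "
        f"(SELECT * FROM `{table_path}`))"
    )
    return train, detect

train_sql, detect_sql = build_anomaly_queries("my_dataset.sensor_kmeans",
                                              "my_dataset.sensor_readings")
print(detect_sql)
```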
Integrating BigQuery ML with other GCP services is very easy, as they all belong to the same provider and their APIs work seamlessly with each other. It’s especially convenient if you manage your infrastructure with an IaC tool like Terraform (which is a story for another article!). With BigQuery ML, you can easily connect your ML models to other GCP services to create powerful and sophisticated data pipelines.
One of the most popular ways to integrate BigQuery ML with other services is by using Cloud Dataflow. Cloud Dataflow allows you to create data pipelines that can process and transform large amounts of data in real-time. You can use Cloud Dataflow to clean, preprocess, and transform your data before it’s used to train your BigQuery ML models. This way you can ensure that your models are using the most accurate and up-to-date data possible.
Another great way to integrate BigQuery ML with other services is by using Cloud Storage. Cloud Storage is a powerful and scalable object storage service that allows you to store and access large amounts of data. You can use Cloud Storage to store your training data and then easily access it from BigQuery ML when you’re ready to train your models.
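As a sketch of that pattern, a table can be filled straight from a CSV in Cloud Storage with a `LOAD DATA` statement and then referenced in `CREATE MODEL` (the Python client's `load_table_from_uri` is an alternative). The bucket, dataset, and table names below are placeholders:

```python
def build_load_query(table_path: str, gcs_uri: str) -> str:
    """Build a LOAD DATA statement filling a BigQuery table from a CSV in Cloud Storage."""
    return (
        f"LOAD DATA OVERWRITE `{table_path}` "
        f"FROM FILES (format='CSV', uris=['{gcs_uri}'], skip_leading_rows=1)"
    )

query = build_load_query("my_dataset.training_data",
                         "gs://my-bucket/training_data.csv")
# client.query(query).result()  # afterwards, CREATE MODEL can read the table
print(query)
```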
You can also integrate BigQuery ML with Vertex AI to go beyond the built-in model types, for example by registering BigQuery ML models in the Vertex AI Model Registry or importing TensorFlow models into BigQuery for prediction. This way you can use the power of Google’s ML platform to achieve even better results.
Finally, you can also integrate BigQuery ML with Bigtable and Dataproc to create powerful data pipelines that can process and analyze large amounts of data in real-time. This way you can make the most of your data and extract insights that can help you make better decisions and improve your business outcomes.
While BigQuery ML is a powerful tool for creating and training machine learning models, there are a few limitations and considerations that you should keep in mind when using it.
First and foremost, BigQuery ML is designed to work with structured data, which means that it’s not the best tool for working with unstructured data such as text, images, or videos. If you need to work with unstructured data, you’ll need to pre-process it and convert it into a structured format before you can use it with BigQuery ML.
Another limitation of BigQuery ML is that it’s not as flexible as other machine learning tools when it comes to model customization. While you can choose from a variety of pre-built models, you won’t have as much control over the specifics of the model as you would with other tools.
Additionally, BigQuery ML is designed to work with large datasets, so if you’re working with smaller datasets, you may find that it’s not the most efficient option.
Finally, it’s worth noting that BigQuery ML is a cloud-based tool, so you’ll need an internet connection to use it. If you’re working in an area with spotty internet, you may experience some difficulties when using BigQuery ML.
Despite these limitations, BigQuery ML is still a great tool for creating and training machine learning models, especially if you’re working with structured data and large datasets. Just keep these limitations and considerations in mind when deciding whether or not to use BigQuery ML for your project.
- Google Cloud documentation: The official Google Cloud documentation provides a comprehensive guide to BigQuery ML, including detailed explanations of the features and functionality, as well as tutorials and examples.
- Google Cloud blog: The Google Cloud blog features articles and case studies about BigQuery ML and how it is being used by organizations to solve real-world problems.
- Kaggle: Kaggle is a platform for data science and machine learning competitions, and it features a number of tutorials and examples that demonstrate how to use BigQuery ML for different tasks.
- Coursera: Coursera offers a number of online courses on BigQuery and BigQuery ML, including a course by Google Cloud called “BigQuery for Data Analysts.”