Experimenting with Convolutional Neural Networks (CNNs) from scratch in Rust.
In my previous article (Part 1) I started my experiment to develop a machine learning framework in Rust from scratch. The main aim of the experiment was to gauge the model training speed improvements that can be attained by using Rust in conjunction with PyTorch, compared to a Python equivalent. The results were very encouraging for feedforward networks. In this article I continue building on that work, with the main objective of being able to define and train Convolutional Neural Networks (CNNs). As in the previous article, I continue to make use of the Tch-rs Rust crate as a wrapper around the PyTorch C++ library LibTorch, primarily to access the tensor linear algebra and autograd functions; the rest is developed from scratch. The code for Parts 1 and 2 is now available on Github (Link).
The final outcome from this article allows one to define Convolutional Neural Networks (CNNs) in Rust as follows:
Listing 1 — Defining my CNN model.
struct MyModel {
    l1: Conv2d,
    l2: Conv2d,
    l3: Linear,
    l4: Linear,
}

impl MyModel {
    fn new (mem: &mut Memory) -> MyModel {
        let l1 = Conv2d::new(mem, 5, 1, 10, 1);
        let l2 = Conv2d::new(mem, 5, 10, 20, 1);
        let l3 = Linear::new(mem, 320, 64);
        let l4 = Linear::new(mem, 64, 10);
        Self {
            l1: l1,
            l2: l2,
            l3: l3,
            l4: l4,
        }
    }
}
impl Compute for MyModel {
    fn forward (&self, mem: &Memory, input: &Tensor) -> Tensor {
        let mut o = self.l1.forward(mem, &input);
        o = o.max_pool2d_default(2);
        o = self.l2.forward(mem, &o);
        o = o.max_pool2d_default(2);
        o = o.flat_view();
        o = self.l3.forward(mem, &o);
        o = o.relu();
        o = self.l4.forward(mem, &o);
        o
    }
}
… and then instantiate and train as follows:
Listing 2 — Training the CNN model.
fn main() {
    let (mut x, y) = load_mnist();
    x = x / 250.0;
    x = x.view([-1, 1, 28, 28]);

    let mut m = Memory::new();
    let mymodel = MyModel::new(&mut m);
    train(&mut m, &x, &y, &mymodel, 20, 512, cross_entropy, 0.0001);
    let out = mymodel.forward(&m, &x);
    println!("Accuracy: {}", accuracy(&y, &out));
}
By keeping the model definition as similar as possible to a Python equivalent, Listing 1 above should be quite intuitive for Python-PyTorch users. In the MyModel struct we can now add Conv2d layers and initialize them in the associated function new. In the Compute trait implementation, the forward function is defined and takes the input through all the layers, including the intermediate max-pooling operations. In the main function (Listing 2), similar to the previous article, we train the model and apply it to the MNIST dataset.
In the next sections I describe what goes on under the hood to make it possible to define and train CNNs in this manner. It is assumed that the reader has followed my first article (Part 1), hence in this article I will focus only on the new additions to the framework.
The distinguishing characteristic of convolutional networks is that in at least one layer we apply a convolution instead of a general matrix multiplication. The purpose of the convolution operation is to use a kernel to extract certain features of interest from an input image. A kernel is a matrix that is slid across sub-sections of the image (the input) and multiplied with them, such that the output is a transformation of the input in a certain desirable manner (see diagram below).
In the two-dimensional case, where we use a two-dimensional image I as our input, we typically also use a two-dimensional kernel K, resulting in the following convolution calculation:
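Equation 1, the standard discrete 2D convolution:

S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(m, n) \, K(i - m, j - n)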
As can be deduced from Equation 1, a naïve algorithm for applying a convolution is quite costly from a computational perspective due to the significant amount of looping and multiplication involved. To make it worse, this calculation has to be repeated for each convolutional layer in the network and for each training example/batch. Hence, before extending my library from Part 1 to handle CNNs, the first step was to investigate an efficient way to calculate convolutions.
Finding efficient ways to calculate convolutions is a well-researched problem (see Link). After investigating different options, which included some pure-Rust versions that required intermediate data transformations from PyTorch tensors, I opted to use the LibTorch C++ convolution function. To experiment with this function, I wanted to create a small toy program that takes a color image, converts it to grayscale, and then applies some known kernels to perform edge detection.
I first asked Microsoft Bing chat to generate an image for me. Once I was happy with the image, I wanted to apply the convolution function using a Gaussian kernel first, followed by a Laplacian kernel.
The kernels were applied using the LibTorch C++ method conv2d, which is exposed via Tch-rs as:
Listing 3 — LibTorch conv2d method as exposed via Tch-rs.
pub fn conv2d<T: Borrow<Tensor>>(
    &self,
    weight: &Tensor,
    bias: Option<T>,
    stride: impl IntList,
    padding: impl IntList,
    dilation: impl IntList,
    groups: i64
) -> Tensor
My final toy program is shown below:
Listing 4 — Taking an image and applying convolution operations for edge detection.
use tch::{Tensor, vision::image, Kind, Device};

fn rgb_to_grayscale(tensor: &Tensor) -> Tensor {
    let red_channel = tensor.get(0);
    let green_channel = tensor.get(1);
    let blue_channel = tensor.get(2);

    // Calculate the grayscale tensor using the luminance formula
    let grayscale = (red_channel * 0.2989) + (green_channel * 0.5870) + (blue_channel * 0.1140);
    grayscale.unsqueeze(0)
}

fn main() {
    let mut img = image::load("mypic.jpg").expect("Failed to open image");
    img = rgb_to_grayscale(&img).reshape(&[1, 1, 1024, 1024]);
    let bias: Tensor = Tensor::full(&[1], 0.0, (Kind::Float, Device::Cpu));

    // Define and apply Gaussian (smoothing) kernel: [1 2 1; 2 4 2; 1 2 1] / 16
    let mut k1 = [1.0, 2.0, 1.0, 2.0, 4.0, 2.0, 1.0, 2.0, 1.0];
    for element in k1.iter_mut() {
        *element /= 16.0;
    }
    let kernel1 = Tensor::from_slice(&k1)
        .reshape(&[1, 1, 3, 3])
        .to_kind(Kind::Float);
    img = img.conv2d(&kernel1, Some(&bias), &[1], &[0], &[1], 1);

    // Define and apply Laplacian kernel for edge detection
    let k2 = [0.0, 1.0, 0.0, 1.0, -4.0, 1.0, 0.0, 1.0, 0.0];
    let kernel2 = Tensor::from_slice(&k2)
        .reshape(&[1, 1, 3, 3])
        .to_kind(Kind::Float);
    img = img.conv2d(&kernel2, Some(&bias), &[1], &[0], &[1], 1);

    image::save(&img, "filtered.jpg").expect("Failed to save image");
}
The result of the operation is the following:
In this toy program we applied our chosen kernels for the convolution, transforming the original image into its edges (our desired features). In the next section I describe how this idea is incorporated in a CNN, with the main difference being that the values of the kernel matrix are chosen by the network during training; that is, the network itself decides which features to extract from the image by tweaking the kernel.
A typical CNN structure includes a number of convolution layers, each followed by a subsampling (pooling) layer, which then typically feed into fully connected layers. Pooling contributes greatly to parameter reduction, as it down-samples the input data. The diagram below depicts one of the earliest CNNs, LeNet-5.
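To make this concrete, here is a rough sketch of a LeNet-5-style stack written with the same Conv2d and Linear building blocks used in Listing 1. The sizes are illustrative and assume 28x28 single-channel inputs with 5x5 kernels, stride 1, no padding and 2x2 max pooling (the original LeNet-5 used 32x32 inputs and average pooling); the forward pass would mirror Listing 1.

// Illustrative only: with the layers below and 28x28 inputs, the flattened
// size entering the fully connected layers is 16 * 4 * 4 = 256.
struct LeNetStyle {
    c1: Conv2d,   // 1x28x28 -> 6x24x24, pooled to 6x12x12
    c2: Conv2d,   // 6x12x12 -> 16x8x8, pooled to 16x4x4
    f1: Linear,
    f2: Linear,
    f3: Linear,
}

impl LeNetStyle {
    fn new(mem: &mut Memory) -> Self {
        Self {
            c1: Conv2d::new(mem, 5, 1, 6, 1),
            c2: Conv2d::new(mem, 5, 6, 16, 1),
            f1: Linear::new(mem, 256, 120),
            f2: Linear::new(mem, 120, 84),
            f3: Linear::new(mem, 84, 10),
        }
    }
}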
In Part 1 we already defined a simple framework that included a fully connected layer. Similarly, what we need to do now is add a definition of a convolutional layer to the framework so that it is available when defining new network architectures (as in Listing 1). The other thing to keep in mind is that in the toy program (Listing 4) both the kernel matrix and the bias were set as fixed; now we need to define them as network parameters that are trained by the training algorithm, hence we need to keep track of their gradients and update them accordingly.
The new Convolutional Layer, Conv2d, is defined as follows:
Listing 5 — The new Conv2d layer.
pub struct Conv2d {
    params: HashMap<String, usize>,
}

impl Conv2d {
    pub fn new (mem: &mut Memory, kernel_size: i64, in_channel: i64, out_channel: i64, stride: i64) -> Self {
        let mut p = HashMap::new();
        p.insert("kernel".to_string(), mem.new_push(&[out_channel, in_channel, kernel_size, kernel_size], true));
        p.insert("bias".to_string(), mem.push(Tensor::full(&[out_channel], 0.0, (Kind::Float, Device::Cpu)).requires_grad_(true)));
        p.insert("stride".to_string(), mem.push(Tensor::from(stride as i64)));
        Self {
            params: p,
        }
    }
}

impl Compute for Conv2d {
    fn forward (&self, mem: &Memory, input: &Tensor) -> Tensor {
        let kernel = mem.get(self.params.get(&"kernel".to_string()).unwrap());
        let stride: i64 = mem.get(self.params.get(&"stride".to_string()).unwrap()).int64_value(&[]);
        let bias = mem.get(self.params.get(&"bias".to_string()).unwrap());
        input.conv2d(&kernel, Some(bias), &[stride], 0, &[1], 1)
    }
}
If you recall my approach from Part 1, the struct contains a field named params. The params field is a HashMap whose key, of type String, stores a parameter name, and whose value, of type usize, holds the location of the specific parameter (a PyTorch tensor) in our Memory, which in turn acts as the store for all our model parameters. In the case of the convolutional layer, in the associated function new we insert into the HashMap two parameters, "kernel" and "bias", whose tensors are created with the requires_grad flag set to true. I also add a "stride" parameter, however this is not a trainable parameter.
We then implement the Compute trait for our convolutional layer. This requires defining the function forward, which is called during the forward pass of training. In this function, we first obtain references to the kernel, bias and stride tensors from our tensor store using the get method, and then call the conv2d function (as we did in the toy program, except that here the network decides what kernel to use). Padding was hard-coded to zero, however it can easily be added as a parameter in the same way as stride, as sketched below.
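As an illustrative sketch of that extension (my own addition, not part of the framework as written; it assumes a hypothetical extra padding argument on new), only the lines that would change are shown:

// Sketch only: store padding like stride, i.e. as a non-trainable tensor in Memory.
// In Conv2d::new, given a hypothetical extra argument `padding: i64`:
p.insert("padding".to_string(), mem.push(Tensor::from(padding)));

// In Conv2d::forward, read it back and pass it to conv2d in place of the hard-coded 0:
let padding: i64 = mem.get(self.params.get(&"padding".to_string()).unwrap()).int64_value(&[]);
input.conv2d(&kernel, Some(bias), &[stride], &[padding], &[1], 1)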
And that’s it! That is the only addition required in our little framework from Part 1 to be able to define and train CNNs as in Listing 1–2.
In my previous article I implemented two training algorithms, Stochastic Gradient Descent and Stochastic Gradient Descent with Momentum. However, probably one of the most popular training algorithms today is Adam, so why not code it in Rust too!
The Adam algorithm was first published in 2015 (Link) and essentially combines the ideas behind the Momentum and RMSProp training algorithms. The update rules from the original paper are as follows:
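For step t, with gradient g_t, learning rate \alpha, exponential decay rates \beta_1 and \beta_2 (with defaults 0.9 and 0.999), and a small constant \epsilon, the moment estimates and parameter update are:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
\hat{m}_t = m_t / (1 - \beta_1^t), \qquad \hat{v}_t = v_t / (1 - \beta_2^t)
\theta_t = \theta_{t-1} - \alpha \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)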
In Part 1 we implemented our tensor Memory, which also provides the gradient-step methods equivalent to PyTorch's optimizer step (apply_grads_sgd and apply_grads_sgd_momentum). Hence a new method is added to the Memory implementation that performs the gradient update using Adam:
Listing 6 — Our implementation of Adam.
fn apply_grads_adam(&mut self, learning_rate: f32) {
    let mut g = Tensor::new();
    // Single decay rate used for both moment estimates (the paper uses
    // separate beta1 and beta2, typically 0.9 and 0.999).
    const BETA: f32 = 0.9;

    let mut velocity = Tensor::zeros(&[self.size as i64], (Kind::Float, Device::Cpu)).split(1, 0);
    let mut mom = Tensor::zeros(&[self.size as i64], (Kind::Float, Device::Cpu)).split(1, 0);
    let mut vel_corr = Tensor::zeros(&[self.size as i64], (Kind::Float, Device::Cpu)).split(1, 0);
    let mut mom_corr = Tensor::zeros(&[self.size as i64], (Kind::Float, Device::Cpu)).split(1, 0);
    let mut counter = 0;

    self.values
        .iter_mut()
        .for_each(|t| {
            if t.requires_grad() {
                g = t.grad();
                // First and second moment estimates
                mom[counter] = BETA * &mom[counter] + (1.0 - BETA) * &g;
                velocity[counter] = BETA * &velocity[counter] + (1.0 - BETA) * (&g.pow(&Tensor::from(2)));
                // Bias correction (simplified: a fixed (1 - BETA)^2 rather than 1 - beta^t)
                mom_corr[counter] = &mom[counter] / (Tensor::from(1.0 - BETA).pow(&Tensor::from(2)));
                vel_corr[counter] = &velocity[counter] / (Tensor::from(1.0 - BETA).pow(&Tensor::from(2)));
                // Parameter update
                t.set_data(&(t.data() - learning_rate * (&mom_corr[counter] / (&velocity[counter].sqrt() + 0.0000001))));
                t.zero_grad();
            }
            counter += 1;
        });
}
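To show where this fits, below is a rough, hypothetical sketch of a single training step; it assumes the train function from Part 1 computes the loss on a mini-batch, runs backpropagation, and then delegates the parameter update to Memory (the batch variables are illustrative):

// Hypothetical sketch of one training step; the actual train() loop from Part 1 may differ.
let out = mymodel.forward(&m, &batch_x);      // forward pass on a mini-batch
let loss = cross_entropy(&out, &batch_y);     // loss function passed to train() in Listing 2
loss.backward();                              // LibTorch autograd fills in the gradients
m.apply_grads_adam(0.0001);                   // Adam step instead of apply_grads_sgd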
Similar to my approach in Part 1, to compare the above code with a Python-PyTorch equivalent I tried to be as faithful as possible to get a fair comparison, mainly ensuring that I apply the same neural network hyper-parameters, training parameters, and training algorithms. For my tests, I also used the same MNIST dataset. I ran the tests on the same laptop, a Surface Pro 8 with an i7 and 16GB of RAM, hence no GPU.
After running the tests multiple times, on average the Rust training resulted in a 60% speed improvement over the Python equivalent. Although a significant improvement, this was less than that achieved in the case of FFNs (my findings from Part 1). I attribute this smaller gain to the fact that the most expensive computation in a CNN is the convolution, and as discussed above I opted to use the LibTorch C++ conv2d function, which at the end of the day is the same function being called by the Python equivalent. That said, cutting model training time by more than half is still not to be disregarded; this would still typically mean hours, if not days, saved!
Hope you enjoyed my article!
Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning, MIT Press, 2016. http://www.deeplearningbook.org
Pavel Karas and David Svoboda, Algorithms for Efficient Computation of Convolution, in Design and Architectures for Digital Signal Processing, New York, NY, USA: IntechOpen, Jan. 2013.
LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P., Gradient-based learning applied to document recognition, Proceedings of the IEEE 86, 2278–2324, 1998.
Diederik P Kingma and Jimmy Ba, Adam: A method for stochastic optimization, in Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.