![](https://crypto4nerd.com/wp-content/uploads/2024/02/1xaIsROevXai28wRVYAT8ag-1024x1022.png)
Today we tackle a Kaggle competition known as PlantTraits2024. Its goal is to understand global patterns of biodiversity in plants by predicting six plant traits from crowdsourced plant images and ancillary data. This research matters because plant traits are key indicators of ecosystem properties such as diversity and productivity.
The competition provides us with training images, test images, a train CSV file, a test CSV file, and other supporting files.
We first try to understand the outputs of our dataset. A description of the six target variables is given below
We can plot the distribution of the six traits, which gives the following
The distributions indicate that log scaling should be applied in order to stabilize every trait except X4.
Now, we can observe the images and see that extracting any useful features requires treating this as a vision problem. A Convolutional Neural Network (CNN) is the natural tool for extracting predictive features from the images.
Other features were also provided alongside the images, such as temperature readings, soil readings, and geolocation information. We can find a way to associate this information with the images when training a CNN.
The competition is scored with the R2 metric, also known as the coefficient of determination, a commonly used metric for regression models.
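As a quick illustration (not the competition's scoring code), R2 can be computed as 1 minus the ratio of residual variance to total variance; the competition evaluates this across the six traits.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy example: predictions close to the targets give R2 near 1
y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([2.1, 3.9, 6.2, 7.8])
print(r2_score(y_true, y_pred))  # ~0.995
```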
We first observe how well we would perform if we simply trained a conventional CNN. Here, we use a pretrained ResNet50 model from Hugging Face and observe its performance.
The images were also resized to 224×224, and the pixel values were normalized with a mean and standard deviation in the range preferred by neural networks. A validation set of 10% of the training data was also held out. We trained the network for 10 epochs with a batch size of 32.
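A minimal sketch of this setup with a stand-in dataset (random tensors replace the real images here):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in for the real image dataset: random 3x224x224 "images" with 6 targets
dataset = TensorDataset(torch.randn(1000, 3, 224, 224), torch.randn(1000, 6))

# Hold out 10% of the training data for validation
val_size = int(0.1 * len(dataset))
train_set, val_set = random_split(dataset, [len(dataset) - val_size, val_size])

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32, shuffle=False)
```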
The optimizer used was Adam, which is common for CNNs. A log10 scaling was also applied to the output variables during training (except X4_mean). The loss function was an R2-based loss scaled to be capped at 1. We would like to note that a higher R2 value means better predictive power.
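The exact loss code is not shown in this post; below is a minimal sketch of one plausible formulation, where the loss is 1 - R2 averaged over the six traits, so that a perfect fit yields 0.

```python
import torch

def r2_loss(y_pred, y_true, eps=1e-8):
    """Sketch of an R2-based loss: mean over traits of (1 - R2).
    Minimizing this maximizes R2; a perfect fit gives a loss of 0."""
    ss_res = ((y_true - y_pred) ** 2).sum(dim=0)
    ss_tot = ((y_true - y_true.mean(dim=0)) ** 2).sum(dim=0) + eps
    r2 = 1.0 - ss_res / ss_tot
    return (1.0 - r2).mean()
```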
The model finished training in 1611.968 seconds. The loss-epoch graph suggests the model may not generalize well, given the gap between the validation and training curves. We verify this with our test dataset.
However, the test data can only be evaluated through the Kaggle competition, so the submission result turned out to be the following
The performance turned out to be poor. However, we can apply other techniques to improve the model.
We try to further improve our ResNet50 model with more aggressive data augmentation and a learning rate scheduler.
The learning rate scheduler uses a maximum learning rate of 0.054946917 and a weight decay of 0.01. The total number of steps scales with the training and validation dataset sizes, the batch size, and the number of epochs.
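The post does not name the scheduler; here is a sketch assuming PyTorch's OneCycleLR, with the step count derived from dataset size, batch size, and epochs:

```python
import math
import torch

model = torch.nn.Linear(10, 6)  # stand-in for the real model

# Adam with the stated weight decay
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.01)

# Total steps scale with dataset size, batch size, and epoch count
n_train, batch_size, epochs = 50_000, 32, 10  # n_train is illustrative
total_steps = math.ceil(n_train / batch_size) * epochs

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.054946917, total_steps=total_steps
)
```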
In the image augmentation process, images are horizontally flipped with 50% probability, cropped to a random square (with a side between 384 and 512 pixels), and resized to 224×224. Brightness is also randomly adjusted with 50% probability, and JPEG compression quality is randomly lowered to between 75 and 100, also with 50% probability.
```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Augmentation pipeline for training
updated_transform_pipe = A.Compose([
    # Randomly flip images horizontally
    A.HorizontalFlip(p=0.5),
    # Crop a random square (side in [384, 512]), then resize to 224x224
    A.RandomSizedCrop(
        [384, 512],
        224, 224, w2h_ratio=1.0, p=1.0
    ),
    # Randomly adjust brightness and contrast
    A.RandomBrightnessContrast(brightness_limit=0.10, contrast_limit=0.10, p=0.50),
    # Randomly vary JPEG compression quality between 75 and 100
    A.ImageCompression(quality_lower=75, quality_upper=100, p=0.5),
    A.ToFloat(),
    # Normalize with ImageNet mean and standard deviation
    A.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
        max_pixel_value=1,
    ),
    # Convert images to PyTorch tensors
    ToTensorV2(),
])

# Deterministic pipeline for validation/test images
test_transformation_pipe = A.Compose([
    # Resize image to 224x224
    A.Resize(224, 224),
    A.ToFloat(),
    # Normalize with ImageNet mean and standard deviation
    A.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
        max_pixel_value=1,
    ),
    # Convert images to PyTorch tensors
    ToTensorV2(),
])
```
We then followed the same procedure as with the previous model, and the following results were achieved with a train time of 1636.625 seconds
While the score doubled, the performance is still lacking. Increasing the number of epochs would likely not help much, since the validation score stalls out after a while.
One possible improvement would be to use different augmentation techniques or parameters. The deeper issue, however, is that we need to utilize the additional information provided or use a different model than ResNet50.
So one way to make progress is to utilize the other information provided besides the images. This suggests an ensemble approach: we can combine the strengths of different kinds of models instead of being restricted to a single CNN.
So, from understanding the dataset, we can come up with three separate models:
- Vision Model
- Weather Model
- Soil Model
Here, the vision model is the same as the model we previously discussed. However, we also combine the geolocation data provided for the images with the ResNet50 features by concatenating the two, and continue training through an MLP network.
The MLP is a simple fully connected network of linear layers and ReLU activation functions. From the code below, notice that there are four hidden layers going from 121 units down to 10 units, and the last layer maps to the six traits.
```python
import timm
import torch
import torch.nn as nn
from collections import OrderedDict

class ModelVision(nn.Module):
    def __init__(self):
        super().__init__()
        # Pretrained ResNet50 backbone that emits 25 image features
        self.backbone = timm.create_model(
            'resnet50.a1_in1k',
            pretrained=True,
            num_classes=25,
        )
        # MLP head: four hidden layers from 121 units down to 10,
        # then an output layer for the six traits
        self.mlp = nn.Sequential(OrderedDict([
            ('dense1', nn.Linear(121, 100)),
            ('a1', nn.ReLU()),
            ('dense2', nn.Linear(100, 50)),
            ('a2', nn.ReLU()),
            ('dense3', nn.Linear(50, 20)),
            ('a3', nn.ReLU()),
            ('dense4', nn.Linear(20, 10)),
            ('a4', nn.ReLU()),
            ('output', nn.Linear(10, 6)),
        ]))

    def forward(self, inputs_img, inputs_numeric):
        # Extract image features, then concatenate the numeric features
        cnn_out = self.backbone(inputs_img)
        merged = torch.cat((cnn_out, inputs_numeric.float()), 1)
        return self.mlp(merged)
```
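A quick shape check with dummy inputs; the 96 numeric features are inferred from the 121-unit input layer minus the 25 backbone outputs:

```python
model = ModelVision()
dummy_img = torch.randn(2, 3, 224, 224)       # batch of two images
dummy_numeric = torch.randn(2, 96)            # 121 - 25 = 96 numeric features
print(model(dummy_img, dummy_numeric).shape)  # torch.Size([2, 6])
```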
Individually, we can gauge the performance of the vision model from the loss-epoch graph below. The train time came out to 1630.355 seconds, not much different from the previous models.
Unlike the first two ResNet50 models, the validation and training errors here track each other closely, which suggests that our model at the very least generalizes with a decent error rate.
Now, we consider the weather model. The weather model is simply an MLP with five hidden layers that takes the six weather columns as input.
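The weather model's code is not shown in the post; a minimal sketch consistent with the description could look like this, where the hidden-layer widths are assumptions:

```python
import torch.nn as nn

class TabularMLP(nn.Module):
    """Sketch of the tabular models: five hidden layers, six-trait output.
    Hidden widths here are illustrative, not the post's actual values."""
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, 10), nn.ReLU(),
            nn.Linear(10, 6),
        )

    def forward(self, x):
        return self.net(x.float())

weather_model = TabularMLP(input_dim=6)  # six weather columns
```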
We use the same Adam optimizer with a learning rate scheduler and the same train, test, and validation protocol as the previous image models. The train time was 97.205 seconds. We can see the performance below
This mirrors the vision model: at least some generalization capability is evident, since the train and validation errors are similar after a few epochs.
The last model is the soil model. This is another MLP network with five hidden layers. Its input layer differs since there are 61 soil features, but the hidden layers are designed similarly to the weather model's, as sketched below.
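Reusing the hypothetical TabularMLP sketch from the weather model, the soil model would differ only in its input width:

```python
soil_model = TabularMLP(input_dim=61)  # 61 soil features
```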
Just as before, we use the same Adam optimizer with the same learning rate scheduler. The train time was 108.738 seconds. The individual performance is shown below.
We now combine the three models discussed above by taking the average of their predictions. This makes sense, since the outputs are real numbers rather than binary labels.
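A minimal sketch of the averaging step, assuming all three models predict in the same (log-scaled) target space:

```python
import torch

@torch.no_grad()
def ensemble_predict(vision_model, weather_model, soil_model,
                     imgs, geo_feats, weather_feats, soil_feats):
    # Unweighted average of the three regressors' six-trait outputs
    preds = torch.stack([
        vision_model(imgs, geo_feats),
        weather_model(weather_feats),
        soil_model(soil_feats),
    ])
    return preds.mean(dim=0)
```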
The test data cannot be evaluated locally, since the true values are only available through the Kaggle competition. Submitting to Kaggle, we get the following:
We can see a very significant improvement over the first two models we trained. This is a great result, placing me 54th out of 116. However, the R2 score is still negative, although very close to 0, which means there is still plenty of room for improvement.
Further improvements could be made by tuning better parameters for all three models, designing a better ensemble, or trying a vision model other than ResNet50 to see if it performs better. We can, however, conclude that ensemble methods can enhance image classification or regression when given more context about the image itself.