Hello everyone! Some of you may remember me from my first publication here, in which I shared my initial experience with machine learning: I discussed how I utilized a K-means model to classify mushrooms as either poisonous or edible.
Now, in this article, I present my second machine learning project. This time, I have ventured into the world of Convolutional Neural Networks (CNNs) to develop an image classification model. The goal is to distinguish between images of anime (Eastern animation, primarily Japanese) and cartoons (Western animation).
Before delving into the technical details of my work, I’d like to share the motivation behind choosing this particular theme. Following my initial venture into machine learning, I sought to challenge myself with a project in image detection. While popular datasets like CIFAR-10 and MNIST were tempting, I yearned for something less common. Thus, I opted for the Anime vs Cartoon dataset. The prospect of working with a subject as subjective as the distinctive visual styles of animation from two culturally distant backgrounds captivated me.
The neural network, or more precisely, the artificial neural network, is a machine learning model designed to teach computers to process data toward a specific end goal. To reach this goal, the neural network draws inspiration from the human brain's interconnected system of neurons arranged in layered structures. This design promotes an adaptive learning process, leveraging mistakes and successes to iteratively improve its capabilities.
The architecture of a neural network consists of:
- Input layer: This initial layer is responsible for importing information from the external world into the model. The number of neurons in this layer depends on the dimensionality of the input data.
- Hidden layers: These layers receive data from either the input layer or another hidden layer, process the information, and then transmit it forward. The number of layers and neurons in these layers is determined by the complexity of the problem.
- Output layer: In this final layer, the model produces its output. The number of neurons in this layer depends on the number of categories in your problem.
Other important aspects of the neural network, all of which show up in the short sketch after this list:
- Activation function: The activation function of neurons serves several critical roles, such as introducing non-linearity to the network’s output, capturing specific features through different activation functions, and contributing to regularization.
- Weights: Weights play a crucial role in determining both the strength and direction of influence of each input on the network’s output. These weights are readjusted during the training process to enable the network to learn and adapt.
- Bias: The bias term allows the network to learn to compensate for systematic differences between its predictions and the actual labels. This flexibility enhances the adaptability of the learning process.
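To make these pieces concrete, here is a minimal, illustrative network in Keras (the library used later in this project). The layer sizes are arbitrary choices for demonstration, not the model built later in this post:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Toy fully connected network: 8 input features, two hidden layers,
# and an output layer with one neuron per category.
model = Sequential([
    Dense(16, activation="relu", input_shape=(8,)),  # first hidden layer, fed by the input layer
    Dense(8, activation="relu"),                     # second hidden layer
    Dense(2, activation="softmax"),                  # output layer: 2 categories
])

# Each Dense layer stores a weight matrix and a bias vector; training
# readjusts both so the network learns from its mistakes and successes.
weights, biases = model.layers[0].get_weights()
print(weights.shape, biases.shape)  # (8, 16) and (16,)
```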
The convolutional world of neural networks
The main aspect that differentiates a normal neural network and a convolutional neural network lies in the presence of a crucial component — the convolutional layer. But what exactly is a convolutional layer?
These layers are tasked with executing convolution operations on the input data of our model. In simple terms, this mathematical operation combines the input with filters (also known as kernels), ultimately producing a feature map. This unique characteristic of convolution plays a pivotal role in extracting visual patterns and establishing a feature hierarchy, especially when dealing with grid-structured data such as images.
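To see the operation itself, here is a tiny hand-rolled convolution (strictly speaking, the cross-correlation that CNN libraries actually compute) of a 4×4 image with a 2×2 vertical-edge kernel; the numbers are invented purely for illustration:

```python
import numpy as np

# A 4x4 "image": dark on the left, bright on the right.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

# A 2x2 kernel that responds to vertical edges.
kernel = np.array([[1, -1],
                   [1, -1]], dtype=float)

# Slide the kernel over every valid position and record the weighted sum.
h = image.shape[0] - kernel.shape[0] + 1
w = image.shape[1] - kernel.shape[1] + 1
feature_map = np.zeros((h, w))
for i in range(h):
    for j in range(w):
        window = image[i:i + 2, j:j + 2]          # region under the kernel
        feature_map[i, j] = np.sum(window * kernel)

print(feature_map)  # the large magnitudes mark exactly where the edge sits
```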
In this project, I employed a workflow similar to the one used in my previous article. The work unfolded through six distinct stages: fetch data, preprocessing, data validation, data segregation, model training, and model testing. These stages collectively formed a systematic approach to developing the machine learning model.
Development Tools
For the implementation, the Google Colab environment with Python3 served as the primary coding platform, providing collaborative capabilities and access to GPU resources for efficient model training. Additionally, the AI development platform WandB played a crucial role in recording and storing results, along with artifacts generated during the model creation.
Well, after all the explanations of the project, we can delve into the work itself. The data fetching stage occurred together with the choice of the project theme. To accomplish this, I researched datasets on the Kaggle website. During this exploration, I discovered something interesting for the project: the 'Anime x Cartoon' dataset. It comprises more than 8000 animation images divided between two folders, anime and cartoon, and inside each folder are subfolders for the respective animation titles.
Once the zip file housing these archives was uploaded to Google Drive, I performed the fetch data and preprocessing steps within the same code. While they are distinct steps, I opted for this integration to streamline the process.
Upon obtaining the archives from Google Drive, the subsequent step involved preprocessing these files: normalizing them and discerning between the data and their corresponding labels. For this purpose, code along the following lines was employed:
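Here is a minimal sketch of that preprocessing. The Drive path and variable names are assumptions; the RGBA conversion, the 128×128 target size, and the [0, 1] normalization follow the rest of this post:

```python
import os
import numpy as np
from PIL import Image

DATA_DIR = "/content/drive/MyDrive/anime_x_cartoon"  # assumed Drive path
CLASSES = ["anime", "cartoon"]                        # the two top-level folders

data, labels = [], []
for label, class_name in enumerate(CLASSES):
    class_dir = os.path.join(DATA_DIR, class_name)
    # each class folder contains one subfolder per animation title
    for root, _, files in os.walk(class_dir):
        for name in files:
            img = Image.open(os.path.join(root, name)).convert("RGBA")
            img = img.resize((128, 128))
            data.append(np.asarray(img, dtype=np.float32) / 255.0)  # normalize to [0, 1]
            labels.append(label)  # 0 = anime, 1 = cartoon

data = np.stack(data)
labels = np.array(labels)
np.savez("clean_data.npz", data=data, labels=labels)  # hand off to the next steps
```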
Following this step, it's necessary to store the results in Weights and Biases (wandb) for future use.
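A sketch of that logging step using the wandb artifact API; the project, artifact, and file names are my assumptions:

```python
import wandb

run = wandb.init(project="anime_vs_cartoon", job_type="preprocess")
artifact = wandb.Artifact("clean_data", type="dataset")
artifact.add_file("clean_data.npz")  # the arrays produced by preprocessing
run.log_artifact(artifact)           # versioned and stored for later stages
run.finish()
```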
After transforming our raw (unprocessed) data into clean data during the preprocessing step, it is essential to perform a verification step. The data check step ensures that the data obtained from preprocessing is well-suited for creating our model. Anticipating common errors that may be present in the data, I established eight checks to validate the appropriate format of both the images and their respective labels.
Identifying any issues at this stage mandates a return to the preprocessing phase for correction and subsequent re-verification. This iterative process ensures the integrity and quality of the data before proceeding with model development.
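As an illustration, here are four checks of the kind described, written pytest-style against the preprocessed arrays; the names and exact assertions are my assumptions, not the project's eight tests verbatim:

```python
import numpy as np

payload = np.load("clean_data.npz")
data, labels = payload["data"], payload["labels"]

def test_image_shape():
    assert data.shape[1:] == (128, 128, 4)   # every image is 128x128 RGBA

def test_pixel_range():
    assert data.min() >= 0.0 and data.max() <= 1.0  # normalization held

def test_labels_are_binary():
    assert set(np.unique(labels)) <= {0, 1}  # only anime (0) and cartoon (1)

def test_data_and_labels_align():
    assert len(data) == len(labels)          # one label per image
```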
Now that the data is appropriately prepared, we proceed to segregate it into training and testing sets. However, extra caution is necessary to prevent one label from being overrepresented in subsequent steps. To address this concern, we employ a pre-built method from scikit-learn, train_test_split, with the stratify argument set to the variable we wish to distribute evenly.
Finally, after completing all the preceding steps, our model begins its training in this phase of our process. However, before initiating the training process, we must once again split the training data, this time into a training set and a validation set. This step is essential because each epoch of training requires a distinct set of data to assess the performance of the model; the sketch below covers both splits.
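The test and validation fractions here are assumptions, while stratify keeps the anime/cartoon proportions equal in every set:

```python
from sklearn.model_selection import train_test_split

# First split: hold out a test set, stratified on the labels.
x_train, x_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.2, stratify=labels, random_state=42)

# Second split: carve a validation set out of the training data.
x_train, x_val, y_train, y_val = train_test_split(
    x_train, y_train, test_size=0.2, stratify=y_train, random_state=42)
```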
Model architecture
The architecture of the model was created by Kanak Mittal, the author of the dataset used in this project. Consequently, my role is to elucidate the functionality of each line of code and propose potential reasons for the author's decisions.
The architecture is divided into two parts; a sketch reconstructing both follows the two lists below.
Convolutional part:
- The Conv2D layer is where the convolution operations take place.
- The first Conv2D is particularly important because it is where we define the input shape of our data. In our case, the input consists of RGBA (4-channel) images with dimensions of 128×128 pixels.
- Our Conv2D layers use an increasing number of filters and kernel sizes. This choice of arguments is applied to each layer to capture a broader range of patterns from the data as it undergoes progressive simplification.
- The MaxPooling2D layer is employed to reduce the spatial resolution of the input, retaining only the maximum values within the pooling window.
- The idea behind Dropout is to randomly deactivate a fraction of neurons during training, thereby preventing overfitting.
Neural network part:
- To transition from the convolutional part to the neural network part, we need to use the Flatten layer to transform our multidimensional input into a single-dimensional format.
- Our hidden layers use the ReLU activation function f(x)=max(0,x) to introduce non-linearity. This function returns zero for negative input values and the input value itself for positive values.
- The number of neurons in each hidden layer descends to funnel the results, thereby reducing the computational load.
- The output layer utilizes 2 neurons (Anime and Cartoon) and the softmax activation function, which provides the probability distribution indicating the likelihood of the input belonging to each respective output.
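Here is the promised sketch reconstructing that architecture in Keras. Only the overall structure (RGBA 128×128 input, growing filters and kernels, pooling, dropout, funneling dense layers, two-way softmax) comes from the description above; the exact filter counts, kernel sizes, and dropout rate are assumptions:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential([
    # Convolutional part: turn pixels into a hierarchy of feature maps.
    Conv2D(32, (3, 3), activation="relu", input_shape=(128, 128, 4)),
    MaxPooling2D((2, 2)),                    # halve the spatial resolution
    Conv2D(64, (5, 5), activation="relu"),   # more filters, larger kernel
    MaxPooling2D((2, 2)),
    Dropout(0.25),                           # randomly silence neurons against overfitting

    # Neural network part: classify the flattened features.
    Flatten(),                               # multidimensional maps -> 1-D vector
    Dense(128, activation="relu"),
    Dense(64, activation="relu"),            # funnel the results down
    Dense(2, activation="softmax"),          # probabilities for anime vs cartoon
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # labels are integers 0/1
              metrics=["accuracy"])
```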
Training
To enhance the performance of our training, we use the EarlyStopping callback to halt the training when there are no further accuracy improvements, and the ModelCheckpoint callback to preserve the best model from our training. Together, these two Keras callbacks ensure we end up with the best possible model.
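A sketch of how the two callbacks plug into training; the patience value, filename, and epoch count are assumptions:

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    EarlyStopping(monitor="val_accuracy", patience=10),       # stop when accuracy stalls
    ModelCheckpoint("best_model.keras", monitor="val_accuracy",
                    save_best_only=True),                     # keep only the best epoch
]
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),  # fresh data for each epoch's check
                    epochs=100, callbacks=callbacks)
```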
Results of the training
The training achieved an accuracy of almost 86% on the validation set at the 65th epoch. The model from this epoch is the one the ModelCheckpoint callback saved before early stopping halted the run.
After completing the model training, it is primed to predict results. However, for the model to make predictions, we require a specific type of input: the same type used during the training phase, a numpy array created from an image. To streamline the process and avoid repetitive data processing each time we use the model, we leverage a convenient class from the scikit-learn library called Pipeline.
The concept behind the Pipeline is to chain the various steps involved in using our model into a single object. When invoked, this object already encapsulates all the functionality necessary for utilizing our model.
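Below is a minimal sketch of how such a pipeline could be assembled. The imageProcessor class is the one named in this post; the modelWrapper adapter and the step layout are my assumptions:

```python
import numpy as np
from PIL import Image
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class imageProcessor(BaseEstimator, TransformerMixin):
    """Turns image files into the normalized arrays the CNN expects."""
    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        arrays = [np.asarray(Image.open(p).convert("RGBA").resize((128, 128)),
                             dtype=np.float32) / 255.0 for p in X]
        return np.stack(arrays)

class modelWrapper(BaseEstimator):
    """Thin adapter so the trained Keras model can close the pipeline."""
    def __init__(self, model):
        self.model = model  # arrives already trained

    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return np.argmax(self.model.predict(X), axis=1)

pipeline = Pipeline([
    ("normalizer", imageProcessor()),  # raw files -> valid model input
    ("model", modelWrapper(model)),    # model input -> predicted class index
])
# pipeline.predict(["some_frame.png"])  -> 0 (anime) or 1 (cartoon), assumed order
```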
In the above code snippet, you'll notice the class called imageProcessor. This class is exclusively designed for data normalization and is integrated into the normalizer step of our pipeline. The normalizer ensures that every time we utilize our model, the data we aim to predict is transformed into a valid input for our model.
Following the creation of the pipeline, predicting images with our model becomes a straightforward process. Importing our pipeline and the normalization class is all that’s required.
The testing step achieved 87% accuracy across all metrics of the classification_report, very similar to the training results. This is a positive indication that our model did not overfit during training.
Well, this must have been one of my most fun projects to develop. I felt motivated to move forward with each stage, and I am very satisfied with the results obtained. Perhaps in the future I will use the results of this project in another one, but the knowledge gained alone was already worth the effort.
I owe thanks to my coworker Cláudio Henrique and my professor Mateus Arnaud Goldbarg for introducing me to the world of machine learning.
If you, dear reader, also speak Portuguese, read the article that my co-worker wrote:
Contacts
My linkedin account: linkedin.com/in/valmir-francisco-581222288
My github account: github.com/valfra0425
Special thanks to my co-worker: linkedin.com/in/claudio-henrique-8047a7266
His github account: github.com/ClauHenrique
Files
Github project: https://github.com/valfra0425/cnn_animation
Drive folder with the project: https://drive.google.com/drive/folders/1jz8VZFOrKhZXD5tLX-jBWF4KMcWTrdpf?usp=sharing
Dataset: https://www.kaggle.com/datasets/kanakmittal/anime-and-cartoon-image-classification