
In my last article, Fundamentals of Machine Learning & AI, I went over the basics of how machine learning works using a simple example. In this article we will continue from where we left off and explain how we go from a machine learning model that predicts trends in datasets to the Neural Networks & AI that seem to show signs of human-level intelligence. If you haven’t already, please go back and read that article first, as most of the language and concepts we use here were explained there. Now, let’s dive into what Neural Networks are and how they give rise to AI.
Where did we leave off last time?
Previously we went through a few mathematical steps and concepts to build a model that predicts patterns in a random dataset to the best of its ability in terms of accuracy. This process of finding the most accurate model for a dataset is what we defined as Machine Learning. That was a very simple example and not exactly the best problem to use machine learning for. For instance, a “line of best fit” for that type of problem would have been a good enough approximation, so using machine learning techniques would not have been very practical. If a complicated process only gets you 1% or 2% better accuracy, then it’s usually not worth doing in the first place. So to continue from where we left off we’ll have to use a different example to see the true potential of machine learning. In this new example, imagine we have a bunch of data on different types of buildings. Every building has a bunch of information associated with it, such as square footage, price, number of bedrooms, number of bathrooms, etc. In computer science, each building type is known as a class and all the information about a building is known as its parameters. Mathematically, the building type can be defined as y and the parameters can be defined as (x1, x2, …, xn). If you read my last article then this will look familiar to what we did with our other example, except this time we are dealing with more parameters and the data is no longer random. For this example we don’t want to just find some pattern behind a random dataset but instead be able to predict the type of building we are dealing with based on the parameters or information we are provided (by type of building I mean is it a house, a townhouse, or maybe an apartment). This type of problem is known as a classification problem. You can probably already see that machine learning techniques would be a lot more effective for a problem like this, since it would be difficult to use intuition alone to try and approximate a model, especially since we’re dealing with multiple factors and no longer just one.
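To make this setup concrete, here is a minimal sketch of how one building and its class might be represented in Python; the parameter names and values are hypothetical and only meant to illustrate the (x1, x2, …, xn) → y layout described above.

```python
# A minimal sketch of the classification setup: each building's parameters
# become a feature vector (x1, ..., xn) and its building type becomes the
# class label y. The parameter names and values here are made up.
import numpy as np

# Hypothetical parameters for one building: square footage, price, bedrooms, bathrooms
x = np.array([1450.0, 320000.0, 3.0, 2.0])   # (x1, x2, ..., xn)

# The classes (building types) we want a model to predict
CLASSES = {0: "house", 1: "townhouse", 2: "apartment"}
y = 0                                         # this particular building is a house

print(f"parameters {x} -> class '{CLASSES[y]}'")
```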
So why don’t we use the same machine learning process like we did before?
Unfortunately, it’s not that simple. For this type of problem accuracy is extremely important. When trying to classify something like a building, if the model is only 70% accurate then it’s not really useful to anyone. For a model to be useful we would need an accuracy of at least 99%; we are looking for a model that’s almost always correct in its predictions. Although this might seem extreme, there’s actually a very good reason for it. If you had a dataset of, say, a million buildings and a model that was 99% accurate, then your model would still misclassify about 10,000 buildings. As you can see, even 99% accuracy isn’t really that great. The other issue is that even if we used machine learning techniques individually on each parameter, we would find that some models would have high accuracy and others wouldn’t. In other words, some parameters have more predictive power than others. For instance, the square footage of a building is usually a lot more predictive of the type of building than something like the year it was built. We could simply ignore the parameters that don’t give us a highly accurate or predictive model and only use the ones that do. This is a form of feature engineering (specifically, feature selection). Although this is a great strategy for a lot of other types of problems, it doesn’t work well for ours. It would still not be enough to get us to the 99% accuracy we’re looking for, because we’d be eliminating all of the rare cases where those discarded parameters actually have a large impact on the type of building. Essentially, basic machine learning techniques on their own are not enough for a classification problem like this. To solve it we need to use a neural network, the underlying architecture of AI.
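To see what “predictive power” looks like in practice, here is a minimal sketch using scikit-learn and a made-up, randomly generated dataset (the parameter names and the relationship between them and the building type are invented for illustration). It fits a simple model on each parameter by itself and prints its standalone accuracy, showing that some parameters predict the class far better than others.

```python
# Score each parameter on its own to compare predictive power.
# The data here is synthetic: the "building type" is driven mostly by the
# first parameter, so that one scores well and the rest hover near chance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                                 # 4 made-up parameters per building
y = (X[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)    # type driven mostly by parameter 0

for i, name in enumerate(["sqft", "bedrooms", "bathrooms", "year_built"]):
    # Fit a simple one-parameter model and measure its cross-validated accuracy
    score = cross_val_score(LogisticRegression(), X[:, [i]], y, cv=5).mean()
    print(f"{name}: {score:.2f} accuracy on its own")
```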
What is a neural network?
For those unfamiliar, the term comes from the system our brains use every day to make our bodies and minds function. Our brains operate on approximately 86 billion cells called “neurons” that are densely connected to one another through different paths in an extraordinary network, hence the term neural network. Each of these individual neurons is responsible for processing both input and output signals that are passed on to nearby neurons through chemical and biological processes. This system is extremely complex and to this day not well understood. The important thing to take away from this is the structure, or architecture, of how our brains are built. A neural network in computer/data science is designed in a similar fashion. You start with a few neurons, or nodes, line them up, and connect them to an output. For our example, each node will be responsible for processing a single parameter in our dataset. You can think of each node as a separate machine learning process, such that each node uses the Gradient Descent and RMSE method to calculate and optimize the weights of its model through multiple iterations. With each node performing its own calculations and then combining them into a single output, we have done what we previously considered as a solution to our problem. This method is known as linear combination and, as we discussed, does not work for our example. To truly take advantage of neural networks it’s not enough to have just one layer of nodes; we need multiple layers. Not only do we need multiple layers but we also need to introduce something called an activation function. The reason for this is that no matter how many layers we add to our neural network, we will always be linearly combining the outputs before we pass them on to the next layer. As a refresher, what I mean is y = w1*x1 + b1 + w2*x2 + b2 + … + wn*xn + bn. The accuracy of a model like this will never change or improve just by adding more node layers. I don’t want to go too deep into the math behind activation functions as there are many different types of them, but the important thing to understand is that an activation function between layers means the outputs are no longer just linear. In other words, with an activation function between layers the neural network can change and adapt the model’s weights in any way the calculations deem necessary. Even starting with something as simple as linear models, through an iterative process a neural network can mold and shape the overall model into something that is no longer linear and therefore more accurately represents the trend of the dataset. I’m oversimplifying this process a bit, but in essence this is the power of using a neural network.
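Here is a minimal numpy sketch of that last point, using random placeholder weights rather than a trained model: two linear layers with no activation collapse back into a single linear layer, while putting a ReLU between them breaks that equivalence.

```python
# Why stacking purely linear layers doesn't help, and what an activation changes.
# The weights below are random placeholders, not a trained model.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)                                   # one input with 3 parameters
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)     # first layer (4 nodes)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)     # second layer (1 output)

# Two linear layers with no activation...
two_linear = W2 @ (W1 @ x + b1) + b2
# ...are equivalent to a single linear layer with combined weights:
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_linear, collapsed))      # True: still just one linear model

# Inserting a ReLU activation between the layers breaks that equivalence:
relu = lambda z: np.maximum(z, 0.0)
with_activation = W2 @ relu(W1 @ x + b1) + b2
print(np.allclose(with_activation, collapsed)) # almost always False: no longer linear
```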
So how would a neural network work with our example?
For our example, we would need to construct a similar neural network with an initial layer of nodes, each representing one of the parameters we’re provided with. Then we would connect those nodes to a second layer of nodes with an activation function between them. A simple activation function we can use is the rectified linear unit, or ReLU for short. Again, how this function works isn’t too important here; just understand that it gives the neural network the ability to produce far more adaptive and accurate models. On a side note, the number of nodes used in the second layer plays a big role in the efficiency of the neural network. Since a single machine learning process takes a lot of computing power, you can probably imagine that having multiple machine learning processes running at the same time in parallel is even more computationally intense. That is why keeping the neural network small is important. So now that we have designed a neural network to process all our parameters and calculate a predictive model, we can start the training process. Training a neural network means supplying it with large datasets so that it can learn by adjusting the weights for each of the parameters; because the training data is labeled with the correct building types, this is known as supervised learning. Just like in machine learning, through an iterative process the neural network will search for the most accurate model available without the limitation of being forced to use only a linear function. Not only are neural network models not limited to simple functions, they are also able to learn which weights for which parameters have more predictive power and adjust accordingly. After training the neural network on all our data, we can then test its performance on a validation dataset to make sure it isn’t simply overfitting the training data. Congratulations, you technically just built your first AI. Now, although we did all these steps and built a neural network, it does not necessarily mean that we would get the 99% accuracy we were initially looking for. For that to actually happen would take a lot of training and fine-tuning of the processes we went through: different network architectures, different activation functions, or even different initial functions such as sin(x) or cos(x). Not only that, but you could also introduce cross-features that combine multiple parameters in a single node. Fundamentally though, the point is that a neural network built for a classification problem is a lot more accurate than basic machine learning techniques alone. That is why neural networks are the backbone of AI: they can adapt, learn which parameters carry predictive power, and reach accuracies far beyond what a single linear model can achieve.
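As a rough illustration of the whole setup, here is a minimal sketch of such a network in PyTorch, trained on a hypothetical, randomly generated buildings dataset. The layer sizes, learning rate, and epoch count are arbitrary assumptions for demonstration rather than tuned values, and since the fake labels are random the validation accuracy will hover around chance; the structure of the code is the point.

```python
# A small classifier: one hidden layer of nodes with a ReLU activation,
# trained with supervised learning and checked on a held-out validation set.
# The dataset is fake/random, so the numbers themselves are meaningless.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Fake data: 1000 buildings, 4 parameters each, 3 building types
X = torch.randn(1000, 4)
y = torch.randint(0, 3, (1000,))
X_train, y_train = X[:800], y[:800]     # training set
X_val, y_val = X[800:], y[800:]         # validation set (held out)

model = nn.Sequential(
    nn.Linear(4, 16),    # first layer: one weight per parameter, 16 hidden nodes
    nn.ReLU(),           # activation function between the layers
    nn.Linear(16, 3),    # output layer: one score per building type
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

# Supervised training: iteratively adjust the weights to reduce the loss
for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()                      # gradients, as in gradient descent
    optimizer.step()

# Check performance on the validation set to spot overfitting
with torch.no_grad():
    val_acc = (model(X_val).argmax(dim=1) == y_val).float().mean().item()
print(f"validation accuracy: {val_acc:.2%}")
```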