Clean air is vital to human life, and since industrialization environmental pollution has grown steadily. Poor air quality causes millions of premature deaths every year, making air pollution the world's largest single environmental health risk. Monitoring and understanding air quality is therefore crucial both for public health and for addressing climate change.
Air pollution results from the combination of high emissions and unfavourable weather, and improvements in air quality can be modulated by changes in climate. The main substances responsible for air pollution are ozone (O3), nitrogen oxides (NO, NO2), carbon monoxide (CO), sulphur dioxide (SO2), and particulate matter (PM10, PM2.5). Pollution levels are rising, and PM2.5 in particular is linked to heart disease, cancer, and respiratory problems, causing millions of deaths; among the different types of air pollution, PM2.5 kills the most people worldwide. It is made up of particles smaller than approximately 2.5 microns. The first line of defence against these deaths is ambient air quality standards. Yet many of the most polluted regions, such as the Middle East, do not even measure PM2.5, and in countries like China and India the weakest air quality standards are often violated, while the strictest standards, in places like Canada and Australia, are usually met. More than half of the world urgently needs protection in the form of adequate PM2.5 ambient air quality standards, and adopting them would save lives.
Air quality forecasting based on predictive models can help protect human health and the environment during periods of high pollution, and improving air quality projections is becoming a tool in the fight against air pollution. This work explores and predicts hourly Air Quality Index (AQI) data for India, using a dataset from Kaggle; the notebook below documents the analysis.
The dataset has 36,192 rows and 5 variables plus the timestamp index column. It is essentially a univariate time series, because the target variable of the study is PM2.5, the level of fine particulate matter in the air, while the other features are derived from the timestamp and are not used in this research. The series follows a right-skewed distribution and shows outliers, high autocorrelation, and a cyclic behaviour over time. Observations are not evenly distributed over the period: the first and last years of the series contain inconsistent observations.
The year with the highest pollution is 2017 and the lowest is 2020; the most polluted month is December, with an average of 78.07 PM2.5, and the least polluted is August, with 21.52. Within the day, the most polluted hour is 18:00, while the least polluted hour is midday.
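These statistics come from simple calendar aggregations; a minimal sketch, assuming the hourly PM2.5 values are in a pandas Series `y` with a DatetimeIndex:

```python
# average PM2.5 by calendar unit, assuming a pandas Series y with a DatetimeIndex
print(y.groupby(y.index.year).mean())   # mean PM2.5 per year
print(y.groupby(y.index.month).mean())  # mean PM2.5 per month
print(y.groupby(y.index.hour).mean())   # mean PM2.5 per hour of day
```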
The decomposition of the time series shows no evidence of seasonal patterns; the trend has the same wide cyclic shape as the series in both the additive and multiplicative decomposition and does not show a clear upward or downward pattern. Pollution is high at the beginning of the year, while the lowest levels are observed during the summer. A final analysis looks at the stationarity of the series. In the rolling-window mean of the observations, increasing window sizes smooth the shape over time, which lets a model exploit behaviour seen at different time scales. Checking summary statistics such as mean and variance on partitions of the series suggests applying a log transformation to the target values, which the histogram of the pollution distribution also supports. Finally, the Augmented Dickey-Fuller test, which fits an autoregressive model, rejects the null hypothesis of a unit root: the series is stationary and has no time-dependent structure.
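As an illustration, a minimal sketch of this stationarity check with statsmodels, again assuming the series is in a pandas Series `y`:

```python
from statsmodels.tsa.stattools import adfuller

# Augmented Dickey-Fuller test: H0 = the series has a unit root (non-stationary)
adf_stat, p_value, used_lags, n_obs, crit_values, _ = adfuller(y)
print(f"ADF statistic: {adf_stat:.3f}  p-value: {p_value:.4f}")
for level, threshold in crit_values.items():
    print(f"critical value {level}: {threshold:.3f}")
# a p-value below 0.05 rejects H0: the series can be treated as stationary
```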
Before ingesting data into the models, some data preparation is necessary; in this work it consists of two activities: removing outliers and feature engineering. Outliers in a time series are anomalous observations with respect to the patterns and trends in its values; detecting them matters because they can distort the forecasting model used to predict future values. From the visualization chart, outliers are clearly visible in 2019, 2020, and 2021, with PM2.5 values greater than 150. Outliers have been detected and removed with the Hampel identifier, which replaces with the median each sample that differs from the median by more than three times the median absolute deviation. The approach pursued transforms time series forecasting into a supervised learning problem: lag features are generated with shifted values following the sliding window method, namely the PM2.5 values shifted by 1 hour, 2 hours, 3 hours, 1 day, and a week, plus the rolling mean, max, and min over one day and one week.
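A minimal sketch of both steps, assuming a pandas Series `y` of hourly values; the window size, MAD scale factor, and helper names are illustrative, not the author's exact code:

```python
import pandas as pd

def hampel_filter(series: pd.Series, window: int = 24, n_sigmas: float = 3.0) -> pd.Series:
    """Replace outliers with the rolling median (Hampel identifier)."""
    k = 1.4826  # scale factor mapping MAD to a standard deviation under normality
    rolling_median = series.rolling(window, center=True, min_periods=1).median()
    mad = (series - rolling_median).abs().rolling(window, center=True, min_periods=1).median()
    outliers = (series - rolling_median).abs() > n_sigmas * k * mad
    return series.where(~outliers, rolling_median)

def make_lag_features(series: pd.Series) -> pd.DataFrame:
    """Sliding-window lag and rolling features for the supervised formulation."""
    df = pd.DataFrame({"y": series})
    for lag, name in [(1, "lag_1h"), (2, "lag_2h"), (3, "lag_3h"), (24, "lag_1d"), (168, "lag_1w")]:
        df[name] = series.shift(lag)
    for window, suffix in [(24, "1d"), (168, "1w")]:
        rolled = series.shift(1).rolling(window)  # shift(1) avoids leaking the current value
        df[f"roll_mean_{suffix}"] = rolled.mean()
        df[f"roll_max_{suffix}"] = rolled.max()
        df[f"roll_min_{suffix}"] = rolled.min()
    return df.dropna()

clean = hampel_filter(y)             # outlier removal
features = make_lag_features(clean)  # X = features.drop(columns="y"), target = features["y"]
```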
The forecast refers to the last month of 2021. For the prediction, an econometric model, the Autoregressive model (AR), has been used as a benchmark, alongside three supervised learning models: Random Forest (RF), Explainable Boosting Machine (EBM), and a Shallow Neural Network (NN). The autoregressive model is an appropriate benchmark for this research because the time series is stationary.
The autoregressive process of order p is denoted as AR(p) and is defined as:

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \epsilon_t$$

where $\phi_1, \dots, \phi_p$ are the parameters of the model, $c$ is a constant, and $\epsilon_t$ is white noise.
The autoregressive model uses observations from previous time steps, called lag variables, as input variables in a regression equation to predict the value at the next time step. The process derives from linear regression: because the model regresses the variable on its own previous values, it is called an autoregression.
```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg
from sklearn.metrics import mean_squared_error, mean_absolute_error

# train the autoregression on the log-transformed target
AR = AutoReg(np.log1p(y_train), exog=exog_train, lags=3)
AR_model = AR.fit()

# out-of-sample prediction over the test horizon
y_hat = AR_model.predict(start=len(y_train), end=len(y_train) + len(y_test) - 1,
                         exog_oos=exog_test)

# evaluation (a single train/test split, so the ± spread is zero)
print('Test evaluation')
rmse = np.sqrt(mean_squared_error(np.log1p(y_test.values), y_hat.values))
print(f"root_mean_squared_error: {rmse:.3f} ± {rmse.std():.3f}")
mae = mean_absolute_error(np.log1p(y_test.values), y_hat.values)
print(f"mean_absolute_error: {mae:.3f} ± {mae.std():.3f}")
```
Test evaluation
root_mean_squared_error: 0.195 ± 0.000
mean_absolute_error: 0.156 ± 0.000
The first supervised learning approach is Random Forest (RF), a machine learning model that belongs to the family of tree ensembles. Ensemble methods work on the principle that a group of "weak learners" can come together to form a "strong learner": each decision tree individually is a weak learner with poor performance, while all the decision trees taken together form a strong learner with a good score.

Tree-based methods stratify the feature space into a number of simple regions defined by the explanatory variables, and the splitting rules can be summarized in a tree view, which is why these approaches are called decision trees. Decision trees are typically drawn upside down: the root node at the top contains the whole population, the splitting rules create subgroups called internal nodes, and following the decision rules downward leads to the leaf nodes, or terminal nodes, at the bottom of the tree, where no further splits occur. The idea is to divide the feature space into distinct, non-overlapping regions; every observation that falls into a region receives the same prediction, set to the region average, namely the mean of the response values of the training observations in that region. These regions could in principle have any shape, but the usual choice is to divide the feature space into high-dimensional rectangles, searching for the rectangles that minimize the loss function, the residual sum of squares. Since it is computationally infeasible to consider every possible partition of the feature space, a top-down, greedy approach known as recursive binary splitting is used. Tree-based methods are very easy to explain and can be displayed graphically, but they suffer from instability: the estimated tree structure is sensitive to the sample, and a small change in the data can cause a large change in the final estimated tree. Cross-validation and ensemble methods help address this problem, and Random Forest, introduced by Breiman in 2001, fixes the high variance of single trees.
It starts by creating multiple copies of the original training set via bootstrapping, fits a decorrelated decision tree to each bootstrapped copy, and then averages all the trees into a single predictive model. Moreover, while building these regression trees, each time a split is considered, a random selection of m predictors is drawn as split candidates from the full set of p predictors, and the split is allowed to use only one of those m predictors.
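The tree-based and neural models below are scored with an `evaluate` helper that the original snippet does not define. The following is a minimal sketch of what it might look like, assuming time-aware cross-validation with sklearn's `TimeSeriesSplit` (the `ts_cv` variable) and reporting the mean ± standard deviation across folds:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, cross_validate

ts_cv = TimeSeriesSplit(n_splits=5)  # assumed number of splits

def evaluate(model, X, y, cv):
    """Cross-validate and print train/test RMSE and MAE as mean ± std over folds."""
    results = cross_validate(model, X, y, cv=cv, return_train_score=True,
                             scoring=["neg_root_mean_squared_error",
                                      "neg_mean_absolute_error"])
    for split in ("train", "test"):
        rmse = -results[f"{split}_neg_root_mean_squared_error"]
        mae = -results[f"{split}_neg_mean_absolute_error"]
        print(f"{split.capitalize()} evaluation")
        print(f"root_mean_squared_error: {rmse.mean():.3f} ± {rmse.std():.3f}")
        print(f"mean_absolute_error: {mae.mean():.3f} ± {mae.std():.3f}")
```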
```python
from sklearn.ensemble import RandomForestRegressor

# evaluate the Random Forest with time-series cross-validation
RF = RandomForestRegressor(random_state=0)
evaluate(RF, X, np.log1p(y), cv=ts_cv)
```
Train evaluation
root_mean_squared_error: 0.031 ± 0.000
mean_absolute_error: 0.021 ± 0.000
Test evaluation
root_mean_squared_error: 0.075 ± 0.007
mean_absolute_error: 0.053 ± 0.003
Another supervised learning approach is the Explainable Boosting Machine (EBM), which combines the high performance of tree ensembles with the interpretability of Generalized Additive Models (GAMs).
The Explainable Boosting Machine extends Generalized Additive Models in two directions (formalized in the formula below):
- it fits each function fj with a combination of modern machine learning techniques such as bagging and boosting;
- it detects and fits pairwise interaction functions fij between variables.
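In formula form, the resulting model (known in the literature as GA2M) can be written as:

$$g\big(\mathbb{E}[y]\big) = \beta_0 + \sum_j f_j(x_j) + \sum_{i \neq j} f_{ij}(x_i, x_j)$$

where $g$ is the link function of the GAM and every shape function depends on at most two features, which is what preserves interpretability.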
The function associated with each predictor is generated by many small trees, trained with gradient boosting in an iterative way, one feature after another, with a low learning rate; this keeps the model additive, since each shape function uses only one feature at a time. Once a model is trained on individual features, a second pass runs the same training process on pairs of features. After these steps, all the trees produced for a single feature are applied to the training samples to build the function linked to that predictor, and likewise for pairs of features. Boosting, introduced by Jerome Friedman, is an alternative to bagging for building a tree ensemble. Bagging, like Random Forest, grows a large number of trees and combines them by averaging at the end of the process, while Gradient Boosting also combines regression trees but works sequentially, fitting each new tree to the residuals of the previous models; in this way the model can focus on the areas where the previous trees performed most poorly.
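As a toy illustration of this sequential residual fitting (a generic squared-loss sketch, not the EBM's actual cyclic per-feature training), assuming numpy arrays `X` and `y`:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees=100, learning_rate=0.1, max_depth=2):
    """Gradient boosting for squared loss: each tree fits the current residuals."""
    prediction = np.full(len(y), y.mean())  # start from the mean
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction                     # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)  # small step toward the residuals
        trees.append(tree)
    return trees, y.mean()

def boost_predict(trees, base, X, learning_rate=0.1):
    return base + learning_rate * sum(t.predict(X) for t in trees)
```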
```python
from interpret.glassbox import ExplainableBoostingRegressor

# evaluate the Explainable Boosting Machine
EBM = ExplainableBoostingRegressor(random_state=0)
evaluate(EBM, X, np.log1p(y), cv=ts_cv)
```
Train evaluation
root_mean_squared_error: 0.085 ± 0.000
mean_absolute_error: 0.060 ± 0.000
Test evaluation
root_mean_squared_error: 0.081 ± 0.006
mean_absolute_error: 0.060 ± 0.003
The last model studied is the Neural Network, which is inspired by neuroscience and formed by neurons connected in various ways. The neuron is the basic building block of a Neural Network.
Each neuron is a simple computational unit that takes weighted input signals and produces an output signal through an activation function. Neurons are stacked into networks to create complex architectures: a row of neurons is called a layer, and one network can have multiple layers. Every layer computes a linear combination of its weights with the input features, or with latent features coming from the preceding layers, and then applies an activation function, which extends the linear model. Artificial Neural Networks have an input layer with as many neurons as explanatory variables, followed by one or several hidden layers, and an output layer with as many neurons as dependent variables. These models are called feed-forward because information flows in one direction: from the input, through the hidden layers, to the output. There are two types of Neural Networks: Shallow Neural Networks, which have one hidden layer, and Deep Neural Networks, which have more than one. For this work a Shallow Neural Network with a single hidden layer has been employed.
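The scaled inputs `X_sc` and `y_sc` used in the snippet below are not constructed in the original code. A plausible sketch, assuming a `MinMaxScaler` so that the subsequent `log1p` transform stays well defined (the names and the scaler choice are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# scale features and target to [0, 1]; neural networks are sensitive to feature scale
x_scaler = MinMaxScaler()
X_sc = pd.DataFrame(x_scaler.fit_transform(X), columns=X.columns, index=X.index)
y_scaler = MinMaxScaler()
y_sc = pd.Series(y_scaler.fit_transform(y.to_frame()).ravel(), index=y.index)
```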
```python
from sklearn.neural_network import MLPRegressor

# evaluate a shallow network with one 15-neuron hidden layer
NN = MLPRegressor(hidden_layer_sizes=(15,), activation='relu', solver='adam',
                  max_iter=500, random_state=0)
evaluate(NN, X_sc, np.log1p(y_sc), cv=ts_cv)
```
Train evaluation
root_mean_squared_error: 0.022 ± 0.000
mean_absolute_error: 0.015 ± 0.000
Test evaluation
root_mean_squared_error: 0.022 ± 0.006
mean_absolute_error: 0.015 ± 0.003
Model performance has been evaluated with the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE).
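For reference, with $y_i$ the observed values, $\hat{y}_i$ the predictions, and $n$ the number of samples:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

RMSE penalizes large errors more heavily, while MAE weights all errors equally.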
From the analysis, the Neural Network appears to be the best performing model. The Autoregressive benchmark shows poor results. Random Forest and the Explainable Boosting Machine both follow the shape of the observations well, but the former overfits, its training error being far below its test error. The Neural Network is able to capture the spikes in the observations, following their shape. In all models the lag feature shifted by one hour is the most relevant for explaining the outcome, while the importance of the other features varies by model. The Partial Dependence Plot confirms the positive correlation between the "lagged 1h" feature and air pollution.
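Such a plot can be produced with sklearn's inspection module; a minimal sketch, assuming the Random Forest from above and a feature column named `lag_1h` (the column name is illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# partial dependence of the prediction on the 1-hour lag feature
RF.fit(X, np.log1p(y))  # PDP requires a fitted estimator
PartialDependenceDisplay.from_estimator(RF, X, features=["lag_1h"])
plt.show()
```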
Time series forecasting is a research field able to make a relevant contribution to the fight against air pollution. Modern machine learning models have shown a predictive potential that is helpful for forecasting air pollution. Good accuracy translates into the right strategies for monitoring and preventing air pollution, and it is also useful to public administrators for allocating resources, cutting wasteful expenses, and focusing spending on healthcare. Better prediction of air pollution offers the opportunity to save many lives from illness and to mitigate the effects of climate change. This work used a univariate time series, but with additional features it would be possible to explore the relationships between air pollution and other variables, improving pollution prevention and risk assessment. Moreover, with geographical information it would be interesting to cluster the areas that require the most effort.
References: