![](https://crypto4nerd.com/wp-content/uploads/2023/06/1eRvfLFm7HqxSrSr0U10w_A.jpeg)
Let’s take a deep dive into understanding the different steps involved in any data science project.
Hi, my name is Ayushman, and this is my first blog on Medium. I am a data science and analytics enthusiast, so I will be sharing whatever little I know in this field through my articles. Please share the articles if you like them, and feel free to give your suggestions and feedback.
In this article I will explain the complete life cycle of a typical data science project.
Nowadays, “Data Science” has become a buzzword. Since the amount of data generated every day is increasing at a staggering rate, leveraging its power can be of immense value.
“There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every two days.”
~ Eric Schmidt, Executive Chairman at Google.
The information hidden in this enormous amount of data has the power to bring about a revolution. Now that we know how valuable data can be, let us dive into the steps/stages involved in a data science project:
- Understanding the business problem
- Collecting data
- Preprocessing the data
- Selecting the relevant features
- Model building and testing
- Model deployment
Understanding the Business Problem
This is the foundational step of any project. Correctly understanding the problem statement enables us to make the right decisions as we advance towards a solution. At this stage, we need to ask the right questions: what exactly do we want to solve, and what outcome are we aiming for? Having some domain knowledge or business context can help us understand and define the problem better.
Collecting Data
Gathering data is the next step, where we collect the relevant data required to solve the business problem. Data can come from many sources, and there are broadly two ways to collect it: primary and secondary.
Primary data is the type of data that has been collected specifically for the project and has not been used in the past. Interviews, observations, surveys and questionnaires, focus groups, etc. are some primary data collection methods.
On the other hand, Secondary data refers to the data that has already been collected by someone else and is of value for our project. Hence, it is easier to collect this type of data. This type of data can be collected from the internet, government archives, etc.
To gather data, we might need to query databases, which requires a working knowledge of Structured Query Language (SQL). Apart from this, data can also be found as flat files like CSV (comma-separated values) or TSV (tab-separated values) in various repositories on the internet, like the UCI Machine Learning Repository, or on websites like Kaggle.
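For example, here is a minimal sketch of loading flat files and querying a database with pandas (the file names, table, and columns are hypothetical, and a local SQLite file stands in for a real database):

```python
import sqlite3

import pandas as pd

# Load flat files into DataFrames (hypothetical file names)
df_csv = pd.read_csv("customers.csv")
df_tsv = pd.read_csv("transactions.tsv", sep="\t")

# Query a database with SQL
conn = sqlite3.connect("sales.db")
df_orders = pd.read_sql_query("SELECT customer_id, amount FROM orders", conn)
conn.close()
```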
Another way to gather data is through web scraping. There are various tools for scraping data from websites, though some sites do not allow it, so always check a site's terms of service first. Data can also be collected using third-party APIs; sites like Facebook, Yahoo! Finance, and Twitter let users access their data this way.
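As a rough sketch, scraping with requests and BeautifulSoup might look like this (the URLs, CSS selector, and API endpoint are hypothetical):

```python
import requests
from bs4 import BeautifulSoup

# Scrape product prices from a page (hypothetical URL and selector)
response = requests.get("https://example.com/products")
soup = BeautifulSoup(response.text, "html.parser")
prices = [tag.get_text(strip=True) for tag in soup.select(".price")]

# Or fetch structured data from a (hypothetical) JSON API
api_response = requests.get("https://api.example.com/v1/quotes", params={"symbol": "AAPL"})
quotes = api_response.json()
```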
Preprocessing the Data
This is arguably the most important and time-consuming part of any data science project. How the data is processed before it is fed to the model can greatly affect the model's accuracy, so this step should be carried out with utmost caution and attention. I have divided data preprocessing into two parts: Data Cleaning and Exploratory Data Analysis (EDA). Let's dive deeper into each of these:
Data Cleaning
The data we gather can be really messy, so cleaning it becomes crucial. Handling missing values, fixing improper data types (e.g. converting dates from strings to a date type), splitting or merging columns, removing duplicates, and dealing with outliers are some of the operations involved in data cleaning. There are many ways to perform these operations in programming languages like Python or R.
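To make this concrete, here is a small pandas sketch covering a few of these operations (the file and column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("raw_data.csv")  # hypothetical input file

# Fix improper data types, e.g. dates stored as strings
df["order_date"] = pd.to_datetime(df["order_date"])

# Handle missing values: fill numeric gaps with the median, drop rows missing key fields
df["amount"] = df["amount"].fillna(df["amount"].median())
df = df.dropna(subset=["customer_id"])

# Remove exact duplicate rows
df = df.drop_duplicates()

# Split one column into two (e.g. "first last" -> two separate columns)
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)
```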
EDA
Once data cleaning is done, the next important step is to explore the data. This is when we actually dive deep into the data: calculating the mean and spread of numeric columns, checking the number of categories and their distribution for categorical columns, and so on. EDA can be done by creating different visualizations like bar charts, pie charts, scatter plots, box plots, and line charts, and it can be univariate, bivariate, or multivariate. Exploring the data can also reveal outliers, which need to be handled before feeding the data to the model; depending on the type of data and the number of data points, outliers can be handled in different ways.
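A few typical EDA one-liners might look like this (assuming the cleaned DataFrame df from above, with hypothetical column names):

```python
import matplotlib.pyplot as plt

# Summary statistics for numeric columns: mean, spread, quartiles
print(df.describe())

# Distribution of a categorical column
print(df["segment"].value_counts(normalize=True))

# Univariate view: a box plot makes outliers in a numeric column visible
df["amount"].plot(kind="box")
plt.show()

# Bivariate view: scatter plot of two numeric columns
df.plot(kind="scatter", x="amount", y="discount")
plt.show()
```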
Selecting the relevant features
Features are basically the independent variables used to predict the output. A dataset might contain a lot of features, but not all of them are useful for making predictions; only certain relevant features are required for robust predictions. Our goal should always be a generalized model with low bias and low variance, i.e., a model that has low error on both the training and testing data. While building a model, adding more features improves accuracy only up to a point; beyond that, performance starts to degrade because the data becomes sparse relative to the number of dimensions, a consequence of the curse of dimensionality. Moreover, having many uninformative features also increases training time. Thus, a careful selection of features is important.
There are broadly three methods to select features:
1. Filter Methods
These methods are quick and computationally cheap, and can be used even with a large number of features. Filter methods select features using statistical measures like ANOVA, chi-squared, Fisher score, mutual information gain, variance, and correlation.
Many features are often correlated with each other (this is known as multicollinearity), and keeping all of them can hurt the accuracy of the model. The feature with the highest correlation with the output should be kept, and the other correlated features dropped. Sometimes individual features might not correlate strongly with the target variable, but certain combinations of features can still be useful for making predictions.
These methods are simple and suited to quick screening, and hence are widely used.
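As an illustrative sketch, scikit-learn's SelectKBest covers the statistical scoring, and a simple correlation filter handles multicollinearity (the dataset, k, and the 0.95 threshold here are just for demonstration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Keep the 10 features with the highest mutual information with the target
selector = SelectKBest(score_func=mutual_info_classif, k=10).fit(X, y)
print(X.columns[selector.get_support()])

# Drop one feature from each highly correlated pair
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)
```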
2. Wrapper Methods
These methods are computationally more expensive than filter methods and are not recommended for a large number of features; a large feature set should first be screened down using filter methods. Unlike filter methods, wrapper methods use a machine learning algorithm to select the best subset of features for the model and thus tend to perform better. There are mainly three types of wrapper methods (a short scikit-learn sketch follows the list):
- Forward Step Selection — In this technique, the best performing feature is identified. In the next step, this feature is used with other features and the combination giving the best accuracy is chosen. This process is repeated in the subsequent steps to select the best subset of features.
- Recursive Feature Elimination (Backward Step Selection) — This process is the opposite of the above. First, all the variables are used and the least useful features are eliminated in the subsequent steps to select the best performing subset.
- Exhaustive Feature Selection — This is the most computationally expensive method of all and takes a lot of time to execute. It evaluates every possible combination of features, which means that if there are n features, the model will be trained 2^n − 1 times (once for every non-empty subset) to select the best one.
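Here is a minimal sketch of the backward and forward variants using scikit-learn (the dataset and the target of 10 features are just for demonstration; SequentialFeatureSelector requires scikit-learn ≥ 0.24, and the exhaustive variant is omitted because of its cost):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = LogisticRegression(max_iter=5000)

# Backward: recursively drop the weakest feature until 10 remain
rfe = RFE(estimator=model, n_features_to_select=10).fit(X, y)
print(X.columns[rfe.support_])

# Forward: greedily add the feature that improves the cross-validated score most
sfs = SequentialFeatureSelector(model, n_features_to_select=10, direction="forward").fit(X, y)
print(X.columns[sfs.get_support()])
```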
3. Embedded Methods
These methods are called embedded because they perform feature selection during the training of the machine learning algorithm itself, and they are faster than wrapper methods. L1 regularization (Lasso regression) and tree-based methods such as Extra Trees are common examples.
Regularization penalizes large coefficients and thus downweights less important features. Note that L1 regularization can shrink the coefficients of unimportant features to exactly zero, effectively removing them, whereas L2 regularization (Ridge regression) only shrinks coefficients without zeroing them out, so it is less suited to feature selection on its own. Tree-based models expose feature importances that can be used for selection.
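For instance, here is a sketch of L1-based selection with scikit-learn's SelectFromModel (the dataset and the regularization strength C are just for demonstration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_scaled = StandardScaler().fit_transform(X)  # L1 models are sensitive to feature scale

# The L1 penalty drives the coefficients of unhelpful features to exactly zero
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_model).fit(X_scaled, y)
print(X.columns[selector.get_support()])
```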
Model Building and Testing
Model building is the fun and exciting part, where a suitable model is selected and the clean data is fed in to train it. Depending on the model chosen, we might need to perform feature scaling, which brings the values of all features into a similar range. Gradient-descent-based algorithms (like linear regression and neural networks) and distance-based algorithms (like KNN and k-means clustering) need feature scaling to perform well, whereas tree-based algorithms like decision trees and random forests are not affected by the range of the features. Common feature scaling methods include standardization and normalization.
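A small sketch of both scaling approaches with scikit-learn (X_train and X_test are assumed to exist; note that the scaler is fit on the training data only and then reused on the test data to avoid leakage):

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Standardization: zero mean, unit variance per feature
scaler = StandardScaler().fit(X_train)   # fit on training data only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)    # reuse the training statistics

# Normalization (min-max): rescale each feature to the [0, 1] range
X_train_norm = MinMaxScaler().fit_transform(X_train)
```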
Now, in some cases we might split the data into training and testing sets or might have completely separate data as our test data. Any model that we choose for the project has some hyperparameters that are tuned to make the model perform optimally. After the model has been trained, it is tested against the test dataset. There are a bunch of metrics to check the performance of the model like Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), etc. for regression models and Accuracy, Precision, etc. for classification models.
After these scores are calculated, the hyperparameters are tuned accordingly to get better accuracy. Model building and testing is basically an iterative process which involves tuning and testing of the model until we get the highest performance. The goal is to create a simple and generalized model that performs well for both training and testing data (in data science lingo, a model that has low bias and low variance).
Hence, it is extremely important to know the hyperparameters of different models so that they can be tweaked to get the desired results. Utilities in scikit-learn like RandomizedSearchCV and GridSearchCV make it easy to select good hyperparameter values: they train the model using different combinations of hyperparameter values supplied by the user and return the combination that fetches the best results. This step can take a lot of time depending on the size of the data, the number of features, and the number of hyperparameters to be fine-tuned.
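Putting the split, tuning, and evaluation together, a minimal scikit-learn sketch might look like this (the dataset and parameter grid are just for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Try combinations of hyperparameter values via cross-validation
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)

# Evaluate the tuned model on the held-out test data
y_pred = search.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
```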
Model Deployment
After the model has been trained and we are satisfied with its performance, we come to the last stage of the project. This is the step where we make the project presentable and fit to be used by the client. In an industry-level project, before actually starting the deployment, there may be discussions with clients about where they want the model deployed, the formatting, the functionality, and so on. All the code runs in the backend, and users interact through an interface designed for them. Data science projects can be deployed on cloud platforms like AWS, Microsoft Azure, or Google Cloud Platform (GCP), and serving the model behind a web framework like Flask or Django is also very common.
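As an illustrative sketch, a minimal Flask prediction endpoint could look like this (the model.pkl file and the expected feature format are hypothetical):

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load a model serialized earlier in the project (hypothetical file)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [5.1, 3.5, 1.4, 0.2]}
    features = request.get_json()["features"]
    prediction = model.predict([features])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```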
After proper deployment, the model is ready to be used, but proper and timely maintenance is also necessary. Various factors, such as changes in the incoming data over time (data drift), can degrade the model's accuracy in the future. So regularly checking the model's performance, updating the data source, and re-tuning hyperparameters become really important.
These were broadly the steps that are involved in a typical data science project. There could be a few more little steps that may be required depending on the level and complexity of the project.
Follow for more amazing articles on data science!