Feature Engineering

Rupesh
Sep 7, 2019

Figuring out the quality and quantity of features in your data

The success of any machine learning algorithm ultimately depends on how you present the data.

This is a continuation of my previous post. In this post, we are going to deal with the theory behind feature engineering.

Features in data refer to columns. Feature engineering is the process of transforming the given data into a new form that is easier to interpret and understand. It is an important phase in the pipeline of building an ML model, since it often makes the difference between a good and a bad model. Let's delve deeper into the impact it has on model performance.

Handling Missing (NaN) Values

Before analysing the data, check whether any features contain missing values. If you start the analysis and only later discover that some values are missing, you have to handle them and restart the analysis from scratch. Try to figure out how many values are missing and how many features are affected.

Figure out the answer: why is the data missing?

This is the important question to keep in mind while looking into the data: is the value missing because it does not exist, or because it was not documented?

The answer to this question determines which techniques you should use to handle the missing data.

Source : https://towardsdatascience.com/gans-and-missing-data-imputation-815a0cbc4ece

Consider a scenario where you have data about a class of 100 students, including height, weight and age. Say the height in the 5th record is missing because an error occurred during documentation. In such cases we can fill the gap in several ways: replace the NaN with the mean of the column; build an ML model with the other factors as independent features and height as the dependent feature, so we can predict the missing value; or, if similar records (same weight and age) exist with a height value, reuse those values. We will discuss this briefly in the next part.

For another scenario, consider movie rating data. Rating data generally holds a large number of NaN values, because nobody rates every product they use. This is a case where the data simply does not exist. If you apply the methods mentioned above here, they will distort the model and the story the data tells. When I faced this issue with movie ratings, I replaced NaN with zero: since the rating range is 1 to 10, there is no such thing as a zero rating, so a zero sensibly implies that the user has not watched the movie.
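As a minimal sketch of this zero-fill idea, assuming a small hypothetical user-by-movie ratings table (the column names are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical ratings matrix: rows are users, columns are movies (1-10 scale).
ratings = pd.DataFrame({
    "movie_a": [8, np.nan, 6],
    "movie_b": [np.nan, 9, np.nan],
})

# A real rating can never be 0, so 0 becomes an explicit "not watched" marker.
filled = ratings.fillna(0)
```

The key design choice is that zero lies outside the valid rating range, so the model can distinguish "not watched" from any genuine rating.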

Now let's discuss a few techniques used to handle missing values.

Drop the Features

If most of the values in a feature are NaN, it is better to drop that column from the dataset. The same applies to records (rows) that hold multiple NaN values. If we try to fill in that many values, we may distort the nature of the entire dataset. Apply this technique only when you have a large amount of data to spare.
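A short sketch of this, assuming a toy dataset where one column (a made-up `blood_group` field) is mostly missing:

```python
import pandas as pd
import numpy as np

# Toy dataset (assumed for illustration): one column is mostly NaN.
df = pd.DataFrame({
    "height": [170, 165, np.nan, 180, 175],
    "blood_group": [np.nan, np.nan, np.nan, "O+", np.nan],  # 80% missing
})

# Drop columns where more than half of the values are missing
# (thresh = minimum number of non-NaN values required to keep a column).
df_cols = df.dropna(axis=1, thresh=int(len(df) / 2) + 1)

# Drop rows where every value is missing.
df_rows = df.dropna(axis=0, how="all")
```

The `thresh` cutoff here is an arbitrary choice for the sketch; in practice you would pick a threshold that fits your data and how much you can afford to lose.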

Replace with values

Here, "values" means either the most frequent value in the column (the mode) or the column mean; try one at a time. For a small dataset, deleting a column or row may lead to loss of information, so use this method instead. If you have many NaN values this method is not recommended, as it may lead to imbalanced data.
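A minimal sketch of both replacements, assuming a tiny made-up dataset with one numeric and one categorical column:

```python
import pandas as pd
import numpy as np

# Small toy dataset (assumed): numeric and categorical columns with gaps.
df = pd.DataFrame({
    "age": [20.0, np.nan, 22.0, 24.0],
    "grade": ["A", "B", np.nan, "B"],
})

# Numeric column: replace NaN with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical column: replace NaN with the most frequent value (mode).
df["grade"] = df["grade"].fillna(df["grade"].mode()[0])
```

Mean fill suits numeric features; mode fill suits categorical ones, since a mean has no meaning for categories.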

Predict Missing Values

Consider a training set in which one column holds some NaN values while the rest do not. Build a model from the rest of the data, treating the complete columns as independent features, and predict the missing values. With a classifier this technique fills in missing categorical data; for a numeric feature, a regressor plays the same role. In some cases we can implement unsupervised models instead of a supervised classifier, which is also an option.
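To make the idea concrete, here is a deliberately tiny stand-in for such a model: a hand-rolled 1-nearest-neighbour lookup over hypothetical `[weight, age]` records, predicting a missing categorical label. A real pipeline would use a proper classifier (for example from scikit-learn) instead of this sketch.

```python
import numpy as np

# Toy data (assumed): columns are [weight, age]; the labels are a categorical
# feature with one value missing that we want to predict.
X_known = np.array([[60.0, 20.0], [80.0, 25.0], [62.0, 21.0]])
y_known = np.array(["F", "M", "F"])
x_missing = np.array([61.0, 20.5])  # record whose label is NaN

# Minimal 1-nearest-neighbour "model": the missing value takes the label
# of the most similar complete record.
distances = np.linalg.norm(X_known - x_missing, axis=1)
predicted = y_known[np.argmin(distances)]
```

This mirrors the student example above: a record with similar weight and age lends its value to the incomplete one.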

Another important part of feature engineering is,

Handling Imbalanced Data

Imbalance means the data in the dataset is biased toward one specific group or target class. It exists in most real-world problems. The reason for imbalance is that minority-class events do not occur at regular intervals. The interesting part is that the ML model needs to predict that minority class accurately. Consider a case where 95% of the data belongs to class A and the rest to class B: if the model says everything is class A, its accuracy is 95%. For such a prediction we don't need an ML model at all; we could label everything with an automation script. This is where data handling comes in. In general, there are a few techniques:

Over Sampling

This technique increases the number of minority-class records by replicating them until the class ratio matches the majority class. But it is just duplication of data, which is questionable: in data cleaning we remove duplicate records, yet here we add them. Since the duplicates carry no new information, they may not bring any real change to the model during training.

Source:http://www.treselle.com/blog/handle-class-imbalance-data-with-r/
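A minimal sketch of random over-sampling, assuming the 95/5 class split from the example above (labels and counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced dataset: 95 records of class "A", 5 of class "B".
X = np.arange(100).reshape(-1, 1)
y = np.array(["A"] * 95 + ["B"] * 5)

# Random over-sampling: duplicate minority rows (sampling with replacement)
# until both classes have the same count.
minority_idx = np.where(y == "B")[0]
extra = rng.choice(minority_idx, size=95 - 5, replace=True)

X_res = np.vstack([X, X[extra]])
y_res = np.concatenate([y, y[extra]])
```

After resampling both classes have 95 records, but every added minority row is an exact copy of an existing one.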

Under Sampling

Source : http://www.treselle.com/blog/handle-class-imbalance-data-with-r/

This is the process of balancing the distribution of records across classes by eliminating records from the majority class. The drawback is that it leads to data loss and information loss, which will affect model performance.
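The mirror-image sketch for under-sampling, on the same assumed 95/5 split:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical imbalance: 95 "A" records, 5 "B" records.
X = np.arange(100).reshape(-1, 1)
y = np.array(["A"] * 95 + ["B"] * 5)

# Random under-sampling: keep only as many majority rows as the minority has.
majority_idx = np.where(y == "A")[0]
keep = rng.choice(majority_idx, size=5, replace=False)
selected = np.sort(np.concatenate([keep, np.where(y == "B")[0]]))

X_res, y_res = X[selected], y[selected]
```

The balanced result has only 10 records; the 90 discarded majority rows are the information loss the text warns about.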

SMOTE => Synthetic Minority Over-sampling Technique

This is an over-sampling technique that increases the count of the minority class without duplicating it; it generates new data based on existing data. To picture the generation, plot your data in n-dimensional space: SMOTE takes a minority point and one of its nearest minority neighbours, and places a new point along the line segment between them, as seen in the figure.

Source:https://www.arxiv-vanity.com/papers/1711.00837/#id7

The important caveat is that each newly generated point is assumed to belong to the minority class, and in some cases that assumption may be wrong: these are not real observations, just synthetic data. Do a proper analysis of the newly generated data, because it will be included in the training set.
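The interpolation step can be sketched as follows. This is a simplified stand-in, not the full algorithm: real SMOTE (as in the imbalanced-learn library) samples among the k nearest neighbours, while this sketch always uses the single nearest minority neighbour. The points are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 2-D minority-class points.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]])

def smote_sketch(points, n_new, rng):
    """Generate synthetic points by interpolating between a minority point
    and its nearest minority neighbour (a simplified SMOTE)."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        # Nearest neighbour of point i (excluding itself).
        d = np.linalg.norm(points - points[i], axis=1)
        d[i] = np.inf
        j = int(np.argmin(d))
        gap = rng.random()  # random position along the segment
        synthetic.append(points[i] + gap * (points[j] - points[i]))
    return np.array(synthetic)

new_points = smote_sketch(minority, n_new=3, rng=rng)
```

Because each synthetic point lies between two real minority points, it stays inside the region the minority class already occupies, which is exactly why the class-membership assumption usually (but not always) holds.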

Ensemble Techniques

This is an alternative approach to the same problem. Until now we discussed how to rebalance the data itself; here we instead build multiple ML models on the same training data and let them vote. For example, we can build KNN, SVM, Naive Bayes and Logistic Regression classifiers on the same data and let each of them predict a new sample. The class that receives the most votes among all the classifiers is taken as the predicted class.

Source : http://rasbt.github.io/mlxtend/user_guide/classifier/EnsembleVoteClassifier/
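The voting step itself reduces to a majority count. Here is a minimal sketch with made-up predictions standing in for the four classifiers mentioned above:

```python
from collections import Counter

# Hypothetical predictions from four classifiers (e.g. KNN, SVM,
# Naive Bayes, Logistic Regression) for one new sample.
votes = ["A", "B", "A", "A"]

# Majority vote: the class predicted by most models wins.
predicted_class = Counter(votes).most_common(1)[0][0]
```

In practice a library implementation such as scikit-learn's `VotingClassifier` handles this (including tie-breaking) for you; the sketch only shows the hard-voting idea.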

In my next post I will include the coding part for all the concepts we discussed here, covering both Python and R.

I would be very happy to receive your queries and comments. Kindly post in the comments section. Do follow me for more ML/DL blogs.
