Hi,
I have a doubt: suppose we have a dataset in which each feature has a wide range of numerical values. If we standardize or normalize only one or two features/columns, do we then have to do the same for all features?
Also, which models need the data to be pre-processed with these techniques before modelling?
I think whether a column needs standardization or normalization depends on which columns you decide to use in your model. Let's say you have 15 columns but use only 2-3 of them for prediction; then there is no point in normalizing or standardizing all the columns.
If I understand the second part correctly, I believe this data cleaning / pre-processing is independent of the model you use. The pre-processing is needed to understand your data better and to get more accurate predictions.
Variables that are measured at different scales do not contribute equally to the analysis and might end up creating a bias.
For example, a variable that ranges between 0 and 1000 will outweigh a variable that ranges between 0 and 1. Used without scaling, the variable with the larger range can carry up to 1000 times more weight in the analysis. Transforming the data to comparable scales prevents this problem.
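To make this concrete, here is a small sketch in plain Python. The data points and the helper names `euclidean` and `min_max_scale` are made up for illustration, and the scaling helper assumes no column is constant. It shows the nearest neighbour of a point changing once both features are put on the same scale:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_scale(rows):
    """Rescale every column of `rows` to [0, 1] (assumes no constant column)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

# Made-up data: feature 0 spans 0..1000, feature 1 spans 0..1.
rows = [
    [0.0, 0.0],
    [1000.0, 1.0],
    [500.0, 0.0],  # p
    [510.0, 1.0],  # q: close to p on feature 0, opposite extreme on feature 1
    [600.0, 0.0],  # r: further from p on feature 0, identical on feature 1
]
p, q, r = rows[2], rows[3], rows[4]

# On the raw values, feature 0 dominates the distance, so q looks nearer to p.
raw_q_is_nearer = euclidean(p, q) < euclidean(p, r)

# After scaling, both features contribute comparably, and r becomes nearer.
scaled = min_max_scale(rows)
sp, sq, sr = scaled[2], scaled[3], scaled[4]
scaled_r_is_nearer = euclidean(sp, sr) < euclidean(sp, sq)

print(raw_q_is_nearer, scaled_r_is_nearer)
```

On the raw data the difference in the 0 to 1 feature barely registers; after scaling it counts as much as the other feature.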
Therefore we need standardisation or normalisation.
Normalization is a good technique to use when you do not know the distribution of your data or when you know the distribution is not Gaussian.
Standardization assumes that your data has a Gaussian (bell curve) distribution. This does not strictly have to be true, but the technique is more effective if your attribute distribution is Gaussian.
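As a rough sketch of the two techniques, assuming "normalization" here means min-max rescaling to [0, 1] and "standardization" means the z-score (subtract the mean, divide by the standard deviation): the function names and sample values below are only illustrative, and in practice a library such as scikit-learn (`MinMaxScaler`, `StandardScaler`) would usually be used instead.

```python
import statistics

def normalize(values):
    """Min-max normalization: x' = (x - min) / (max - min), result in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score standardization: z = (x - mean) / std, result has mean 0, std 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

# Illustrative values only (mean 5, population standard deviation 2).
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

normalized = normalize(values)
standardized = standardize(values)

print(min(normalized), max(normalized))
print(statistics.mean(standardized), statistics.pstdev(standardized))
```

Normalization fixes the range of the output, while standardization fixes its mean and spread; a single extreme outlier compresses all the other normalized values, which is one reason to prefer standardization when outliers are present.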
Any machine learning algorithm that computes the distance between data points needs feature scaling (standardization or normalization). Models that are fitted with gradient-based optimization or regularization, such as logistic regression, also tend to benefit from it.
Example:
- KNN (K Nearest Neighbors)
- SVM (Support Vector Machine)
- Logistic Regression
- K-Means Clustering
Algorithms that are used for matrix factorization, decomposition and dimensionality reduction also require feature scaling.
Example:
- PCA (Principal Component Analysis)
- SVD (Singular Value Decomposition)
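Here is a small sketch of why PCA is sensitive to scale, using made-up two-feature data and the closed-form direction of the first principal component of a 2x2 covariance matrix. The helper names are hypothetical; a real analysis would typically use a library implementation of PCA.

```python
import math
import statistics

def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def first_component_angle(xs, ys):
    """Angle in radians between the x-axis and the first principal component."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    n = len(xs)
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    # Major axis of a 2x2 covariance matrix: tan(2 * theta) = 2 * cov / (var_x - var_y)
    return 0.5 * math.atan2(2 * cov_xy, var_x - var_y)

# Made-up, positively correlated features on very different scales.
xs = [100.0, 300.0, 500.0, 700.0, 900.0]  # roughly 0..1000
ys = [0.1, 0.4, 0.35, 0.8, 0.9]           # roughly 0..1

raw_angle = first_component_angle(xs, ys)
scaled_angle = first_component_angle(standardize(xs), standardize(ys))

print(math.degrees(raw_angle), math.degrees(scaled_angle))
```

On the raw data the first component points almost exactly along the large-scale feature (an angle close to 0 degrees), so the second feature is effectively ignored. After standardization both features have unit variance and the component sits at about 45 degrees, weighting them equally.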
@datakatha Thank you so much. I understood the point.
@asha_pareek Thank you for your answer! I now have an idea about these techniques.