Tuesday, November 9, 2021

K Nearest Neighbors (KNN) Intuition

 


K Nearest Neighbor (KNN)


In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
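To make the averaging idea concrete, here is a minimal NumPy sketch; the function name, the toy data, and the choice of Euclidean distance are all illustrative assumptions, not part of the original post.

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    # Euclidean distance from the query point to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # The prediction is simply the mean of their target values
    return y_train[nearest].mean()

# Toy data: y is roughly 2 * x
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
print(knn_regress(X_train, y_train, np.array([3.5]), k=2))  # ~7.1
```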


What is KNN?


Machine learning models use a set of input values to predict output values.

KNN is one of the simplest machine learning algorithms and is mostly used for classification. It classifies a data point based on how its neighbors are classified.


Introduction:


The K Nearest Neighbor algorithm falls under the Supervised Learning category and is used for classification (most commonly) and regression.

It is a versatile algorithm, also used for imputing missing values and resampling datasets. As the name (K Nearest Neighbor) suggests, it considers the K nearest neighbors (data points) to predict the class or continuous value for a new data point.
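For orientation, here is a minimal classification sketch using scikit-learn's KNeighborsClassifier; the synthetic dataset, split, and K = 5 are illustrative assumptions, not choices from the post.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic toy data, purely for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)  # K = 5 neighbors
knn.fit(X_train, y_train)                  # "training" just stores the data
print("Test accuracy:", knn.score(X_test, y_test))
```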

How to choose the value for K?

Using error curves: plot the training and test error for different values of K. At low K values the model overfits the data (high variance), so the test error is high while the training error is low. At K=1 the training error is always zero, because the nearest neighbor of each training point is the point itself. So although the training error is low, the test error is high at small K values; this is overfitting. As we increase K, the test error is reduced. But beyond a certain K value, bias (underfitting) is introduced and the test error rises again. In short, the test error starts high (due to variance), then drops and stabilizes, and with a further increase in K it rises again (due to bias). The K value at which the test error stabilizes and is low is the optimal value. From the error curve in this example we can choose K=8 for our KNN algorithm implementation.
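A quick way to produce those numbers yourself is to sweep K and compare train and test error; a sketch follows, where the breast-cancer dataset and the split are placeholder assumptions, so the best K will differ on your data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_err = 1 - knn.score(X_train, y_train)  # zero at K=1 by construction
    test_err = 1 - knn.score(X_test, y_test)
    print(f"K={k:2d}  train error={train_err:.3f}  test error={test_err:.3f}")

# Pick the K where the test error bottoms out and stays stable.
```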



Usage of KNN


The KNN algorithm can compete with the most accurate models because it makes highly accurate predictions. Therefore, you can use the KNN algorithm for applications that require high accuracy but do not require a human-readable model.


The quality of the predictions depends on the distance measure. Therefore, the KNN algorithm is suitable for applications where sufficient domain knowledge is available to support the selection of an appropriate measure.
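In scikit-learn the distance measure is just a constructor argument, so domain knowledge translates directly into a parameter choice; a sketch, where the specific metrics shown are examples rather than recommendations.

```python
from sklearn.neighbors import KNeighborsClassifier

# Three classifiers that differ only in their distance measure
knn_euclid = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn_manhat = KNeighborsClassifier(n_neighbors=5, metric="manhattan")
knn_minkow = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=3)
# Each is then fit and scored exactly like any other estimator.
```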


The KNN algorithm is a type of lazy learning, where the computation needed to generate predictions is deferred until classification time. Although this increases the cost of prediction compared to other algorithms, KNN remains a good choice for applications where predictions are not requested frequently but accuracy is important.


The first and most important consideration is whether we should use the KNN algorithm at all, since many other algorithms can be used for classification. The main advantage of KNN over other algorithms is that KNN can be used for multiclass classification. Therefore, if the data consists of more than two labels, or in simple words if you are required to classify the data into more than two categories, then KNN can be a suitable algorithm, as the sketch below shows.
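Multiclass classification needs no special handling in KNN; a minimal sketch on the three-class iris dataset (the dataset and the cross-validation setup are our illustrative choices, though K=8 echoes the value picked from the error curve above).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # three species -> three class labels
scores = cross_val_score(KNeighborsClassifier(n_neighbors=8), X, y, cv=5)
print("Mean CV accuracy:", scores.mean())
```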


When Data is labelled:


What do we mean when we say data is labelled? It means we already know the outcomes for the particular dataset under analysis, and based on this we are trying to make our model learn how to classify future unknown data.


When Data is noise free:

Noisy data is data with a large amount of additional meaningless information in it. This includes data corruption, and the term is often used as a synonym for corrupt data. It also includes any data that a user system cannot understand or interpret correctly. In the data science space, noise means unwanted data items, features, or records which do not help in explaining the feature itself or the relationship between feature and target. Noise often causes algorithms to miss patterns in the data.


Noise in tabular data can be of three types:


§ Anomalies in certain data items (Noise 1: certain anomalies in features & target)


§ Features that don’t help in explaining the target (Noise 2: irrelevant/weak features)


§ Records which don’t follow the form or relation that the rest of the records do (Noise 3: noisy records)


Note: a feature is an individual measurable property or characteristic of a phenomenon being observed; features are the data attributes you use to train your machine learning models. In future posts I will write on feature selection techniques, noise identification, and dealing with noisy data in machine learning.

 

 When Dataset is small:

If the dataset is too big, KNN may underperform, because KNN is a lazy learner and does not learn a discriminative function from the training set.


Limitations of the KNN algorithm:


Because the KNN algorithm internally calculates distances between points, the time it takes for classification can be longer than for other algorithms in certain cases. It is advised to use the KNN algorithm for multiclass classification only if the number of samples is less than 50,000. Another limitation is that feature importance is not available for the KNN algorithm: there is no well-defined, easy way to compute which features are responsible for a classification.


Required Data Preparation:

Data Scaling: to locate a data point in multidimensional feature space, it helps if all features are on the same scale. Hence normalization or standardization of the data is advised.
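A pipeline keeps the scaling and the KNN model together, so the same transform learned at fit time is applied at predict time; a minimal sketch, assuming scikit-learn (the X_train/X_test names in the comment are placeholders).

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Standardize features, then classify by nearest neighbors
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# model.fit(X_train, y_train); model.predict(X_test)
```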


Dimensionality Reduction: KNN may not work well if there are too many features. Hence dimensionality reduction techniques such as feature selection or principal component analysis can be applied.
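PCA slots into the same pipeline pattern; a sketch only, where n_components=10 is an arbitrary illustrative value to be tuned for your data.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Scale, project onto the top principal components, then run KNN
model = make_pipeline(StandardScaler(),
                      PCA(n_components=10),
                      KNeighborsClassifier(n_neighbors=5))
```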


Missing value treatment: if, out of M features, one feature's value is missing for a particular example in the training set, we cannot locate that point or calculate distances from it. Therefore deleting that row or imputing the missing value is required.
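Two common imputation options in scikit-learn are mean imputation and KNN-based imputation (which, fittingly, fills gaps from the nearest complete rows); the tiny matrix below is a made-up example.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],   # missing value in the first feature
              [7.0, 6.0]])

# Replace the NaN with the column mean
print(SimpleImputer(strategy="mean").fit_transform(X))
# Replace the NaN with the average of the 2 nearest complete rows
print(KNNImputer(n_neighbors=2).fit_transform(X))
```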
