Wednesday, November 10, 2021

Bagging in Machine Learning

 What Is Ensemble Learning?


* Machine Learning uses several techniques to build models and improve their performance.

* Ensemble learning methods help improve the accuracy of classification and regression models.

* Ensemble learning is a widely used and preferred machine learning technique in which multiple individual models, often called base models, are combined to produce an effective optimal prediction model.

* The Random Forest algorithm is an example of ensemble learning.



What Is Bagging in Machine Learning?


* Bagging, also known as bootstrap aggregating, is an ensemble learning technique that helps to improve the performance and accuracy of machine learning algorithms.

* It is used to deal with bias-variance trade-offs and reduces the variance of a prediction model. 

* Bagging avoids overfitting of data and is used for both regression and classification models, specifically for decision tree algorithms.

What Is Bootstrapping?

* Bootstrapping is the method of randomly creating samples of data out of a population with replacement to estimate a population parameter.
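
A minimal sketch of bootstrapping with NumPy; the small population array and the random seed are purely illustrative.

```python
import numpy as np

# Illustrative "population"; any 1-D array of observations works here.
population = np.array([4, 8, 15, 16, 23, 42])
rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

bootstrap_means = []
for _ in range(1000):
    # Sampling WITH replacement is what makes this a bootstrap sample.
    sample = rng.choice(population, size=population.size, replace=True)
    bootstrap_means.append(sample.mean())

print("Population mean:", population.mean())
print("Bootstrap estimate of the mean:", np.mean(bootstrap_means))
```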


Steps to Perform Bagging

* Consider there are n observations and m features in the training set. 

* A random sample is selected from the training dataset with replacement (a bootstrap sample)

* A random subset of the m features is chosen to build a model on the sampled observations

* The feature offering the best split out of the lot is used to split the nodes

* The tree is grown using these best splits, so you end up with the best root node

* The above steps are repeated to build each tree in the ensemble.

* The outputs of the individual decision trees are aggregated to give the final prediction, as in the sketch below
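
As a rough illustration of these steps, here is a minimal scikit-learn sketch of bagging decision trees; the breast-cancer dataset and the parameter values are just illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Illustrative dataset; any classification data would do.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each base model (a decision tree by default) is trained on a bootstrap
# sample of the rows; the final prediction aggregates the trees' outputs.
bagging = BaggingClassifier(
    n_estimators=100,   # number of bootstrap samples / base models
    bootstrap=True,     # sample rows with replacement
    random_state=0,
)
bagging.fit(X_train, y_train)
print("Bagging test accuracy:", bagging.score(X_test, y_test))
```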


Advantages of Bagging in Machine Learning

* Bagging minimizes the overfitting of data

* It improves the model’s accuracy

* It deals with higher dimensional data efficiently

Random Forest Intuition in Machine Learning

What is a random forest?


* A random forest is a machine learning technique that’s used to solve regression and classification problems. 

* It utilizes ensemble learning, which is a technique that combines many classifiers to provide solutions to complex problems.

* A random forest algorithm consists of many decision trees. The ‘forest’ generated by the random forest algorithm is trained through bagging or bootstrap aggregating.

* Bagging is an ensemble meta-algorithm that improves the accuracy of machine learning algorithms.

* The random forest algorithm establishes the outcome based on the predictions of the decision trees. 

* It predicts by taking the average or mean of the outputs from the various trees (for regression) or the majority vote (for classification).

* Increasing the number of trees increases the precision of the outcome.

* A random forest addresses the limitations of a decision tree algorithm. 

* It reduces the overfitting of datasets and increases precision. 

* It generates predictions without requiring many configurations in packages (like scikit-learn).

* A random forest is a supervised machine learning algorithm that is constructed from decision tree algorithms.

* This algorithm is applied in various industries such as banking and e-commerce to predict behavior and outcomes.


Features of a Random Forest Algorithm

* It’s more accurate than the decision tree algorithm.

* It provides an effective way of handling missing data.

* It can produce a reasonable prediction without hyper-parameter tuning.

* It solves the issue of overfitting in decision trees.

* In every random forest tree, a subset of features is selected randomly at the node’s splitting point.


Working of Random Forest Algorithm

* Before understanding the working of the random forest we must look into the ensemble technique. Ensemble simply means combining multiple models. Thus a collection of models is used to make predictions rather than an individual model.

* Ensemble uses two types of methods:


Bagging– It creates different training subsets from the training data by sampling with replacement, and the final output is based on majority voting. For example, Random Forest.


Boosting– It combines weak learners into strong learners by creating sequential models such that the final model has the highest accuracy. For example, AdaBoost, XGBoost.


Bagging

* Bagging, also known as Bootstrap Aggregation, is the ensemble technique used by random forest. Bagging chooses random samples from the data set.

* Each model is generated from samples (bootstrap samples) drawn from the original data with replacement, a process known as row sampling.

* This step of row sampling with replacement is called bootstrapping. Each model is then trained independently and generates its own result.

* The final output is based on majority voting after combining the results of all models.

* This step, which involves combining all the results and generating the output based on majority voting, is known as aggregation.

Steps involved in random forest algorithm:

Step 1: In a random forest, n random records are taken (with replacement) from a data set having k records.


Step 2: Individual decision trees are constructed for each sample.


Step 3: Each decision tree will generate an output.


Step 4: The final output is obtained by majority voting (for classification) or averaging (for regression), as in the sketch below.
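
A minimal sketch of these four steps using scikit-learn's RandomForestClassifier; the iris dataset and the number of trees are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree sees a bootstrap sample of the records (Step 1), the trees are
# built independently (Step 2), each tree predicts on its own (Step 3), and
# the forest majority-votes those predictions for classification (Step 4).
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("Random forest test accuracy:", forest.score(X_test, y_test))
```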


Important Features of Random Forest

Diversity- Not all attributes/variables/features are considered while building an individual tree, so each tree is different.


Immune to the curse of dimensionality- Since each tree does not consider all the features, the feature space is reduced.


Parallelization- Each tree is created independently from different data and attributes. This means that we can make full use of the CPU to build random forests.


Train-Test split- In a random forest we don’t have to segregate the data into train and test sets, because on average about one-third of the data is never seen by a given tree (the out-of-bag samples) and can be used for validation.


Stability- Stability arises because the result is based on majority voting/ averaging.


Difference Between Decision Tree & Random Forest

Decision trees

------------------------------------------------------------------------------------

1. Decision trees normally suffer from the problem of overfitting if they are allowed to grow without any control.

2. A single decision tree is faster in computation.

3. When a data set with features is taken as input by a decision tree, it formulates a set of rules to make predictions.

Random Forest

-----------------------------------------------------------------------------------

1. Random forests are created from subsets of the data and the final output is based on averaging or majority voting, so the problem of overfitting is taken care of.

2. It is comparatively slower.

3. Random forest randomly selects observations, builds decision trees, and takes the average (or majority) result. It doesn’t rely on a single set of rules; see the comparison sketch below.
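
The contrast can be seen in a small sketch that cross-validates a single unconstrained decision tree against a random forest on the same data; the wine dataset and the settings are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# A single fully grown tree tends to overfit; the forest averages many trees
# built on different bootstrap samples, so it usually generalizes better,
# at the cost of slower training and prediction.
tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("Decision tree CV accuracy:", cross_val_score(tree, X, y, cv=5).mean())
print("Random forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```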

Important Hyperparameters

Hyperparameters are used in random forests to either enhance the performance and predictive power of models or to make the model faster.

The following hyperparameters increase the predictive power:


n_estimators– number of trees the algorithm builds before averaging the predictions.


max_features– the maximum number of features random forest considers when splitting a node.


min_samples_leaf– determines the minimum number of samples required to be at a leaf node.


The following hyperparameters increase the speed:


n_jobs– it tells the engine how many processors it is allowed to use. If the value is 1, it can use only one processor, but if the value is -1 there is no limit.


random_state– controls the randomness of the sampling. The model will always produce the same results if it has a definite value of random_state and if it has been given the same hyperparameters and the same training data.


oob_score– OOB means out-of-bag. It is a random forest cross-validation method in which roughly one-third of the samples are not used to train a given tree and are instead used to evaluate its performance. These samples are called out-of-bag samples. The sketch below shows these hyperparameters in use.
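
A minimal sketch that sets these hyperparameters on scikit-learn's RandomForestClassifier; the dataset and the specific values are illustrative, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)  # illustrative dataset

forest = RandomForestClassifier(
    n_estimators=200,      # trees built before aggregating the predictions
    max_features="sqrt",   # features considered when looking for the best split
    min_samples_leaf=2,    # minimum number of samples required at a leaf node
    n_jobs=-1,             # -1 = use all available processors
    random_state=42,       # fixes the randomness so results are reproducible
    oob_score=True,        # evaluate each tree on its out-of-bag samples
)
forest.fit(X, y)
print("Out-of-bag score:", forest.oob_score_)
```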

Applications of random forest

Some of the applications of the random forest may include:


Banking

Random forest is used in banking to predict the creditworthiness of a loan applicant. This helps the lending institution make a good decision on whether to give the customer the loan or not. Banks also use the random forest algorithm to detect fraudsters.


Health care:

Health professionals use random forest systems to diagnose patients. Patients are diagnosed by assessing their previous medical history. Past medical records are reviewed to establish the right dosage for the patients.


Stock market:

Financial analysts use it to identify potential markets for stocks. It also enables them to identify the behavior of stocks.


E-commerce:

Through random forest algorithms, e-commerce vendors can predict the preferences of customers based on past consumption behavior.


When to avoid using random forests

Random forest algorithms are not ideal in the following situations:


Extrapolation:

Random forest regression is not ideal for extrapolating data. Unlike linear regression, it cannot use existing observations to estimate values beyond the observation range. This explains why most applications of random forest relate to classification.


Sparse data:

Random forest does not produce good results when the data is very sparse. In this case, the subset of features and the bootstrapped sample will produce an invariant space. This will lead to unproductive splits, which will affect the outcome.


Advantages of random forest

* It can perform both regression and classification tasks.

* A random forest produces good predictions that can be understood easily.

* It can handle large datasets efficiently.

* The random forest algorithm provides a higher level of accuracy in predicting outcomes over the decision tree algorithm.

Disadvantages of random forest

* When using a random forest, more resources are required for computation.

* It consumes more time compared to a decision tree algorithm.


Tuesday, November 9, 2021

K Nearest Neighbors (KNN) Intuition

 


K Nearest Neighbor (KNN)


In k-NN regression, the output is the property value for the object.

This value is the average of the values of k nearest neighbors.


What is KNN?


Machine learning models use a set of input values to predict output values.

KNN is one of the simplest forms of machine learning algorithms mostly used for classification.

It classifies a data point based on how its neighbors are classified.


Introduction:


K Nearest Neighbor algorithm falls under the Supervised Learning category and is used for classification (most commonly) and regression. 

It is a versatile algorithm also used for imputing missing values and resampling datasets. As the name suggests, it considers the K nearest neighbors (data points) to predict the class or continuous value for a new data point.
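
For a concrete feel, here is a minimal scikit-learn sketch of KNN classification; the iris dataset and K=5 are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each test point is classified by a majority vote among its K nearest
# neighbours in the training set.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("KNN test accuracy:", knn.score(X_test, y_test))
```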

How to choose the value for K?

Using error curves: Error curves for the training and test data at different values of K illustrate the trade-off. At low K values there is overfitting of the data (high variance), so the test error is high while the training error is low. At K=1 the training error is always zero, because the nearest neighbor to each training point is the point itself. Thus at lower K values the training error is low but the test error is high; this is overfitting. As we increase K, the test error is reduced, but after a certain K value bias (underfitting) is introduced and the test error rises again. So the test error is initially high (due to variance), then drops and stabilizes, and with a further increase in K it increases again (due to bias). The K value at which the test error stabilizes at a low level is considered the optimal value; from such an error curve we might choose, for example, K=8 for our KNN implementation.
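
Since the error-curve figure is not reproduced here, the following sketch estimates such a curve with cross-validation; the dataset and the range of K values are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)  # illustrative dataset

# Estimate the test error for a range of K values; the K at which the error
# stops falling and stabilizes is a reasonable choice.
for k in range(1, 21):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"K={k:2d}  estimated test error={1 - scores.mean():.3f}")
```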



Usage of KNN


The KNN algorithm can compete with the most accurate models because it makes highly accurate predictions.

 Therefore, you can use the KNN algorithm for applications that require high accuracy but that do not require a human-readable model.


The quality of the predictions depends on the distance measure. Therefore, the KNN algorithm is suitable for applications for which sufficient domain knowledge is available. 

This knowledge supports the selection of an appropriate measure.


The KNN algorithm is a type of lazy learning, where the computation for the generation of the predictions is deferred until classification.

 Although this method increases the costs of computation compared to other algorithms, 

 KNN is still the better choice for applications where predictions are not requested frequently but where accuracy is important.


The first and most important consideration is whether we should use the KNN algorithm at all, since many other algorithms can be used for classification. The main advantage of KNN over other algorithms is that KNN can be used for multiclass classification. Therefore, if the data consists of more than two labels, or in simple words if you are required to classify the data into more than two categories, KNN can be a suitable algorithm.


When Data is labelled:


What do we mean when we say data is labelled? It means we already know the target values for the dataset under analysis, and based on this we are trying to make our model learn how to classify future unknown data.


When Data is noise free:

Noisy data is data with a large amount of additional meaningless information in it. This includes data corruption, and the term is often used as a synonym for corrupt data. It also includes any data that a user system cannot understand or interpret correctly. In the data science space, noise means unwanted data items, features or records which don’t help in explaining the feature itself, or the relationship between feature and target. Noise often causes the algorithms to miss patterns in the data.


Noise in tabular data can be of three types:


§ Anomalies in certain data items (Noise 1: certain anomalies in features & target)


§ Features that don’t help in explaining the target (Noise 2: irrelevant/weak features)


§ Records which don’t follow the form or relation which rest of the records do (Noise 3: noisy records)


Note: A feature is an individual measurable property or characteristic of a phenomenon being observed; these are the data features that you use to train your machine learning models. In future posts I will write about feature selection techniques, noise identification, and dealing with noisy data in machine learning.

 

 When Dataset is small:

If the dataset is too big, KNN may underperform, as KNN is a lazy learner and does not learn a discriminative function from the training set.


Limitations of the KNN algorithm:


Since the KNN algorithm internally calculates the distance between points, the time taken for classification will be higher than for other algorithms in certain cases. It is advised to use the KNN algorithm for multiclass classification only if the number of samples is less than 50,000. Another limitation is that feature importance is not available for the KNN algorithm: there is no well-defined, easy way to compute which features are responsible for a classification.


Required Data Preparation:

Data Scaling: To locate the data point in multidimensional feature space, 

it would be helpful if all features are on the same scale. Hence normalization or standardization of data will help.


Dimensionality Reduction: KNN may not work well if there are too many features. 

Hence dimensionality reduction techniques like feature selection, principal component analysis can be implemented.


Missing value treatment: If, out of M features, one feature’s value is missing for a particular example in the training set, then we cannot locate that point or calculate its distance from other points. Therefore deleting that row or imputing the missing value is required.
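
These three preparation steps can be chained in a single pipeline; the sketch below is illustrative (the wine dataset actually has no missing values, so the imputer is shown only to mark where missing-value treatment would sit).

```python
from sklearn.datasets import load_wine
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),   # missing value treatment (illustrative here)
    StandardScaler(),                 # scaling so no feature dominates the distances
    KNeighborsClassifier(n_neighbors=7),
)
pipeline.fit(X_train, y_train)
print("Test accuracy:", pipeline.score(X_test, y_test))
```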

Monday, November 8, 2021

Confusion Matrix using scikit-learn in Python

Understanding True Positive, True Negative, False Positive and False Negative in a Confusion Matrix

True Positive (TP) 

  • The predicted value matches the actual value
  • The actual value was positive and the model predicted a positive value

True Negative (TN) 

  • The predicted value matches the actual value
  • The actual value was negative and the model predicted a negative value

False Positive (FP) – Type 1 error

  • The predicted value was falsely predicted
  • The actual value was negative but the model predicted a positive value
  • Also known as the Type 1 error

False Negative (FN) – Type 2 error

  • The predicted value was falsely predicted
  • The actual value was positive but the model predicted a negative value
  • Also known as the Type 2 error
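
A minimal sketch of computing these four quantities with scikit-learn's confusion_matrix; the breast-cancer dataset and logistic regression model are illustrative stand-ins for any binary classifier.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# For binary labels (0, 1) the matrix is laid out as
# [[TN, FP],
#  [FN, TP]]  (rows = actual class, columns = predicted class).
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```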

"🚀 Delta Lake's Vectorized Delete: The Secret to 10x Faster Data Operations!"

"🚀 Delta Lake's Vectorized Delete: The Secret to 10x Faster Data Operations!" Big news for data engineers! Delta Lake 2.0+ in...