What is a random forest?
* A random forest is a machine learning technique that’s used to solve regression and classification problems.
* It utilizes ensemble learning, which is a technique that combines many classifiers to provide solutions to complex problems.
* A random forest algorithm consists of many decision trees. The ‘forest’ generated by the random forest algorithm is trained through bagging or bootstrap aggregating.
* Bagging is an ensemble meta-algorithm that improves the accuracy of machine learning algorithms.
* The random forest algorithm establishes the outcome based on the predictions of the decision trees.
* It predicts by averaging the outputs of the various trees (or, for classification, by taking a majority vote).
* Increasing the number of trees increases the precision of the outcome.
* A random forest addresses the main limitations of the decision tree algorithm.
* It reduces the overfitting of datasets and increases precision.
* It generates good predictions without requiring much configuration in packages such as scikit-learn (a minimal sketch follows this list).
* A random forest is a supervised machine learning algorithm that is constructed from decision trees.
* This algorithm is applied in various industries such as banking and e-commerce to predict behavior and outcomes.
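A minimal sketch of fitting a random forest classifier with scikit-learn; the iris dataset, the 100-tree setting, and the random seeds are illustrative choices only, not part of any particular application.

```python
# A minimal sketch, assuming scikit-learn is installed; the dataset and
# parameter values below are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each of the 100 trees votes, and the majority class becomes the prediction.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```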
Features of a Random Forest Algorithm
* It’s more accurate than the decision tree algorithm.
* It provides an effective way of handling missing data.
* It can produce a reasonable prediction without hyper-parameter tuning.
* It solves the issue of overfitting in decision trees.
* In every random forest tree, a subset of features is selected randomly at the node’s splitting point.
Working of Random Forest Algorithm
* Before looking at how a random forest works, we must understand the ensemble technique. Ensemble simply means combining multiple models; a collection of models is used to make predictions rather than an individual model.
* Ensemble learning uses two types of methods:
Bagging – creates different training subsets from the sample training data with replacement, and the final output is based on majority voting. Example: random forest.
Boosting – combines weak learners into a strong learner by creating sequential models, such that the final model has the highest accuracy. Examples: AdaBoost, XGBoost. (A sketch of both methods follows.)
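A hedged sketch contrasting the two methods, using scikit-learn's BaggingClassifier and AdaBoostClassifier; the synthetic dataset, base learner, and parameter values are illustrative assumptions (XGBoost is a separate library and is not shown here).

```python
# A sketch of both ensemble methods, assuming scikit-learn; the dataset,
# base learner, and parameter values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: each tree is trained on a bootstrap sample; predictions are combined by voting.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: weak learners (shallow trees) are fit sequentially, each focusing on earlier errors.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for model in (bagging, boosting):
    model.fit(X, y)
    print(type(model).__name__, "training accuracy:", model.score(X, y))
```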
Bagging
* Bagging, also known as bootstrap aggregation, is the ensemble technique used by random forest. Bagging chooses random samples from the data set.
* Hence each model is generated from samples (bootstrap samples) drawn from the original data with replacement, a process known as row sampling.
* This step of row sampling with replacement is called bootstrapping. Each model is then trained independently and generates its own result.
* The final output is based on majority voting after combining the results of all models.
* This step, which involves combining all the results and generating the output based on majority voting, is known as aggregation (see the sketch below).
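A small NumPy sketch of the bootstrap (row sampling with replacement) and aggregation steps just described; the ten toy rows and the five hypothetical model votes are made up purely for illustration.

```python
# A toy NumPy sketch, not tied to any particular library model; the ten
# "rows" and the five hypothetical model votes are made-up values.
import numpy as np

rng = np.random.default_rng(0)
rows = np.arange(10)  # pretend these are the indices of 10 training rows

# Bootstrap (row sampling with replacement): each model gets its own sample.
bootstrap_sample = rng.choice(rows, size=rows.size, replace=True)
print("bootstrap rows:", bootstrap_sample)

# Aggregation: combine the hypothetical predictions of five models by majority vote.
votes = np.array([1, 0, 1, 1, 0])  # one class vote per model for a single input
print("majority vote:", np.bincount(votes).argmax())
```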
Steps involved in random forest algorithm:
Step 1: In a random forest, n random records are taken from a data set of k records.
Step 2: Individual decision trees are constructed for each sample.
Step 3: Each decision tree will generate an output.
Step 4: The final output is based on majority voting (for classification) or averaging (for regression). A from-scratch sketch of these steps follows.
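A rough from-scratch sketch of steps 1–4 for classification, assuming scikit-learn and NumPy; the iris dataset, the 25-tree setting, and the "sqrt" feature subsetting are arbitrary choices.

```python
# A hand-rolled illustration of the four steps; details are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
trees = []

# Steps 1-2: draw n random records with replacement and build one tree per sample.
for _ in range(25):
    sample = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X[sample], y[sample]))

# Steps 3-4: every tree predicts, and the majority vote gives the final output.
all_votes = np.stack([t.predict(X) for t in trees])  # shape: (n_trees, n_samples)
final = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, all_votes)
print("training accuracy of the hand-rolled forest:", (final == y).mean())
```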
Important Features of Random Forest
Diversity- Not all attributes/variables/features are considered while making an individual tree; each tree is different.
Immune to the curse of dimensionality- Since each tree does not consider all the features, the feature space is reduced.
Parallelization-Each tree is created independently out of different data and attributes. This means that we can make full use of the CPU to build random forests.
Train-Test split- In a random forest we don’t have to segregate the data into train and test sets, because roughly a third of the data (the out-of-bag samples) is never seen by a given tree and can be used for evaluation.
Stability- Stability arises because the result is based on majority voting/ averaging.
Difference Between Decision Tree & Random Forest

| Decision tree | Random forest |
| --- | --- |
| Normally suffers from overfitting if it is allowed to grow without any control. | Created from subsets of the data, with the final output based on averaging or majority voting, so the problem of overfitting is taken care of. |
| A single decision tree is faster in computation. | Comparatively slower, since many trees must be built. |
| Given a data set with features as input, it formulates a single set of rules to make predictions. | Randomly selects observations (and features), builds multiple decision trees, and averages or votes on their results; it does not rely on a single set of rules. |
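As a quick illustration of this comparison, the sketch below fits a single decision tree and a random forest on the same split; the breast-cancer dataset and the fixed seeds are arbitrary choices, and exact scores will vary, but the single tree typically fits the training data perfectly while generalizing worse than the forest.

```python
# A comparison sketch, assuming scikit-learn; dataset and seeds are arbitrary.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for model in (DecisionTreeClassifier(random_state=1),
              RandomForestClassifier(n_estimators=100, random_state=1)):
    model.fit(X_train, y_train)
    print(type(model).__name__,
          "train:", round(model.score(X_train, y_train), 3),
          "test:", round(model.score(X_test, y_test), 3))
```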
Important Hyperparameters
Hyperparameters are used in random forests to either enhance the performance and predictive power of models or to make the model faster.
The following hyperparameters increase the predictive power:
n_estimators– the number of trees the algorithm builds before averaging the predictions.
max_features– the maximum number of features the random forest considers when splitting a node.
min_samples_leaf– the minimum number of samples required at a leaf node.
The following hyperparameters increase the speed or control reproducibility:
n_jobs– tells the engine how many processors it is allowed to use. A value of 1 means it can use only one processor, while -1 means there is no limit.
random_state– controls the randomness of the sample. The model will always produce the same results for a fixed random_state value, given the same hyperparameters and the same training data.
oob_score– OOB means out of bag. It is a random forest cross-validation method in which roughly one-third of the sample is not used to train a given tree and is instead used to evaluate its performance. These samples are called out-of-bag samples.
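A short sketch showing these hyperparameters on scikit-learn's RandomForestClassifier; the specific values below are arbitrary examples, not recommendations.

```python
# A configuration sketch; the values are illustrative, not tuned.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

model = RandomForestClassifier(
    n_estimators=200,     # number of trees built before aggregating predictions
    max_features="sqrt",  # features considered when looking for the best split
    min_samples_leaf=2,   # minimum number of samples required at a leaf node
    n_jobs=-1,            # use all available processors
    random_state=42,      # fixed seed, so repeated runs give the same model
    oob_score=True,       # evaluate each tree on its out-of-bag samples
)
model.fit(X, y)
print("Out-of-bag accuracy estimate:", model.oob_score_)
```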
Applications of random forest
Some of the applications of the random forest may include:
Banking
Random forest is used in banking to predict the creditworthiness of a loan applicant. This helps the lending institution make a good decision on whether to give the customer the loan or not. Banks also use the random forest algorithm to detect fraudsters.
Health care:
Health professionals use random forest systems to diagnose patients. Patients are diagnosed by assessing their previous medical history. Past medical records are reviewed to establish the right dosage for the patients.
Stock market:
Financial analysts use it to identify potential markets for stocks. It also enables them to identify the behavior of stocks.
E-commerce:
Through random forest algorithms, e-commerce vendors can predict customers’ preferences based on past consumption behavior.
When to avoid using random forests
Random forest algorithms are not ideal in the following situations:
Extrapolation:
Random forest regression is not ideal for extrapolating data. Unlike linear regression, it cannot use existing observations to estimate values beyond the observation range; its predictions are bounded by the targets seen during training. This explains why most applications of random forest relate to classification.
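A small synthetic sketch of this limitation: a forest trained on a linear trend over x in [0, 10] cannot follow the trend beyond the training range, while linear regression can. The toy data and the query point x = 20 are made up for illustration.

```python
# Extrapolation sketch, assuming scikit-learn; data is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2 * X.ravel()  # a simple linear trend: y = 2x

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
linear = LinearRegression().fit(X, y)

X_new = np.array([[20.0]])  # well outside the training range of [0, 10]
print("random forest:", forest.predict(X_new))      # stays near the largest target seen (about 20)
print("linear regression:", linear.predict(X_new))  # follows the trend (about 40)
```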
Sparse data:
Random forest does not produce good results when the data is very sparse. In this case, the subset of features and the bootstrapped sample will produce an invariant space. This will lead to unproductive splits, which will affect the outcome.
Advantages of random forest
* It can perform both regression and classification tasks.
* A random forest produces good predictions that can be understood easily.
* It can handle large datasets efficiently.
* The random forest algorithm provides a higher level of accuracy in predicting outcomes over the decision tree algorithm.
Disadvantages of random forest
* When using a random forest, more resources are required for computation.
* It consumes more time compared to a decision tree algorithm.