Link: Supervised learning
Decision tree
What is a Decision Tree?
A decision tree is a tree-like flowchart that determines an outcome through a series of decision checks.
- Nodes: The points (boxes) where an attribute is tested for a split
- Edges: The outcome of a split
- Root: The first node that splits
- Leaves: The terminal/final nodes, which predict the outcome
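To make the terminology concrete, here is a minimal sketch (my own, not from the notes) that trains a small tree with scikit-learn and prints its structure; the iris dataset and `max_depth=2` are just for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Keep the tree shallow so the printed structure stays readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# The first test is the root, the indented branches are edges/nodes,
# and the "class:" lines are the leaves.
print(export_text(tree, feature_names=iris.feature_names))
```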
The sequence of decisions
The sequence of decision checks matters, because it determines which attribute the tree splits on first.
Entropy and Information Gain
Mathematical methods for choosing the best split (and therefore the sequence of decisions).
Entropy
A measure of impurity in a set of samples S: H(S) = −Σ pᵢ log₂(pᵢ), where pᵢ is the proportion of samples in class i. Entropy is 0 for a pure set and largest when the classes are evenly mixed.
Information Gain
The reduction in entropy from splitting S on attribute A: IG(S, A) = H(S) − Σᵥ (|Sᵥ| / |S|) · H(Sᵥ), summed over the subsets Sᵥ produced by the split. The attribute with the highest information gain gives the best split.
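Both formulas in plain Python, with a made-up yes/no example (the labels and split groups are invented for illustration):

```python
from collections import Counter
import math

def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the classes present in labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(parent, groups):
    """IG = H(parent) - weighted sum of the children's entropies."""
    total = len(parent)
    children = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(parent) - children

# Splitting on "weather" separates the classes perfectly; "shoes" barely helps.
labels = ["yes", "yes", "yes", "no", "no"]
by_weather = [["yes", "yes", "yes"], ["no", "no"]]
by_shoes = [["yes", "no"], ["yes", "yes", "no"]]

print(information_gain(labels, by_weather))  # ≈ 0.971 (all entropy removed)
print(information_gain(labels, by_shoes))    # ≈ 0.020 (a weak split)
```

The tree would split on weather first, since that split has the higher information gain.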
Random Forest
What is Random Forest?
Random forest is an ensemble method that improves the performance of decision trees by training many trees and combining their predictions.
How it works
Random Sampling of Features
Each tree randomly picks some of the information (features). In implementation, n random records (sampled with replacement) and m random features are taken from a data set of k records.
Individual Decision
Each tree chooses its own splits and generates an output based on its given sample.
Combining
- Classification: Majority Voting
- Regression: Averaging
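A from-scratch sketch of the sampling-and-voting procedure above, using scikit-learn's DecisionTreeClassifier for the individual trees (my own illustration; note that scikit-learn's actual RandomForestClassifier re-samples features at every split rather than once per tree, so this follows the per-tree description in these notes):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n_trees, m = 10, 2  # number of trees; features sampled per tree

trees = []
for _ in range(n_trees):
    rows = rng.integers(0, len(X), size=len(X))           # n random records, with replacement
    cols = rng.choice(X.shape[1], size=m, replace=False)  # m random features
    fitted = DecisionTreeClassifier(random_state=0).fit(X[rows][:, cols], y[rows])
    trees.append((fitted, cols))

# Classification: every tree votes on its own feature subset; the majority wins.
votes = np.stack([tree.predict(X[:, cols]) for tree, cols in trees])
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print(majority[:10])
```

For regression, the voting line would become `votes.mean(axis=0)` instead.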
m value
m is the number of features each tree is allowed to consider; it controls the randomness of feature selection. For classification, it is usually set to √(number of features).
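In scikit-learn, m corresponds to the `max_features` parameter (a detail I'm adding for reference, not from the notes):

```python
from sklearn.ensemble import RandomForestClassifier

# max_features plays the role of m: how many features are considered
# when looking for the best split. "sqrt" means sqrt(n_features), the
# usual choice for classification (and the default in recent versions).
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
```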
ChatGPT example
Different friends give Y/N picnic advice, each based on the different information (features) available to them.
Question
The ChatGPT example is confusing and I don’t think it makes sense. The problem is that the features should have different weights in the decision.
For example, if the weather is going to be bad, then it’s a complete no-go for a picnic and we don’t even need to look at other factors. In that case, any tree that doesn’t have the weather information and votes “Yes” is not valid.
Further reading
Random Forest Algorithms - Comprehensive Guide With Examples
Why do we need random forest?
The problem with decision trees
A single decision tree is often not good at predictive accuracy, because of:
High variance
Small changes in the training data can lead to different splits, and therefore to very different tree structures.
Bagging (bootstrap aggregating): a machine learning procedure that reduces variance by training many trees on bootstrap samples and aggregating their outputs.
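A quick sketch of the variance-reduction effect, comparing one tree against a bagged ensemble (my own example; the dataset choice is arbitrary and the exact scores will vary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# One high-variance tree vs. 100 trees, each fit on a bootstrap sample.
single_tree = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

print("single tree:", cross_val_score(single_tree, X, y).mean())
print("bagged trees:", cross_val_score(bagged, X, y).mean())
```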
Bagged trees: highly correlated
If one feature is a very strong predictor in the data set, most bagged trees will choose it as the top split. This leads to similar tree structures, and therefore highly correlated trees, which we want to avoid: averaging many correlated trees reduces variance far less than averaging uncorrelated ones.
Decorrelates the trees
Random forest decorrelates the trees: by restricting each split to a random subset of the features, the trees become less dependent on each other.
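A small experiment (my own sketch) that makes the correlation visible. With all features available (plain bagging), the trees tend to share the same root split on the dominant feature; restricting each split to √(features) makes the roots vary. The synthetic dataset and tree counts are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A dataset with a few strong features for plain bagging to latch onto.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)

for max_features in (None, "sqrt"):  # None = consider all features (plain bagging)
    forest = RandomForestClassifier(n_estimators=25, max_features=max_features,
                                    random_state=0).fit(X, y)
    roots = [tree.tree_.feature[0] for tree in forest.estimators_]
    print(f"max_features={max_features}: root features = {roots}")
```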