Link: Supervised learning

Decision tree

What is a Decision Tree?

A decision tree is a tree-like flowchart that determines an outcome through a series of decision checks.

  • Nodes: the points (boxes) where an attribute is tested and split
  • Edges: the outcomes of a split
  • Root: the first node that splits
  • Leaves: the terminal/final nodes, which predict the outcome
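
A quick way to see all of these parts: fit a shallow tree and print its structure. A minimal sketch using scikit-learn's iris data (the depth limit is just to keep the printout small):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow tree so the printed structure stays readable
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# The root split appears at the top, internal nodes as nested conditions,
# and leaves as the final "class: ..." lines
print(export_text(tree, feature_names=iris.feature_names))
```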

The sequence of decisions

The sequence of decision checks matters, because it determines which node splits first.

Entropy and Information Gain

The mathematical methods of choosing the best split.

Entropy

A measure of the impurity of a set $S$:

$$H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

where $p_i$ is the proportion of samples in $S$ that belong to class $i$.

Information Gain:

The reduction in entropy from splitting $S$ on attribute $A$:

$$IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$

The attribute with the highest information gain is chosen as the next split.
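
As a concrete check of these formulas, a minimal sketch that computes entropy and information gain with NumPy (the labels and the sunny/not-sunny split below are made up for illustration):

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the classes present in labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, split_mask):
    """Parent entropy minus the weighted entropy of the two child sets."""
    left, right = labels[split_mask], labels[~split_mask]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Hypothetical data: 1 = "go on a picnic", 0 = "stay home"
y = np.array([1, 1, 1, 0, 0, 1, 0, 0])
sunny = np.array([True, True, True, False, False, True, False, False])
print(entropy(y))                  # 1.0 -> perfectly mixed classes
print(information_gain(y, sunny))  # 1.0 -> this split separates the classes perfectly
```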

Random Forest

What is Random Forest?

Random forest is an ensemble method that improves the performance of decision trees.

How it works

Random Sampling of Feature

Each tree randomly picks some of the information (records and features). In implementation, n random records and m features are taken from a data set containing k records.
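
A minimal sketch of that sampling step with NumPy, using the same n, m, and k names as above (note that libraries such as scikit-learn actually redraw the feature subset at every split rather than once per tree):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_features = 100, 10                 # data set: k records, 10 features
X = rng.normal(size=(k, n_features))

n, m = k, 3                             # n records per tree, m features per tree
rows = rng.choice(k, size=n, replace=True)            # bootstrap sample of records
cols = rng.choice(n_features, size=m, replace=False)  # random subset of features
X_tree = X[np.ix_(rows, cols)]          # the sample one tree is trained on
print(X_tree.shape)                     # (100, 3)
```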

Individual Decision

Each tree performs its own splits and generates an output based on the sample it was given.

Combining
  • Classification: Majority Voting
  • Regression: Averaging
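
Putting the three steps together, a minimal from-scratch sketch (all names are hypothetical; scikit-learn trees are used as the base learners). Each tree fits a bootstrap sample, then the forest takes the majority vote for classification; a regressor would average instead:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.choice(len(X), size=len(X), replace=True)  # random sampling
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1_000_000)))
        forest.append(tree.fit(X[idx], y[idx]))              # individual decision
    return forest

def predict_majority(forest, X):
    votes = np.array([t.predict(X) for t in forest]).astype(int)
    # Classification: majority vote across trees (column-wise most common label)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

# Regression would instead average the outputs:
#   np.mean([t.predict(X) for t in forest], axis=0)
```
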
m value

m is the value in classification that controls the randomness in selecting features for each tree (usually set to √(number of features)).
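
In scikit-learn this corresponds to the `max_features` parameter, and "sqrt" is indeed its default for classification:

```python
from sklearn.ensemble import RandomForestClassifier

# Each split considers only sqrt(n_features) randomly chosen features
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
```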

ChatGPT example

Different friends give Y/N picnic advice based on different available information/features

Question

The ChatGPT example is so confusing and I don’t think it makes sense at all. The problem is that the features should have different weightings when making the decision.

For example, if the weather is going to be bad, then it’s a complete no-no for a picnic and we don’t even need to look at other factors. In that case, any tree that doesn’t have the weather feature and votes “Yes” is not valid.

Extended reading

Random Forest Algorithms - Comprehensive Guide With Examples

Why do we need random forest?

The problem with decision trees

A single decision tree is often not good at predictive accuracy, because of:

High variance

Small changes in the training data can produce different splits and therefore very different tree structures.
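
A quick way to see this instability: fit the same model on two bootstrap resamples of one noisy synthetic data set and compare the printed structures, which typically differ (a sketch):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Noisy synthetic data, so small perturbations change the chosen splits
X, y = make_classification(n_samples=100, n_features=5, n_informative=3,
                           random_state=0)

# The same model, fit on two different bootstrap resamples
for seed in (1, 2):
    idx = np.random.default_rng(seed).choice(len(X), size=len(X), replace=True)
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X[idx], y[idx])
    print(export_text(tree), end="\n\n")  # thresholds/features usually differ
```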

Bagging: a machine learning procedure that reduces variance by training many trees on bootstrap samples of the data and combining their outputs

Bagged trees: highly correlated

If a feature is very strong in the data set, most bagged trees will choose this feature as the top split. This leads to similar tree structures and therefore highly correlated trees, which we usually want to avoid.

Decorrelates the trees

Random forest decorrelates the trees: because each tree only considers a random subset of features at each split, the trees become less dependent on each other.
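
The decorrelation knob in scikit-learn is again `max_features`: with None, every tree sees all features at every split (plain bagged trees, so a strong feature dominates them all), while "sqrt" forces variety. A sketch comparing the two (exact scores will vary with the data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=4,
                           random_state=0)

# max_features=None   -> every split sees all features (bagged, correlated trees)
# max_features="sqrt" -> each split sees a random subset (decorrelated forest)
for mf in (None, "sqrt"):
    clf = RandomForestClassifier(n_estimators=100, max_features=mf, random_state=0)
    print(mf, cross_val_score(clf, X, y).mean())
```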