Link: Supervised learning
Decision tree
What is a Decision Tree?
A decision tree is a tree-like flowchart that determines an outcome through a series of decision checks.
- Nodes: The points (boxes) where an attribute is tested for a split
- Edges: The outcome of a split
- Root: The first node that splits
- Leaves: The terminal/final nodes, which predict the outcome
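To make the terminology concrete, here is a minimal sketch (my own, not from the notes) that trains a small tree with scikit-learn and prints its structure; the iris dataset and `max_depth=2` are just for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Keep the tree shallow so the printed structure stays readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# The first test is the root, the indented branches are edges/nodes,
# and the "class:" lines are the leaves.
print(export_text(tree, feature_names=iris.feature_names))
```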
The sequence of decisions
The sequence of decision checks matters, because it determines which attribute the tree splits on first.
Entropy and Information Gain
Mathematical methods for choosing the best split (and therefore the sequence of decisions).
Entropy
A measure of impurity in a set of samples S: H(S) = −Σ pᵢ log₂(pᵢ), where pᵢ is the proportion of samples in class i. Entropy is 0 for a pure set and largest when the classes are evenly mixed.
Information Gain
The reduction in entropy from splitting S on attribute A: IG(S, A) = H(S) − Σᵥ (|Sᵥ| / |S|) · H(Sᵥ), summed over the subsets Sᵥ produced by the split. The attribute with the highest information gain gives the best split.
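Both formulas in plain Python, with a made-up yes/no example (the labels and split groups are invented for illustration):

```python
from collections import Counter
import math

def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the classes present in labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(parent, groups):
    """IG = H(parent) - weighted sum of the children's entropies."""
    total = len(parent)
    children = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(parent) - children

# Splitting on "weather" separates the classes perfectly; "shoes" barely helps.
labels = ["yes", "yes", "yes", "no", "no"]
by_weather = [["yes", "yes", "yes"], ["no", "no"]]
by_shoes = [["yes", "no"], ["yes", "yes", "no"]]

print(information_gain(labels, by_weather))  # ≈ 0.971 (all entropy removed)
print(information_gain(labels, by_shoes))    # ≈ 0.020 (a weak split)
```

The tree would split on weather first, since that split has the higher information gain.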
Random Forest
What is Random Forest?
Random forest is an ensemble method that improves the performance of decision trees by training many trees and combining their predictions.
How it works
Random Sampling of Features
Each tree randomly picks some of the information (features). In implementation, n random records (sampled with replacement) and m random features are taken from a data set of k records.
Individual Decision
Each tree chooses its own splits and generates an output based on its given sample.
Combining
- Classification: Majority Voting
- Regression: Averaging
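A from-scratch sketch of the sampling-and-voting procedure above, using scikit-learn's DecisionTreeClassifier for the individual trees (my own illustration; note that scikit-learn's actual RandomForestClassifier re-samples features at every split rather than once per tree, so this follows the per-tree description in these notes):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n_trees, m = 10, 2  # number of trees; features sampled per tree

trees = []
for _ in range(n_trees):
    rows = rng.integers(0, len(X), size=len(X))           # n random records, with replacement
    cols = rng.choice(X.shape[1], size=m, replace=False)  # m random features
    fitted = DecisionTreeClassifier(random_state=0).fit(X[rows][:, cols], y[rows])
    trees.append((fitted, cols))

# Classification: every tree votes on its own feature subset; the majority wins.
votes = np.stack([tree.predict(X[:, cols]) for tree, cols in trees])
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print(majority[:10])
```

For regression, the voting line would become `votes.mean(axis=0)` instead.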
m value
m is the number of features each tree is allowed to consider; it controls the randomness of feature selection. For classification, it is usually set to √(number of features).
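In scikit-learn, m corresponds to the `max_features` parameter (a detail I'm adding for reference, not from the notes):

```python
from sklearn.ensemble import RandomForestClassifier

# max_features plays the role of m: how many features are considered
# when looking for the best split. "sqrt" means sqrt(n_features), the
# usual choice for classification (and the default in recent versions).
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
```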
ChatGPT example
Different friends give Y/N picnic advice, each based on the different information (features) available to them.
Question
The ChatGPT example is confusing and I don’t think it makes sense. The problem is that the features should have different weights in the decision.
For example, if the weather is going to be bad, then it’s a complete no-go for a picnic and we don’t even need to look at other factors. In that case, any tree that doesn’t have the weather information and votes “Yes” is not valid.
Further reading
Random Forest Algorithms - Comprehensive Guide With Examples
Why do we need random forest?
The problem with decision trees
A single decision tree is often not good at predictive accuracy, because of:
High variance
Small changes in the training data can lead to different splits, and therefore to very different tree structures.
Bagging (bootstrap aggregating): a machine learning procedure that reduces variance by training many trees on bootstrap samples and aggregating their outputs.
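A quick sketch of the variance-reduction effect, comparing one tree against a bagged ensemble (my own example; the dataset choice is arbitrary and the exact scores will vary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# One high-variance tree vs. 100 trees, each fit on a bootstrap sample.
single_tree = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

print("single tree:", cross_val_score(single_tree, X, y).mean())
print("bagged trees:", cross_val_score(bagged, X, y).mean())
```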
Bagged trees: highly correlated
If one feature is a very strong predictor in the data set, most bagged trees will choose it as the top split. This leads to similar tree structures, and therefore highly correlated trees, which we want to avoid: averaging many correlated trees reduces variance far less than averaging uncorrelated ones.
Decorrelates the trees
Random forest decorrelates the trees: by restricting each split to a random subset of the features, the trees become less dependent on each other.
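A small experiment (my own sketch) that makes the correlation visible. With all features available (plain bagging), the trees tend to share the same root split on the dominant feature; restricting each split to √(features) makes the roots vary. The synthetic dataset and tree counts are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A dataset with a few strong features for plain bagging to latch onto.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)

for max_features in (None, "sqrt"):  # None = consider all features (plain bagging)
    forest = RandomForestClassifier(n_estimators=25, max_features=max_features,
                                    random_state=0).fit(X, y)
    roots = [tree.tree_.feature[0] for tree in forest.estimators_]
    print(f"max_features={max_features}: root features = {roots}")
```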