Decision Tree

Find the feature and criteria that best differentiate the data.
Q) How to findd the criteria that best differentiate the data?
- -> Maximize the information gain. (Minimize the Impurity (Entropy))

$Entropy = - \sum\nolimits_i {({p_i}){{\log }_2}({p_i})}$

Entropy $Entropy(A) = \sum\limits_{i = 1}^d {{R_i}\left( { - \sum\limits_{k = 1}^m {{p_k}{{\log }_2}({p_k})} } \right)}$
Gini-Index $G.I(A) = \sum\limits_{i = 1}^d {\left( {{R_i}\left( {1 - \sum\limits_{k = 1}^m {{p_{ik}}^2} } \right)} \right)}$

Decision trees tend to overfit on data with a large number of features. Getting the right ratio of samples to number of features is important, since a tree with few samples in high dimensional space is very likely to overfit.
We have to limit the maximum depth, the maximum number of nodes, minimum number of data for one node to split.

𝐶𝐶(𝑇)=𝐸𝑟𝑟(𝑇)+𝛼×𝐿(𝑇)
- 𝐶𝐶(𝑇): 의사결정나무의 비용 복잡도 (=오류가 적으면서 terminal node 수가 적은 단순한 모델일수록 작은 값)
- 𝐸𝑅𝑅(𝑇): 검증 데이터에 대한 오분류율
- 𝐿(𝑇): terminal node의 수 (구조 복잡도)
- 𝐴𝑙𝑝ℎ𝑎: 𝐸𝑅𝑅(𝑇)와 𝐿(𝑇)를 결합하는 가중치 (사용자에 의해 부여, 보통 0.01~0.1의 값을 사용)

의사결정나무는 한번에 하나씩의 설명변수를 사용하여 예측 가능한 규칙들의 집합을 생성하는 알고리즘.
- 한번 분기 때마다 변수 영역을 두개로 구분한 뒤 각 영역의 순도가 증가(엔트로피가 감소)하는 방향으로 학습을 진행함.
- 입력 변수 영역을 두 개로 구분하는 **재귀적 분기(recursive partioning)**와 구분된 영역을 통합하는 **가지치기(pruning)**의 두가지 과정으로 나뉨.
의사결정나무는 계산복잡성 대비 높은 예측 성능을 내는 장점을 지님.
하지만, 결정경계(decision boundary)가 데이터 축에 수직이어서 특정 데이터에만 잘 작동하는 한계가 있음.
- 이러한 문제를 극복하기 위해 랜덤포레스트와 같은 기법을 사용.
- 랜덤포레스트란?
  - 같은 데이터에 대해 의사결정나무를 여러 개 만들어서 그 결과를 조합해서 예측 성능을 높이는 기법
  - 동일한 데이터로부터 복원추출을 통해 30개 이상의 데이터셋을 만들어서 각각에 의사결정나무를 적욯안 뒤 학습 결과를 취합하는 방식으로 동작

References

Theoretical Explanation
- https://bkshin.tistory.com/entry/%EB%A8%B8%EC%8B%A0%EB%9F%AC%EB%8B%9D-4-%EA%B2%B0%EC%A0%95-%ED%8A%B8%EB%A6%ACDecision-Tree
- https://ratsgo.github.io/machine%20learning/2017/03/26/tree/
Python Implementation of Decision Tree
- https://towardsdatascience.com/construct-a-decision-tree-and-how-to-deal-with-overfitting-f907efc1492d
Scikit-learn documentation
- https://scikit-learn.org/stable/modules/tree.html
MATHLAB Example
- fitctree & fitrtree (instruction & algorithm)
  - https://kr.mathworks.com/help/stats/fitctree.html
  - https://kr.mathworks.com/help/stats/fitrtree.html
- View Decision Tree
  - https://kr.mathworks.com/help/stats/view-decision-tree.html
- Prediction Using Classification and Regression Trees
  - https://kr.mathworks.com/help/stats/prediction-using-classification-and-regression-trees.html

Last updated 3 years ago

Was this helpful?