Decision Trees

Decision trees are a classification algorithm within supervised learning. The algorithm determines a set of questions or tests that will guide it toward a classification of an observation and it organizes a series of attribute tests into a tree-structure to help determine classification of the unlabeled data.

Motivating Question: Given a set of data, can we determine which attributes should be tested first to predict a category or outcome (i.e., which attributes lead to “high information gain”)?

Simple Scenario

Suppose we have:

  • a group of people, each one with a tumor, and
  • two measurements (x, y) for each tumor.

Plotting the data, and coloring the points red for malignant tumors and blue for benign tumors, we might see a plot as follows:

Clearly, something happens near x=3.

With very few errors, we can use x=3 as our “decision” to categorize the tumor as malignant versus benign.

Resulting decision tree:

Unfortunately, it is not always this easy, especially if we have much more complex data. More layers of questions can be added with more attributes.

Example: What should you do this weekend?

Weather Parents Visiting Have extra cash Weekend Activity
Sunny Yes Yes Cinema
Sunny No Yes Tennis
Windy Yes Yes Cinema
Rainy Yes No Cinema
Rainy No Yes Stay In
Rainy Yes No Cinema
Windy No No Cinema
Windy No Yes Shopping
Windy Yes Yes Cinema
Sunny No Yes Tennis

This table can be represented as a tree.

This tree can be made more efficient.

Also with complex data, it is possible that not all features are needed in the Decision Tree.

Decision Tree Algorithms

There are many existing Decision Tree algorithms. If written correctly, the algorithm will determine the best question/test for the tree.

How do we know how accurate our decision tree is?

Decision Tree Evaluation

  • A confusion matrix is often used to show how well the model matched the actual classifications.
    • The matrix is not confusing – it simply illustrates how “confused” the model is!
  • It is generated based on test data.
Previous
Next