Model Insights
Many people say machine learning models are “black boxes”: they can make good predictions, but you can’t understand the logic behind those predictions. That’s true insofar as most data scientists don’t yet know how to extract insights from models. This micro-course shows you techniques to answer questions like these:
- Which features in the data does the model consider most important?
- For any single prediction from a model, how did each feature in the data affect that particular prediction?
- How does each feature affect the model’s predictions in a big-picture sense (what is its typical effect when considered over a large number of possible predictions)?
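Each of these questions is answered with techniques covered later in this micro-course. As a rough preview, here is a minimal sketch of what answering them can look like in code, using widely available tools: scikit-learn’s permutation importance and partial dependence utilities, and the shap package. The file name, column names, and model choice below are placeholder assumptions, not real course data.

```python
import matplotlib.pyplot as plt
import pandas as pd
import shap  # per-prediction explanations
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay, permutation_importance

# Hypothetical tabular dataset: file and column names are assumptions.
data = pd.read_csv("train.csv")
X = data.drop(columns=["target"])
y = data["target"]

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Q1: which features does the model consider most important?
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(pd.Series(result.importances_mean, index=X.columns)
      .sort_values(ascending=False))

# Q2: how did each feature affect one particular prediction (here, row 0)?
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[[0]])  # one contribution per feature

# Q3: what is a feature's typical effect across many predictions?
PartialDependenceDisplay.from_estimator(model, X, features=[X.columns[0]])
plt.show()
```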
Why Are These Insights Valuable?
These insights have many uses, including:
- Debugging.
- Informing feature engineering.
- Directing future data collection.
- Informing human decision-making.
- Building trust.
Debugging
The world has a lot of unreliable, disorganized, and generally dirty data, and your preprocessing code adds another potential source of errors. Add in the potential for target leakage, where training data contains information that won’t be available at prediction time, and errors are the norm rather than the exception in real data science projects.
Given how frequent and potentially disastrous these bugs are, debugging is one of the most valuable skills in data science. Understanding the patterns a model is finding helps you spot when those patterns are at odds with your knowledge of the real world, and that is typically the first step in tracking down bugs.
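As a concrete illustration (a minimal sketch with a hypothetical dataset and column names), a quick look at which features a model leans on can surface exactly this kind of mismatch:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Hypothetical insurance-claims dataset: file and column names are assumptions.
data = pd.read_csv("claims.csv")
X = data.drop(columns=["was_paid"])
y = data["was_paid"]

model = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(pd.Series(result.importances_mean, index=X.columns)
      .sort_values(ascending=False).head())

# A lone feature towering over all the others, say one that is only recorded
# after the outcome happens, is at odds with real-world knowledge: a classic
# sign of target leakage or a preprocessing bug worth investigating.
```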
Informing Feature Engineering
Feature engineering is usually the most effective way to improve model accuracy. It typically involves repeatedly creating new features through transformations of your raw data or of features you have previously created.
Sometimes you can work through this process using nothing but intuition about the underlying topic. But you’ll need more direction when you have hundreds of raw features or lack background knowledge about the topic you are working on.
A Kaggle competition to predict loan defaults gives an extreme example: it had hundreds of raw features, and for privacy reasons the features had names like f1, f2, and f3 rather than common English names. This simulated a scenario where you have little intuition about the raw data.
One competitor found that the difference between two of the features, specifically f527 - f528, created a very powerful new feature. Models including that difference as a feature were far better than models without it. But how might you think of creating this variable when you start with hundreds of variables?
The techniques you’ll learn in this micro-course would make it transparent that f527 and f528 are important features and that their roles are tightly entangled. That directs you to consider transformations of these two variables, and would likely lead you to the “golden feature” f527 - f528.
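Here is a hedged sketch of that workflow. The feature names f527 and f528 come from the competition, but the file name, target column, and model choice are illustrative assumptions:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical loading of the anonymized competition data; the file name and
# target column are assumptions, while f527 and f528 are real feature names.
data = pd.read_csv("loan_default_train.csv")
X = data.drop(columns=["default"])
y = data["default"]

base = GradientBoostingClassifier(random_state=0)
print("without difference:", cross_val_score(base, X, y, cv=3).mean())

# Importance and interaction analysis point at f527 and f528, so try their
# difference as a candidate "golden feature" and compare model quality.
X_plus = X.assign(f527_minus_f528=X["f527"] - X["f528"])
print("with difference:", cross_val_score(base, X_plus, y, cv=3).mean())
```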
As more and more datasets start with hundreds or thousands of raw features, this approach becomes increasingly important.
Directing Future Data Collection
You have no control over datasets you download online. But many businesses and organizations using data science have opportunities to expand the types of data they collect. Collecting new types of data can be expensive or inconvenient, so they only want to do this if they know it will be worthwhile. Model-based insights give you a good understanding of the value of the features you currently have, which helps you reason about which new features would be most helpful.
Informing Human Decision-Making
Some decisions are made automatically by models. Amazon doesn’t have humans (or elves) scurry to decide what to show you whenever you go to their website. But many important decisions are made by humans. For these decisions, insights can be more valuable than predictions.
Building Trust
Many people won’t assume they can trust your model for important decisions without verifying some basic facts. This is a smart precaution given the frequency of data errors. In practice, showing insights that fit their general understanding of the problem will help build trust, even among people without deep knowledge of data science.
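For example (a minimal sketch; the data, file name, and feature name are hypothetical), a single partial dependence plot showing that predicted house prices rise with lot size is the kind of basic fact a stakeholder can verify against their own experience:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

# Hypothetical house-price model; file and column names are assumptions.
data = pd.read_csv("houses.csv")
X = data.drop(columns=["price"])
y = data["price"]

model = RandomForestRegressor(random_state=0).fit(X, y)

# An upward-sloping curve here matches what stakeholders already expect,
# which makes the model's other outputs easier to trust.
PartialDependenceDisplay.from_estimator(model, X, features=["lot_area"])
plt.show()
```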