When Predictive Analytics is a Bad Idea
All the buzz these days seems to be around Predictive Analytics, and many think of predictive modeling as magic that conjures insight from data. Since reports of a downturn in the IT sector, many IT professionals aspire to become predictive modelers. They enroll in courses and learn concepts by rote. When you memorize model accuracy and performance metrics by rote, you never understand their practical significance; you focus only on rules of thumb and cutoffs that you can recite in interviews. Let me give you an example of when the so-called "model performance rules" fail.
Around 4,000 people have been killed by drone strikes in Pakistan since 2004. According to leaked documents published by The Intercept, these drone strikes were informed by the results of a machine learning algorithm. The disastrous consequence is that thousands of innocent people in Pakistan may have been mislabelled as terrorists by the algorithm.
Imagine a world where people have been killed because of some bloody “machine learning algorithm”.
IDENTIFY TERRORISTS WITH PREDICTIVE MODELING
This article reveals to what extent predictive modeling can be used. Predictive modeling is the process of building a model from historical data, using statistical and machine learning techniques, in order to predict future behavior. In this post, we will see how predictive modeling and data mining techniques can be used to identify terrorists.
Identifying Terrorist Attacks with Predictive Modeling
The Australian security agency designed a terror threat system that gives its citizens a clearer idea of whether they should be alert or alarmed. It classifies threats into five levels – Not Expected, Possible, Probable, Expected and Certain.
Likelihood of being a Terrorist
The US National Security Agency (NSA) used a machine learning algorithm to assess each person's likelihood of being a terrorist. They used mobile network metadata from 55 million people in Pakistan to develop a model to identify terrorists.
Target / Dependent Variable – whether a person is a terrorist or not
Predictors / Independent Variables – 80 variables. Some of them are listed in the tables below; a sketch of how such features might be derived from raw metadata follows the tables.
| Travel Patterns |
| --- |
| No. of visits to terrorist states |
| Moved Permanently to terrorist states |
| Overnight Trips |
| Travel on particular day of the week |
| Regular Visits to locations of Interest |
| Travel Phrases |
| Other Predictors |
| --- |
| Low use / incoming calls only |
| Excessive SIM or Handset Swapping |
| Frequent Detach / Power-Down |
| Common Contacts |
| User Location |
| Pattern of Life |
| Social Network |
| Visits to airports |
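As a rough illustration of how predictors like these might be derived from raw call-detail records, here is a minimal sketch in Python with pandas. The column names (`user_id`, `imei`, `timestamp`) and the "excessive swapping" cutoff are assumptions made for illustration, not details from the leaked slides.

```python
import pandas as pd

# Hypothetical call-detail records; column names are assumed for illustration.
cdr = pd.DataFrame({
    "user_id":  ["u1", "u1", "u1", "u2", "u2"],
    "imei":     ["A",  "B",  "C",  "A",  "A"],   # handset identifier
    "timestamp": pd.to_datetime([
        "2012-01-03", "2012-02-10", "2012-03-15",
        "2012-01-05", "2012-06-20",
    ]),
})

features = cdr.groupby("user_id").agg(
    n_calls=("timestamp", "size"),        # proxy for low usage
    n_handsets=("imei", "nunique"),       # SIM / handset swapping
    active_days=("timestamp", lambda s: s.dt.date.nunique()),
)

# Flag "excessive" handset swapping with an arbitrary illustrative cutoff.
features["excessive_swapping"] = features["n_handsets"] >= 3
print(features)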
Number of Events: data from just seven known terrorists.
Number of Non-Events: 100,000 users selected at random.
Random Forest was used as the machine learning algorithm. Not much detail is given in the NSA presentation file, so it is unclear whether they also used a stacking/blending ensemble learning algorithm.
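The slides do not show the training code, but a minimal sketch of what fitting a Random Forest to such an extreme class imbalance (7 events against 100,000 non-events) might look like, using scikit-learn on purely synthetic data, is below. Every dataset detail here is made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: 80 behavioural features, as in the NSA model,
# but with random values -- purely illustrative.
n_events, n_non_events, n_features = 7, 100_000, 80
X = rng.normal(size=(n_events + n_non_events, n_features))
y = np.array([1] * n_events + [0] * n_non_events)

# class_weight="balanced" reweights the 7 positives so they are not
# drowned out; with so few events the model is still extremely fragile.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)
clf.fit(X, y)
print(clf.predict_proba(X[:3])[:, 1])  # "terrorist" scores, first 3 users
```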
Model Results:
1. 50% False Negative Rate: 50% of actual terrorists were incorrectly predicted as non-terrorists.
2. 0.18% False Positive Rate: 0.18% of innocents were incorrectly predicted as terrorists.
A false positive rate of 0.18 percent across 55 million people means roughly 99,000 innocents mislabelled as "terrorists".
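To make the arithmetic behind these rates explicit, here is a short sketch that scales the reported false positive rate to the full population; the figures are the ones quoted above.

```python
population = 55_000_000

# FPR = FP / (FP + TN): the share of innocents wrongly flagged as terrorists.
fpr = 0.0018                 # 0.18%, as reported in the leaked slides
print(fpr * population)      # -> 99000.0 innocents mislabelled

# Even a seemingly negligible FPR is catastrophic at this scale:
print(0.0001 * population)   # -> 5500.0 people at a 0.01% FPR

# FNR = FN / (FN + TP): the 50% figure means half of the known
# terrorists were themselves misclassified as non-terrorists.
```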
In marketing or credit risk models, a 0.18% false positive rate is considered an excellent score, but it is dangerous in the context of human lives. Even a 0.01% false positive rate on a population of 55 million implies 5,500 innocent people potentially being misclassified as "terrorists" and killed.
The highest-rated target according to this machine learning model was Ahmad Zaidan, Al-Jazeera's long-time bureau chief in Islamabad.
Issues / Challenges with this kind of model
- Event Rate: The main issue with the model is that very few events (7 terrorists) were used to train it. Machine learning algorithms require more events than classical statistical techniques; see the sketch after this list.
- Unstructured Data: Huge amounts of data, but unstructured.
- Collaboration between Countries: An official data-sharing security pact is needed.
- Implementation: It is very dangerous to deploy the model and kill someone by blindly following its results.
- Identifying terrorist financing, which provides funds for terrorist activities.
- Profiling people who are educated but involved in terrorist activities.
- Correlating terrorist attacks with trends in geopolitics and money trails.
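To see why seven events is far too few, the sketch below runs leave-one-out evaluation over the positives on synthetic data: each held-out "terrorist" is a large fraction of the training signal, so the measured false negative rate swings wildly. All data here is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Synthetic data: 7 positives, 1,000 negatives (scaled down for speed).
X_pos = rng.normal(loc=0.5, size=(7, 10))
X_neg = rng.normal(loc=0.0, size=(1_000, 10))

misses = 0
for i in range(len(X_pos)):
    # Hold out one positive; train on the remaining 6 plus all negatives.
    X_train = np.vstack([np.delete(X_pos, i, axis=0), X_neg])
    y_train = np.array([1] * 6 + [0] * len(X_neg))
    clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                 random_state=0).fit(X_train, y_train)
    misses += clf.predict(X_pos[i:i + 1])[0] == 0

print(f"leave-one-out FNR: {misses}/7")  # each miss moves the rate by ~14%
```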