Wish Christmas with R

This post is dedicated to all the R lovers. Flaunt your knowledge of R programming in your peer group with the following code.

Method 1: Run the following program and see what I mean.

paste(
  paste(rep(intToUtf8(acos(exp(0)/2)*180/pi + 2^4 + 3*2), 2), collapse = intToUtf8(0)),
  LETTERS[5^(3-1)],
  intToUtf8(atan(1/sqrt(3))*180/pi + 2),
  sep = intToUtf8(0)
)


Method 2 : Audio Wish for Christmas

Turn on computer speakers before running the code.

library(audio)  # provides load.wave() and play()
christmas_file <- tempfile()
download.file("https://sites.google.com/site/pocketecoworld/merrychristmas1.wav", christmas_file, mode = "wb")
xmas <- load.wave(christmas_file)
play(xmas)


When Predictive Analytics is a Bad Idea

All the buzz these days seems to be around predictive analytics. Many think of predictive modeling as magic conjured from data. Since reports of a downturn in the IT sector, many IT professionals aspire to become predictive modelers. They enroll in courses and learn concepts by rote. When you learn model accuracy and performance metrics by rote, you never understand their practical significance. These learners focus only on rules of thumb and cutoffs they can memorize for interview preparation. Let me give you an example of when these so-called “model performance rules” fail.

Around 4,000 people have been killed by drone strikes in Pakistan since 2004. According to documents leaked to The Intercept, these drone strikes were carried out based on results from a machine learning algorithm. The disastrous result is that thousands of innocent people in Pakistan may have been mislabelled as terrorists by the algorithm.

Imagine a world where people have been killed because of some bloody “machine learning algorithm”.


This article reveals to what extent predictive modeling can be used. Predictive modeling is a process in which historical data is used to build a model that predicts future behavior. The process draws on statistical and machine learning techniques. In this post, we will see how predictive modeling / data mining techniques can be used to identify terrorists.

Identify Terrorist Attacks with Predictive Modeling
Terrorist attacks are happening in every part of the world. Every day governments announce new terror alerts. It has become a priority of every government to eradicate terrorism from its country. Some countries have developed analytics-driven software to predict or forecast terrorist attacks. The software identifies patterns in historical data and predicts terrorist activities.

The Australian Security Agency designed a terror alert system that gives citizens a clearer idea of whether they should be alert or alarmed. It classifies threats into five levels: Not Expected, Possible, Probable, Expected and Certain.

Likelihood of being a Terrorist
The US National Security Agency used a machine learning algorithm to assess each person’s likelihood of being a terrorist. They used mobile network metadata from 55 million people in Pakistan to develop a model to identify terrorists.


Target / Dependent Variable: whether a person is a terrorist or not
Predictors / Independent Variables: 80 variables. Some of them are listed below.

Travel Patterns
  • No. of visits to terrorist states
  • Moved permanently to terrorist states
  • Overnight trips
  • Travel on a particular day of the week
  • Regular visits to locations of interest
  • Travel phrases
Other Predictors
  • Low use / income calls only
  • Excessive SIM or handset swapping
  • Frequent detach / power-down
  • Common contacts
  • User location
  • Pattern of life
  • Social network
  • Visits to airports
Data Preparation

Number of events: data from just seven known terrorists.
Number of non-events: 100,000 users selected at random.


Random Forest was used as the machine learning algorithm. Not much detail is given in the NSA presentation file, so it is unclear whether they used a stacking/blending ensemble.

Model Results:

1. 50% false negative rate: 50% of actual terrorists were incorrectly predicted as non-terrorists.

2. 0.18% false positive rate: 0.18% of innocent people were incorrectly predicted as terrorists.

A false positive rate of 0.18 percent across 55 million people would mean about 99,000 innocents mislabelled as “terrorists”.

In marketing or credit risk models, a 0.18% false positive rate is considered an excellent score. But it is dangerous in the context of human lives. Even a 0.01% false positive rate across a population of 55 million implies 5,500 innocent people potentially being misclassified as “terrorists” and killed.
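The arithmetic behind these figures is easy to verify; a quick sketch in R:

```r
population <- 55e6    # people whose mobile metadata was analysed
fp_rate <- 0.0018     # 0.18% false positive rate

fp_rate * population  # 99,000 innocents mislabelled as "terrorists"
1e-4 * population     # even a 0.01% rate still flags 5,500 people
```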

The highest-rated target according to this machine learning model was Ahmad Zaidan, Al-Jazeera’s long-time bureau chief in Islamabad.

Issues / Challenges with this kind of model

  1. Event rate: The main issue with the model is that very few events (7 terrorists) were used to train it. Machine learning algorithms require more events than classical statistical techniques.
  2. Unstructured data: A huge amount of data, but unstructured.
  3. Collaboration between countries: An official data-sharing security pact is required.
  4. Implementation: It is very dangerous to implement the model and kill someone by blindly following its results.

Several areas where we can leverage analytics to identify terrorist activities:

  1. Identifying terrorist financing, which provides funds for terrorist activities
  2. Profiling people who are educated but involved in terrorist activities
  3. Correlating terrorist attacks with trends in geopolitics and money trails

For the original article, click here.

Learn Python in 3 days : Step by Step Guide

This tutorial helps you get started with Python. It is a step-by-step practical guide to learning Python by example. Python is an open source language widely used for high-level, general-purpose programming, and it has gained great popularity in the data science world. As the data science domain keeps growing, IBM recently predicted that demand for data science professionals will rise by more than 25% by 2020. In the PyPL Popularity of Programming Language index, Python holds second rank with a 14 percent share, and it is ranked among the top 3 programming languages for advanced analytics.

Table of Contents

  1. Getting Started with Python
    • Python 2.7 vs. 3.6
    • Python for Data Science
    • How to install Python?
    • Spyder Shortcut keys
    • Basic programs in Python
    • Comparison, Logical and Assignment Operators
  2. Data Structures and Conditional Statements
    • Python Data Structures
    • Python Conditional Statements
  3. Python Libraries
    • List of popular packages (comparison with R)
    • Popular python commands
    • How to import a package
  4. Data Manipulation using Pandas
    • Pandas Data Structures – Series and DataFrame
    • Important Pandas Functions (vs. R functions)
    • Examples – Data analysis with Pandas
  5. Data Science with Python
    • Logistic Regression
    • Decision Tree
    • Random Forest
    • Grid Search – Hyper Parameter Tuning
    • Cross Validation
    • Preprocessing Steps

Select Important Variables using Boruta Algorithm

This article explains how to select important variables using the boruta package in R. Variable selection is an important step in a predictive modeling project; it is also called ‘feature selection’. Every private and public agency has started tracking data and collecting information on various attributes, resulting in access to too many predictors for a predictive model. But not every variable is important for a particular prediction task, so it is essential to identify the important variables and remove the redundant ones. Before building a predictive model, the exact list of important variables that yields an accurate and robust model is generally not known.

Why is Variable Selection Important?

  1. Removing a redundant variable helps improve accuracy. Similarly, including a relevant variable has a positive effect on model accuracy.
  2. Too many variables may result in overfitting, which means the model is not able to generalize patterns.
  3. Too many variables lead to slow computation, which in turn requires more memory and hardware.
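The overfitting point is easy to demonstrate with a small simulation in base R (the data below is made up): with 25 irrelevant predictors and only 30 observations, a linear model reports a high R-squared even though the target is pure noise.

```r
set.seed(1)
n <- 30
y <- rnorm(n)                                     # target: pure noise
noise <- as.data.frame(matrix(rnorm(n * 25), n))  # 25 irrelevant predictors

fit <- lm(y ~ ., data = noise)
summary(fit)$r.squared  # deceptively high despite no real signal
```

On held-out data such a model would predict poorly, which is exactly why redundant variables should be removed before modeling.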

Why the Boruta Package?

There are a lot of packages for feature selection in R, so the question arises: what makes the boruta package so special? See the following reasons to use the boruta package for feature selection.

  1. It works well for both classification and regression problems.
  2. It takes into account multi-variable relationships.
  3. It is an improvement on the random forest variable importance measure, which is a very popular method for variable selection.
  4. It follows an all-relevant variable selection method, in which it considers all features relevant to the outcome variable, whereas most other variable selection algorithms follow a minimal-optimal method, relying on a small subset of features that yields a minimal error on a chosen classifier.
  5. It can handle interactions between variables.
  6. It can deal with the fluctuating nature of a random forest importance measure.
Basic Idea of Boruta Algorithm

Shuffle the predictors’ values, join the shuffled copies with the original predictors, and then build a random forest on the merged dataset. Then compare the original variables with the randomised variables to measure variable importance. Only variables with higher importance than the randomised variables are considered important.

How the Boruta Algorithm Works
Follow the steps below to understand the algorithm:

  1. Create duplicate copies of all independent variables. When the number of independent variables in the original data is less than 5, create at least 5 copies using existing variables.
  2. Shuffle the values of the added duplicate copies to remove their correlations with the target variable. These are called shadow features or permuted copies.
  3. Combine the original variables with the shuffled copies.
  4. Run a random forest classifier on the combined dataset and compute a variable importance measure (the default is Mean Decrease Accuracy) for each variable, where higher means more important.
  5. Then a Z score is computed as the mean of accuracy loss divided by the standard deviation of accuracy loss.
  6. Find the maximum Z score among shadow attributes (MZSA).
  7. Tag variables as ‘unimportant’ when their importance is significantly lower than MZSA, and permanently remove them from the process.
  8. Tag variables as ‘important’ when their importance is significantly higher than MZSA.
  9. Repeat the above steps for a predefined number of iterations (random forest runs), or until all attributes are tagged either ‘unimportant’ or ‘important’, whichever comes first.
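Steps 1-3 and the Z score from step 5 can be sketched in base R. The data and accuracy-loss numbers below are made up purely for illustration; in practice the boruta package performs the random forest runs and significance tests internally.

```r
set.seed(42)
# Toy data with two original predictors
original <- data.frame(x1 = rnorm(100), x2 = runif(100))

# Steps 1-2: duplicate every predictor and shuffle each copy,
# destroying any correlation with the target
shadows <- as.data.frame(lapply(original, sample))
names(shadows) <- paste0("shadow_", names(original))

# Step 3: combine the originals with their shadow features
combined <- cbind(original, shadows)
names(combined)  # "x1" "x2" "shadow_x1" "shadow_x2"

# Step 5: Z score = mean accuracy loss / sd of accuracy loss
acc_loss <- c(0.020, 0.030, 0.025, 0.035, 0.030)  # hypothetical per-run losses
z_score <- mean(acc_loss) / sd(acc_loss)
```

With the package itself, the whole procedure reduces to a call such as `res <- Boruta(y ~ ., data = your_data)` followed by `getSelectedAttributes(res)`, where `y` and `your_data` are placeholders for your own target and dataset.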

Difference between Boruta and Random Forest Importance Measure

When I first learnt this algorithm, the question ‘RF importance measure vs. Boruta’ puzzled me for hours. After reading a lot about it, I figured out the exact difference between these two variable selection approaches.

In random forest, the Z score is computed by dividing the average accuracy loss by its standard deviation, and it is used as the importance measure for all variables. But we cannot use the Z score calculated in random forest as a measure of variable importance, because it is not directly related to the statistical significance of that importance. To work around this problem, the boruta package runs random forest on both the original and the random attributes and computes the importance of all variables. Since the whole process depends on permuted copies, the random permutation procedure is repeated to obtain statistically robust results.

Is Boruta a solution for all?

The answer is no; you need to test other algorithms as well. It is not possible to judge the best algorithm without knowing the data and assumptions. That said, since Boruta is an improvement on the random forest variable importance measure, it should work well most of the time.

Check out the original article, Feature selection with Boruta in R, to see an implementation of the Boruta algorithm in R and a comparison with other feature selection algorithms.
