Sophisticated data strategies call for better analytical features and language diversity

As we move into 2018, Analytical Data Infrastructure (ADI) is becoming a significant topic in business intelligence and analytics. Where Big Data was once an over-hyped, catch-all term, in the coming year we will see organisations move to a place where business-oriented ‘data strategies’ are the major focus. With that shift comes the need for sophisticated, yet easy to use, data science approaches that deliver results back to the business.

It is a point backed up by the 2018 Global Dresner Market Study for Analytical Data Infrastructure (ADI). The highly-regarded report revealed the key priorities for businesses for their data analytics and business intelligence efforts. From deployment to loading priorities, data preparation, modelling and management of data associated with ADI, the study captured the most important and current market trends driving the intelligent adoption of Data Science.


The Dresner report explored the ways in which end-users are planning to invest in ADI technology in the year ahead, along with the considerations behind implementation and use-cases. While security and performance were listed as the top two priorities for businesses, an interesting finding was that the biggest year-on-year change was the growing importance of easy access to, and use of, analytical features and programming languages such as R, machine learning technology and MapReduce analytics.


Businesses have woken up to the fact that there is value in their data. With the right tools, they can extract that value – tapping into insight to improve the way they sell to their customers, or to streamline business processes and reduce costs.


But often, data has to be extracted, cleansed and transferred to other systems. In most companies, the Business Intelligence competence centres are separate from the Data Science teams, and the two rarely work closely together. Modern analytics platforms combine these two worlds and allow you to run SQL-based data analytics, MapReduce algorithms and data science languages such as R or Python side by side. Many database vendors offer such capabilities, and some have even integrated these languages tightly into their databases, allowing organisations to run data science on huge data sets.


While cleansing the data and finding the right models are repetitive tasks that can be run on smaller data sets, high-performance in-memory computing can make a vast difference when applying the resulting R or Python models to billions of user records in near real time.


Letting analysts use the data science tools of choice

Data analysts have their favoured analytics and visualisation tools, which leads either to a wide spread of different tools that have to be integrated and maintained in the data management ecosystem, or to people not cooperating with each other. Further, the actual data science scripting language is often a personal preference. Each language has its own strengths and weaknesses in relation to the complexity of the task or the features that the language offers.


As we move from an era of descriptive analytics (looking at past trends) to predictive (looking to the future) and even to prescriptive analytics (finding the best course of action to meet key performance indicators) for the most advanced analysis, the combination of AI and standard SQL analytics can create more agility and efficiency in finding the right insights in data.


The good news is that there are platforms that combine any data science language within the same system, and combine it with standard database technologies. Exasol version 6.0, for instance, has an open-sourced integration framework that allows you to install any programming language and use it directly inside the SQL database. Pre-shipped languages are R, Python, Java and Lua, but you can also create containers for Julia, Scala, C++ or the language of your choice.
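To make the idea of calling a data science language from inside SQL concrete, here is a minimal sketch. It does not use Exasol's integration framework; it uses Python's built-in sqlite3 module as a stand-in, and the table, the column names and the `churn_score` function are all hypothetical.

```python
import sqlite3


def churn_score(days_inactive, orders):
    """Hypothetical stand-in for a trained model: higher means more at risk."""
    return round(days_inactive / (days_inactive + 10.0 * orders + 1.0), 3)


# In-memory database with a small, made-up customer table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (name TEXT, days_inactive INTEGER, orders INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("ann", 3, 12), ("bob", 90, 1), ("cy", 30, 4)],
)

# Register the Python function so that plain SQL can call it by name.
conn.create_function("churn_score", 2, churn_score)

# An analyst who only knows SQL can now query the model output directly.
rows = conn.execute(
    "SELECT name, churn_score(days_inactive, orders) AS score "
    "FROM customers ORDER BY score DESC"
).fetchall()

for name, score in rows:
    print(name, score)
```

The point of the sketch is the division of labour: the scoring logic lives in Python, while the query that consumes it stays ordinary SQL.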


Did you ever think it would be possible to give regular SQL analysts access to data science results? Or that it would be possible to conduct powerful data processing in SQL rather than in a programming language? This leads to more flexibility and, above all, to exceptional performance.


Technology has to follow your strategy

It will be interesting to see how data science technology evolves over time, and how companies move to leverage all possible ways of creating insights, predictions and automated prescriptions out of all kinds of data. This is not just a question of people’s skill sets or certain algorithms, but also of the right architecture for your data ecosystem. It should facilitate data storage, standard reporting and data processing, artificial intelligence and a flexible way of adjusting to future trends in an open, extensible platform.


The technology should be available in the necessary ways – from a free downloadable solution to let developers play around on their laptops, to high-performance on-premise systems in your secured data center up to the standard public cloud platforms such as Amazon, Azure or Google. The technology should follow your data strategy, not the other way round.

Hilarious Graphs (and Pirates) Prove That Correlation Is Not Causation

When it comes to storytelling, we have a problem.

It’s not our fault though – as human beings we are hard-wired from birth to look for patterns and explain why they happen. This problem doesn’t go away when we grow up; it becomes worse the more intelligent we think we are. We convince ourselves that now we are older, wiser and smarter, our conclusions are closer to the mark than when we were younger (the faster the wind blows the faster the windmill blades turn, not the other way around).

Even really smart people see a pattern and insist on putting an explanation to it, even when they don’t have enough information to reach such a conclusion. They can’t help it.

This is the thing about being human. We seek explanation for the events that happen around us. If something defies logic, we try to find a reason why it might make sense. If something doesn’t add up, we make it up.

The Post Hoc Fallacy
Ever heard the Latin expression Post Hoc, Ergo Propter Hoc, meaning ‘After this, therefore because of this’? Of course you have – it is the basis of the saying ‘Correlation Does Not Imply Causation’. It is also known in statistics as the Post Hoc Fallacy, and is a very familiar trap that we all fall into from time to time. This is the idea that when things are observed to happen in sequence, we infer that the thing that happened first must have caused the thing that happened next.

The Post Hoc Fallacy is what causes a football manager to only wear purple socks on match days. He once wore them at a match and his team won. Obviously, it was the socks that did it. Now he fears that if he doesn’t wear them to a match the team might lose. Damn those stinky purple socks (he also daren’t wash them for fear of the magic pixie dust washing out).

Post Hoc is also what made rain men indispensable to the tribe – the tribe believed that their rain man could make it rain. Spotting the clouds brewing in the distance, the rain man dances until it pours down. It doesn’t usually take more than three or four days of dancing until the inevitable happens. “Rain man dance, water fall from sky”. It’s just a good job for the rain man that the Indians couldn’t speak Latin, otherwise he’d have been in real trouble…

The Post Hoc Piracy Postulation
For a humorous view of the Post Hoc Fallacy, let’s take a look at Pastafarianism. It’s all the rage these days. Not heard of it? It’s one of the newest and fastest growing religions on the block. Pastafarian Sparrowism, to give it its full title, is a ‘vibrant religion that seeks to bring the Flying Spaghetti Monster’s fleeting affection to all of us, through the life of His Prophet, Captain Jack Sparrow’. Seriously, they’re not joking. Well, actually, they are. They promote a light-hearted view of religion and oppose the teaching of intelligent design and creationism in public schools. They also maintain that pirates are the original Pastafarians.

In an effort to illustrate that correlation does not imply causation, the founder, Bobby Henderson, presented the argument that global warming is a direct effect of the shrinking number of pirates since the 1800s, and accompanied it with this graph:

Pirates Caused Global Warming. Honest…

Wow, look at that straight line, I hear you all say – there’s clearly a correlation between the decline in the numbers of pirates and the rise in global temperatures, so there just must be a causal connection here, mustn’t there? Yup, you’ve all just fallen for the Post Hoc Fallacy (I just knew you would).

Just because there is a straight line on the graph it doesn’t necessarily follow that one thing caused the other, particularly when you’ve grabbed two seemingly unconnected variables at random and stuck them together to see whether there might be some sort of tenuous correlation between them.

In the case of pirates and global warming, take a closer look at the labels on the x-axis. Notice something strange? Apart from the fact that the proportions of neighbouring data points are all out of whack, there is also the issue that a couple of them have been humorously disordered to deliberately deceive.

I don’t know about you, but I’m a believer! As soon as I’ve hit the ‘Publish’ button I’m giving up stats for a life as a pirate on the open seas. I’ll stop global warming if it’s the last thing I do.

It probably will be…


This blog post is an extract from the witty new book Truth, Lies and Statistics, FREE at Amazon

Here’s the blurb:

Pirates, cats, Mexican lemons and North Carolina lawyers.
Cheese consumption, margarine and drowning by falling out of fishing boats.

This book has got it all.

In this eye-opening book, award winning statistician and author Lee Baker uncovers the key tricks used by statistical hustlers to deceive, hoodwink and dupe the unwary.

Written as a layman’s guide to lying, cheating and deceiving with data and statistics, there’s not a dull page in sight!

A roller coaster of a book in 8 witty chapters, this might just be the most entertaining statistics book you’ll read this year.

Discover the exciting world of statistical cheating and persuasive misdirection.

The Organic Autism Correlation Conundrum
If you look online there are all sorts of humorous graphs that prove the Post Hoc Fallacy. Over the past 20 years or so, there’s been a huge increase in the anti-vaccine movement, particularly in the US, and there have been all sorts of spurious correlations that have been ‘discovered’ that ‘prove’ that there is a causal link between vaccination programmes and autism. At the same time, to debunk the most crackpot of the theories, other – equally ridiculous – correlations have popped up too.

There was one that was published that showed the correlation between sales of organic food in the US and diagnosis of autism:

Organic Food Causes Autism. Oh My…

There is a very close correlation between the pair of plot lines, even accompanied by a very large r-value (close to 1) and a very small p-value (close to 0). The suggestion is that – if we trust that correlation does imply causation – a much closer correlation exists between organic food and autism than any other theory that currently exists, so therefore it must be the cause. Except that correlation does not necessarily imply causation, and organic food does not cause autism. That would be ridiculous. And that is the whole point of these graphs. All you need to do is find any pair of variables that increase over the same time period, plot them on a graph with the same x-axis and different y-axes, adjust the y-axis scales until the plot lines coalesce, and – BOOM – correlation! If, by some magic of coincidence and fate, there is a statistical correlation, then publish the p-value that goes along with it as additional proof. What this does is prove that the correlation exists, but it does not prove that one thing causes the other. It might, but then again it might not…
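The recipe above is easy to try for yourself. The sketch below uses two made-up series (they are not the organic food or autism figures) that both drift upwards over ten periods for unrelated reasons. The raw series correlate strongly; once the shared trend is removed by looking at period-on-period changes, the relationship disappears.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: an upward trend plus a small, unrelated wobble in each.
series_a = [10 + 2.0 * i + (1 if i % 2 else -1) for i in range(10)]
series_b = [500 + 35.0 * i + (20 if i % 3 == 0 else -10) for i in range(10)]

# Correlation of the raw series is dominated by the shared upward trend.
r_raw = pearson_r(series_a, series_b)

# Correlation of the period-on-period changes, with the trend taken out.
diff_a = [b - a for a, b in zip(series_a, series_a[1:])]
diff_b = [b - a for a, b in zip(series_b, series_b[1:])]
r_diff = pearson_r(diff_a, diff_b)

print(f"raw series: r = {r_raw:.3f}")
print(f"changes:    r = {r_diff:.3f}")
```

Differencing is only one of several ways to take out a trend; it is used here because it needs nothing beyond subtraction.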

The Lemon Fatality Correlation Convergence
I also quite enjoyed the correlation that proved that Mexican lemons are a major cause of deaths on US roads. Wait, what? I must have missed the news that day – Mexican lemons are killing Americans? You bet!

Take a look at a plot of the number of fresh lemons imported into the USA from Mexico versus the total fatality rate on US highways between 1996 and 2000:

Mexican Lemons Kill Americans!

My, my, just look at the R-squared value – it really must be true. Although the graph seems to be telling us that the more Mexican lemons there are in the US the fewer road deaths there are, the inescapable conclusion is that MEXICAN LEMONS KILL AMERICANS! What should we do about it? Should we import more Mexican lemons (the correlation tells us that this is what we should do)? Or should we ban Mexican lemons altogether? After all, if there are no Mexican lemons on the streets then they can’t kill any more Americans.
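Part of the joke is that R-squared throws away the direction of the relationship. A minimal sketch with invented numbers (not the figures behind the lemon chart) shows a fitted line with a negative slope that still earns a very high R-squared:

```python
# Hypothetical values: imports rise while the fatality rate falls.
lemon_imports = [100, 150, 200, 250, 300]
fatality_rate = [16.0, 15.6, 15.3, 14.8, 14.5]

n = len(lemon_imports)
mean_x = sum(lemon_imports) / n
mean_y = sum(fatality_rate) / n

# Sums of squares and cross-products around the means.
s_xy = sum((x - mean_x) * (y - mean_y)
           for x, y in zip(lemon_imports, fatality_rate))
s_xx = sum((x - mean_x) ** 2 for x in lemon_imports)
s_yy = sum((y - mean_y) ** 2 for y in fatality_rate)

# Least-squares slope keeps the sign; R-squared does not.
slope = s_xy / s_xx
r_squared = (s_xy * s_xy) / (s_xx * s_yy)

print(f"slope     = {slope:.4f}")
print(f"R-squared = {r_squared:.4f}")
```

A high R-squared tells you how tightly the points hug a line, not which way the line points, and certainly not why.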

What utter tosh! I don’t care if there is a correlation, there is nothing to suggest that lemons cause accidents. If there was, don’t you think that lemons would be causing accidents on Mexican roads before the trucks crossed into the US? What about Sicilian lemons? Do they cause road deaths in Italy and across Europe?

Oh, the power of correlations. As long as your audience doesn’t understand that correlation is not causation you can make them believe pretty much anything.

About the Author

Lee Baker is an award-winning software creator who lives behind a keyboard in a darkened room. Illuminated only by the light from his monitor, he aspires to finding the light switch.

With decades of experience in science, statistics and artificial intelligence, he has a passion for telling stories with data. Despite explaining it a dozen times, his mother still doesn’t understand what he does for a living.

Insisting that data analysis is much simpler than we think it is, he authors friendly, easy-to-understand blogs and books that teach the fundamentals of data analysis and statistics.

His mission is to unleash your inner data ninja!

As the CEO of Chi-Squared Innovations, one day he’d like to retire to do something simpler, like crocodile wrestling.

PS – Don’t forget to connect with me on Twitter: @eelrekab


Aspiring Data Scientists – Get Hired!

Working in Data Science recruitment, we’re no strangers to the mountains you have to climb and the pitfalls you face when getting into a Data Science career. Despite the mounting demand for Data Science professionals, it’s still an extremely difficult career path to break into. The most common complaints we see from candidates who have faced rejection are lack of experience, education level requirements, lack of opportunities for freshers, and overly demanding and confusing job role requirements.


First of all, let’s tackle what seems to be the hardest obstacle to overcome: lack of experience. This is a complex one and not just applicable to Data Science; across professions it’s a common complaint that entry-level jobs ask for years’ worth of experience. Every company wants an experienced data scientist, but with the extremely fast emergence of the field and growing demand for professionals, there are not enough to go around! Our advice for anyone trying to get into Data Science who is lacking experience is to try to get an internship by contacting companies directly. Sometimes you will find these types of positions available through recruiters, but you will no doubt have more luck going direct.

Another approach is to have a go at Kaggle competitions, write code and put it on GitHub for people to see. There are many ways you can gain experience in your spare time, outside a business setting, in a way that a hiring manager will notice. If you have the time, think about offering free consultations to friends or businesses and build on opportunities like that. Go beyond publishing code on GitHub, and write a detailed post about your analysis and code on a blog, data site or even LinkedIn. This gives you even more exposure and demonstrates your deep understanding of what you do. People with heaps of experience also get rejected due to ‘lack of experience’, and the truth is that this often means a lack of experience applicable to the role you’re applying for. To overcome these obstacles, make sure you’re reading job descriptions properly, researching the company and tailoring your resume to highlight how you are what they’re looking for.

Deciphering Job Descriptions

The growing demand for Data Scientists in a number of different industries, specializing in different fields, means that it can be difficult for employers to define a reasonable, ‘blanket’ skill set, which can lead to a lot of confusion for those starting out. Beyond knowing that a good Data Scientist needs to be a critical thinker, analytically minded, a great communicator and passionate about the field, the technical requirements and experience needed can vary greatly between roles and companies. Try not to be overwhelmed when looking at job descriptions. It’s important to remember that many companies will list more skills and experience in the job description than are actually needed. So, even if you hold only half of the skills they’re asking for, but make up for the rest in willingness to learn, passion for the role or transferable skills, then go for it – don’t be put off. If you’re not confident in doing so, try to spot the patterns in what is being asked for, highlight the top required skills for the roles you want to apply for, and take some time to get better at these.

Reaching out

Many professionals, whilst having the qualifications needed, lack the basic skills needed when it comes to communicating with hiring managers and recruiters. Commenting on LinkedIn posts asking for a review of your profile is not going to cut it, I’m afraid. Reach out directly to those that are posting the job adverts or, if it’s a company, do some research and find the hiring manager or recruitment team. They’ll appreciate the direct approach, and you’ll be able to provide more information on why you should be considered for the role. Commenting publicly might seem like a good way to get noticed, as CVs can get lost in the mountains that recruiters receive… but this is where resume skills come into play, along with knowing how to get yours noticed.

Resume skills

You’ve more than likely got some great points on your CV, experiences, and projects that are noteworthy but often, your CV will also be littered with irrelevant information to pad it out – especially if you’re just starting out. Our advice? Get rid of the filler, get to the point and highlight how you can make a difference where you’re applying to.

Make sure your skills, experience, and projects tell the hiring manager that you have the tools necessary to make an impact on their business, and show how applying these techniques in the past has produced x, y, z results. Quantify these results – how did they benefit the company in terms of revenue, ROI, time saved or costs? Tailor your CV; don’t just send generic ones out. Exhibit your understanding of the fundamentals and show that you have proficient knowledge of the foundations of data science, and the rest will follow.

The layout is also important: hire a designer or put in some hours on the free platforms out there that can help with this. Even in Word, you can create an interesting, eye-catching layout! You can see more on mastering your resume here.

Another great way to soak up as much information about Data Science as possible is to follow influencers in your field on social media, especially LinkedIn – there are often really insightful posts, and you can reach out to the data science community, learn new things, post questions and see current opportunities available.

Have you any other top tips for getting into Data Science? Please share in the comments!

When Variable Reduction Doesn’t Work

Summary:  Exceptions sometimes make the best rules.  Here’s an example of well accepted variable reduction techniques resulting in an inferior model and a case for dramatically expanding the number of variables we start with.


One of the things that keeps us data scientists on our toes is that the well-established rules-of-thumb don’t always work.  Certainly one of the most well-worn of these rules is the parsimonious model: always seek to create the best model with the fewest variables.  And woe to you who violate this rule.  Your model will overfit, include false random correlations, or at the very least will just be judged to be slow and clunky.

Certainly this is a rule I embrace when building models so I was surprised and then delighted to find a well conducted study by Lexis/Nexis that lays out a case where this clearly isn’t true.


A Little Background

In highly regulated industries like insurance and lending the variables that are allowed for use are highly regulated as are the modeling techniques.  Techniques are generally limited to those that are highly explainable, mostly GLM and simple decision trees.  Data can’t include anything that is overtly discriminatory under the law so, for example, race, sex, and age can’t be used, or at least not directly.  All of this works against model accuracy.

Traditionally what agencies could use to build risk models has been defined as ‘traditional data’, that which the consumer has submitted with their application and the data that can be added from the major credit rating agencies.  In this last case Experian and the others offer some 250 different variables and except for those that are specifically excluded by law, this seems like a pretty good sized inventory of predictive features.

But in the US and especially abroad the market contains many ‘thin-file’ or ‘no-file’ consumers who would like to borrow but for whom traditional data sources simply don’t exist.  Millennials feature in this group because their cohort is young and doesn’t yet have much borrowing or credit history.  But also in this group are the folks judged to be marginal credit risks, some of whom could be good customers if only we knew how to judge the risk.


Enter the World of Alternative Data

‘Alternative data’ is considered to be any data not directly related to the consumer’s credit behavior, basically anything other than the application data and consumer credit bureau data.  A variety of agencies are prepared to provide it and it can include:

  1. Transaction data (e.g. checking account data)
  2. Telecom/utility/rent data
  3. Social profile data
  4. Social network data
  5. Clickstream data
  6. Audio and text data
  7. Survey data
  8. Mobile app data

As it turns out lenders have been embracing alternative data for the last several years and see real improvements in their credit models, particularly at the low end of the scores.  Even the CFPB has provisionally endorsed this to bring credit to the underserved.


From a Data Science Perspective

From a data science perspective, in this example we started out with on the order of 250 candidate features from ‘traditional data’, and now, using ‘alternative data’ we can add an additional 1,050 features.  What’s the first thing you do when you have 1,300 candidate variables?  You go through the steps necessary to identify only the most predictive variables and discard the rest.


Here’s Where It Gets Interesting

Lexis/Nexis, the provider of the alternative data, set out to demonstrate that a credit model built on all 1,300 features was superior to one built on only 250 traditional features.  The data was drawn from a full-file auto lending portfolio of just under 11 million instances.  You and I might have concluded that even 250 was too many but in order to keep the test rigorous they introduced these constraints. 

  1. The technique was limited to forward stepwise logistic regression. This provided clear univariate feedback on the importance of each variable.
  2. Only two models would be compared, one with the top 250 most predictive attributes and the other with all 1,300 attributes. This eliminated any bias from variable selection that might be introduced by the modeler.
  3. The variables for the 250-variable model were selected by ranking the predictive power of each variable’s correlation to the dependent variable. As it happened, all of the alternative variables fell outside the top 250, with the highest ranking 296th.
  4. The models were created with the same overall data prep procedures such as binning rules.
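How can a variable that ranks poorly on its own still matter a great deal once another variable is in the model? The sketch below is a toy illustration, not the Lexis/Nexis procedure: it uses ordinary least squares on a continuous target rather than logistic regression, and all three candidate variables are synthetic. One candidate carries the signal mixed with a nuisance term, one carries only the nuisance term, and one carries a weak, noisy copy of the signal.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs)
                      * sum((y - my) ** 2 for y in ys))


def residuals(ys, xs):
    """What is left of ys after a one-variable least-squares fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000
signal = [rng.gauss(0, 1) for _ in range(n)]
nuisance = [rng.gauss(0, 1) for _ in range(n)]
extra = [rng.gauss(0, 1) for _ in range(n)]

target = signal  # synthetic outcome we are trying to predict
candidates = {
    "x1_signal_plus_nuisance": [s + u for s, u in zip(signal, nuisance)],
    "x2_nuisance_only": nuisance,
    "x3_weak_signal": [s + 3 * e for s, e in zip(signal, extra)],
}

# Step A: rank candidates by univariate correlation with the target.
univariate_rank = sorted(
    candidates,
    key=lambda k: abs(pearson_r(candidates[k], target)),
    reverse=True,
)

# Step B: one forward-stepwise step. Keep the top variable, then pick the
# candidate with the largest partial correlation given that variable.
first = univariate_rank[0]
target_left = residuals(target, candidates[first])
second = max(
    (k for k in candidates if k != first),
    key=lambda k: abs(pearson_r(residuals(candidates[k], candidates[first]),
                                target_left)),
)

print("univariate ranking:", univariate_rank)
print("stepwise second pick:", second)
```

In this toy setup the nuisance-only variable comes last in the univariate ranking yet is chosen second by the stepwise step, because it lets the model subtract the nuisance out of the first variable. Pre-filtering on univariate rank would have discarded it before it ever had the chance.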


What Happened

As you might expect, the first and most important variable was the same for both models, but the models began to diverge at the second variable.  The second variable in the 1,300 model was actually 296th based on the earlier predictive power analysis.

When the model was completed the alternative data made up 25% of the model’s accuracy although none would have been included based on the top 250 predictive variables.

The KS (Kolmogorov-Smirnov) statistic was 4.3% better for the 1,300 model compared to the 250 model.
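For readers who have not met it, the KS statistic in credit scoring is usually taken as the largest gap between the cumulative score distributions of the accounts that stayed good and those that charged off. A minimal sketch, assuming that definition and using invented scores rather than anything from the study:

```python
def ks_statistic(scores_good, scores_bad):
    """Largest gap between the two empirical cumulative distributions."""
    thresholds = sorted(set(scores_good) | set(scores_bad))
    n_good, n_bad = len(scores_good), len(scores_bad)
    largest_gap = 0.0
    for t in thresholds:
        # Share of each group scoring at or below the threshold.
        cdf_good = sum(s <= t for s in scores_good) / n_good
        cdf_bad = sum(s <= t for s in scores_bad) / n_bad
        largest_gap = max(largest_gap, abs(cdf_good - cdf_bad))
    return largest_gap


# Hypothetical model scores, where a higher score means a safer borrower.
good_accounts = [0.6, 0.7, 0.8, 0.9]
bad_accounts = [0.2, 0.3, 0.7, 0.4]

ks = ks_statistic(good_accounts, bad_accounts)
print(f"KS = {ks:.2f}")
```

A KS of 0 means the model cannot tell the two groups apart, and a KS of 1 means it separates them perfectly, so a higher value for the 1,300 model indicates better separation of good and bad accounts.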


The Business Importance

The distribution of scores and charge-offs for each model was very similar, but in the bottom 5% of scores things changed.  There was a 6.4% increase in the number of predicted charge-offs in this bottom group.

Since the distributions are essentially the same, this can be seen as higher scores that might have been rated creditworthy migrating into the lowest categories of creditworthiness, allowing better decisions about denial or pricing based on risk.  Conversely, it appears that some of the lowest-rated borrowers were given a boost with the additional data.

That also translates to a competitive advantage for those using the alternative data compared to those who don’t.  You can see the original study here.


There are Four Lessons for Data Scientists Here

  1. Think outside the box and consider the value of a large number of variables when first developing or refining your model. It wasn’t until just a few years ago that the insurance industry started looking at alternative data and on the margin it has increased accuracy in important ways.  FICO published this chart showing the relative value of each category of alternative data strongly supporting using more variables.
  2. Be careful about using ‘tried and true’ variable selection techniques. In the Lexis/Nexis case starting the modeling process with variable selection based on univariate correlation with the dependent variable was misleading.  There are a variety of other techniques they could have tried.
  3. Depending on the amount of prep, it still may not be worthwhile expanding your variables so dramatically. More data always means more prep, which means more time, which in a commercial environment you may not have.  Still, be open to exploration.
  4. Adding ‘alternate source’ data to your decision making can be a two-edged sword. In India, measures as obscure as how often a user charges his cell phone or its average charge level have proven to be predictive.  In that credit-starved environment these innovative measures are welcomed when they provide greater access to credit.

On the other hand, just this week a major newspaper in England published an exposé of comparative auto insurance rates in which it discovered that individuals applying with a Hotmail account were paying as much as 7% more than those with Gmail accounts.  Apparently British insurers had found a legitimate correlation between risk and this alternative data.  It did not sit well with the public and the companies are now on the defensive.



About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist since 2001.  He can be reached at:

Selected Recent Articles from Top DSC Contributors – Part 4

This is a new series, featuring great content from our top contributors. Some of these articles are rather technical in nature, but many are business-oriented and written in simple English. The entire series consists of about 120 articles. We intend to publish a new set every two weeks or so. Click here to check out the previous edition. To read more articles from the same author, read one of his/her articles and click on his/her profile picture to access the full list. Some of these articles are curated or posted as guest blogs.

Selected Recent Articles from Top DSC Contributors

Source for picture: articles flagged with a +
