Next Generation Automated Machine Learning (AML)

Summary:  Automated Machine Learning has been around for only a little over two years and already there are over 20 providers in this space.  However, a European AML platform called Tazi, newly arrived in the US, is showing what the next generation of AML will look like.


I’ve been a follower and a fan of Automated Machine Learning (AML) since it first appeared in the market about two years ago.  I wrote an article on all five of the market entrants I could find at the time under the somewhat scary title Data Scientists Automated and Unemployed by 2025!.

As time passed I tried to keep up with the new entrants.  A year later I wrote a series of update articles (links at the end of this article) that identified 7 competitors and 3 open source code packages.  It was growing.

I’ve continued to try to keep track and I’m sure the count is now over 20 including at least three new entrants I just discovered at the March Strata Data conference in San Jose.  I’m hoping that Gartner will organize all of this into a separate category soon.

Now that there are so many claimants, I do want to point out that the ones I think really earn this title are the ones that come closest to the original description of one-click-data-in-model-out.  A whole host of providers offer semi-AML, in which a number of steps are simplified but which is not truly one-click in simplicity and completeness.


What’s the Value?

Originally the claim was that these might completely replace data scientists.  The marketing pitch was that the new Citizen Data Scientists (CDS), composed mostly of motivated but untrained LOB managers and business analysts, could simply step in and produce production-quality models.  Those of us on this side of the wizard’s curtain understood that was never going to be the case. 

There is real value here though, and it’s in efficiency and effectiveness: allowing a few of those rare and expensive data scientists to do the work that used to require many, in a fraction of the time.  Early adopters with real data science departments like Farmers Insurance are using it in exactly this way.


The Minimum for Entry

To be part of this club the minimum functionality is the ability to automatically run many different algorithms in parallel, auto tune hyperparameters, and select and present a champion model for implementation.

This also means requiring as close to one-click model building as you can get.  While there might be expert modes, the fully automated mode should benchmark well against the fully manual efforts of a data science team.
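As a concrete illustration, the core loop can be sketched in a few lines of scikit-learn.  This is a minimal toy sketch of the idea, not any vendor's actual implementation; the candidate algorithms and parameter grids are arbitrary choices:

```python
# Minimal sketch of the AML core loop: try several algorithm families,
# auto-tune each one's hyperparameters, and keep the champion model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "glm": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [3, 5, None]}),
    "boosting": (GradientBoostingClassifier(random_state=0), {"n_estimators": [50, 100]}),
}

champion, best_score = None, -1.0
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=5)      # auto-tune hyperparameters
    search.fit(X, y)
    if search.best_score_ > best_score:
        champion, best_score = (name, search.best_estimator_), search.best_score_

print(champion[0], round(best_score, 3))              # champion algorithm and CV score
```

A real AML platform runs these searches in parallel over far larger algorithm and hyperparameter spaces, but the champion-selection logic is essentially this.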

For most entrants, this has meant presenting the AML platform with a fully scrubbed flat analytic file with all the data prep and feature engineering already completed.  DataRobot (perhaps the best known, or at least the most successfully promoted), MLJAR, and PurePredictive are in this group.  Some have added front ends with the ability to extract, blend, clean, and even feature engineer.  Xpanse Analytics and TIMi are among the few with this feature.


New Next Generation Automated Machine Learning

At Strata San Jose in March I encountered a new entrant that, after a long demonstration and a conversation with its data scientist founders, really appears to have broken through into the next generation of AML.

Ordinarily DataScienceCentral does not publish articles featuring just a single product or company, so please understand that this is not an endorsement.  However, I was so struck by the combination and integration of features that I think Tazi is the first in what will become the new paradigm for AML platforms.

Tazi has been in the European market for about two years, recently received a nice round of funding, and opened offices in San Francisco.

What’s different about Tazi’s approach is not any particularly unique new feature but rather the way they have fully integrated the best concepts from all around our profession into the Swiss Army knife of AML.  The value of the Swiss Army knife, after all, is not just that it has a blade, can opener, corkscrew, screwdriver, etc.  The breakthrough is that all these features fit together so well in just one place that we don’t have to carry around all the pieces.

Here’s a high level explanation of the pieces and the way Tazi has integrated them so nicely.


Core Automated Algorithms and Tuning

Tazi of course has the one-click central core that runs multiple algorithms in parallel and presents the champion model in a nice UI.  Hyperparameter tuning is fully automated.

The platform will take structured and unstructured data including text, numbers, time-series, networked-data, and features extracted from video.

As you might expect, this means that the candidate algorithms range from simple GLMs and decision trees to much more advanced boosting and ensemble types and DNNs.  This also means that some of these are going to run efficiently on CPUs but some are only going to be realistically possible on GPUs.

As data volume and features scale up, Tazi can take advantage of IBM and NVIDIA’s Power8, GPU, and NVLink based architecture, which allows super-fast communication between CPUs and GPUs working together in very efficient MPP.


Problem Types

Tazi is designed to cover the spectrum of business problem types divided into three broad categories:

  • Typical consumer propensity classification problems like churn, non-performing loan prediction, collection risk, next best offer, cross sell, and other common classification tasks.
  • Time series regression value predictions such as profitability, demand, or price optimization.
  • Anomaly detection including fraud, predictive maintenance, and other IoT-type problems.


Streaming Data and Continuous Learning

The core design is based on streaming data making the platform easy to use for IoT type problems.  Don’t need streaming capability?  No problem.  It accepts batch as well.

However, one of the most interesting features of Tazi is that its developers have turned the streaming feature into Continuous Learning.  If it’s easier, think of this as Continuous Updating.  Using atomic level streaming data (as opposed to mini-batch or sliding window), each new data item received is immediately processed as part of a new data set to continuously update the models.

If you are thinking fraud, preventive maintenance, or medical monitoring you have an image of very high data throughput from many different sources or sensors, which Tazi is equipped to handle.  But even in relatively slow-moving B2B or brick-and-mortar retail, continuous updating can mean that every time a customer order is received, a new customer registers, or a new return is processed, those less frequent events will be fed immediately into the model update process.

If the data drifts sufficiently, this may mean that an entirely new algorithm will become the champion model.  More likely, with gradual drift, the parameters and variable weightings will change indicating an update is necessary.
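The continuous-updating pattern can be seen in miniature with scikit-learn's `partial_fit`.  Tazi's internals are not public, so this only illustrates the per-record update idea on a simulated stream:

```python
# Sketch of continuous (per-record) model updating with an incremental
# learner: each arriving record immediately updates the model.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(random_state=0)
classes = np.array([0, 1])

# Simulate an atomic-level stream of labeled records.
for _ in range(1000):
    x = rng.normal(size=(1, 3))
    y = np.array([int(x[0, 0] + x[0, 1] > 0)])        # true underlying rule
    model.partial_fit(x, y, classes=classes)          # one-record update

# After the stream, the model should track the rule reasonably well.
X_test = rng.normal(size=(200, 3))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)
print(round(model.score(X_test, y_test), 2))
```

If the underlying rule drifts, the same loop keeps adapting because every new record re-weights the model; that is the "Continuous Learning" idea in its simplest form.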


Feature Engineering and Feature Selection

Like any good streaming platform, Tazi directly receives and blends source data.  It also performs the cleaning and prep, and performs feature engineering and feature selection.

For the most part, all the platforms that offer automated feature engineering execute by taking virtually all the date differences, ratios between values, time trends, and all the other statistical methods you might imagine to create a huge inventory of potential engineered variables.  Almost all of these, which can measure in the thousands, will be worthless.

So the real test is to produce these quickly and also to run selection algos on them quickly to focus in on those that are actually important.  Remember this is happening in a real time streaming environment where each new data item means a model update.
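A toy version of this generate-then-select pattern is easy to sketch; the pairwise-ratio generator and the `SelectKBest` univariate screen below are illustrative choices, not any vendor's method:

```python
# Sketch of brute-force feature generation followed by fast selection:
# create many candidate ratio features, then keep only the informative ones.
import numpy as np
from itertools import combinations
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.uniform(0.1, 1.0, size=(300, 6))
y = (X[:, 0] / X[:, 1] > 1.0).astype(int)             # target depends on one ratio

# Generate every pairwise ratio -- most will be worthless, as noted above.
names, cols = [], []
for i, j in combinations(range(X.shape[1]), 2):
    names.append(f"x{i}/x{j}")
    cols.append(X[:, i] / X[:, j])
candidates = np.column_stack(cols)

# Quick univariate screen to focus on the features that actually matter.
selector = SelectKBest(f_classif, k=3).fit(candidates, y)
kept = [n for n, keep in zip(names, selector.get_support()) if keep]
print(kept)
```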

Automated feature engineering may never be perfect.  It’s likely that the domain knowledge of SMEs may always add some valuable new predictive variable.  Tazi says however that many times its automated approach has identified engineered variables that LOB SMEs had not previously considered.


Interpretability and the User Presentation Layer

Many of the early adopters in AML are in the regulated industries, where the number of models is very large and the requirement for interpretability always forces a tradeoff between accuracy and explainability.

Here, Tazi has cleverly borrowed a page from the manual procedures many insurance companies and lenders have been using. 

Given that the most accurate model may be the product of a non-explainable procedure like DNNs, boosting, or large ensembles, how can most of this accuracy be retained in a vastly simplified explainable decision tree or GLM model?

The manual procedure in use for some time in regulated industries is to take the scoring from the complex more accurate model, along with its enhanced understanding of and engineering of variables, and use this to train an explainable simplified model.  Some of the accuracy will be lost but most can be retained.
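This teacher/student pattern is easy to sketch.  Here a boosted ensemble plays the complex model and a shallow decision tree the explainable one; this is a generic illustration of the procedure, not Tazi's specific implementation:

```python
# Sketch of the "train a simple model on the complex model's scores" trick:
# a boosted ensemble is the teacher, a shallow tree the explainable student.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

teacher = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Train the interpretable student to mimic the teacher's labels, not the raw ones.
student = DecisionTreeClassifier(max_depth=4, random_state=0)
student.fit(X_tr, teacher.predict(X_tr))

print(round(teacher.score(X_te, y_te), 3), round(student.score(X_te, y_te), 3))
```

Some accuracy is lost in the student, but most is retained, and the resulting tree is small enough to explain to a regulator or an LOB user.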

Tazi takes this a step further.  Understanding that the company’s data science team is one thing but users are another, they have created a Citizen Data Scientist / LOB Manager / Analyst presentation layer containing the simplified model which these users can explore and actually manipulate. 

Tazi elects to visualize this mostly as sunburst diagrams that non-data scientists can quickly learn to understand and explore.


Finally, the sunburst of the existing production model is displayed next to the newly created proposed model.  LOB users are allowed to actually manipulate and change the leaves and nodes to make further intentional tradeoffs between accuracy and explainability. 

Perhaps most importantly, LOB users can explore the importance of variables and become comfortable with any new logic.  They also can be enabled to directly approve and move a revised model into production if that’s appropriate.

This separation of the data science layer from the explainable Citizen Data Scientist layer is one of Tazi’s most innovative contributions.


Implementation by API or Code

Production scoring can occur via API within the Tazi platform or Scala code of the model can be output.  This is particularly important in regulated industries that must be able to roll back their systems to see how scoring was actually occurring on some historical date in question.


Active, Human-in-the-Loop Learning

Review and recoding of specific scored instances by human workers, also known as active learning, is a well acknowledged procedure.  Crowdflower for example has built an entire service business around this, mostly in image and language understanding.

However, the other domain where this is extremely valuable is in anomaly detection and the correct classification of very rare events.

Tazi has an active learning module built directly into the platform so that specified questionable instances can be reviewed by a human and the corrected scoring fed back into the training set to improve model performance.  This eliminates the requirement of extracting these items to be reviewed and moving them on to a separate platform like Crowdflower.
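A minimal uncertainty-sampling loop captures the idea; the selection rule and batch size here are illustrative, not Tazi's actual module:

```python
# Sketch of a human-in-the-loop step: flag the model's least-confident
# predictions for review, then fold the corrected labels back into training.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[:50] = True                                   # start with a small labeled set

model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

# Pick the 10 instances the model is least sure about (probability nearest 0.5).
proba = model.predict_proba(X[~labeled])[:, 1]
uncertain = np.argsort(np.abs(proba - 0.5))[:10]
to_review = np.flatnonzero(~labeled)[uncertain]

# A human reviewer supplies the corrected labels; here the oracle is just `y`.
labeled[to_review] = True
model.fit(X[labeled], y[labeled])                     # retrain with corrections
print(labeled.sum())
```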

This semi-supervised learning is a well acknowledged procedure in improving anomaly detection.


Like the Swiss Army Knife – Nicely Integrated

There is much to like here in how all these disparate elements have been brought together and integrated.  It doesn’t necessarily take a breakthrough in algorithms or data handling to create a new generation.  Sometimes it’s just a well-thought-out, wide range of features and capabilities, even if there’s no corkscrew.


Previous articles on Automated Machine Learning

More on Fully Automated Machine Learning  (August 2017)

Automated Machine Learning for Professionals  (July 2017)

Data Scientists Automated and Unemployed by 2025 – Update!  (July 2017)

Data Scientists Automated and Unemployed by 2025!  (April 2016)



About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist since 2001.  He can be reached at:


GDPR and the Paradox of Interpretability

Summary:  GDPR carries many new data and privacy requirements including a “right to explanation”.  On the surface this appears to be similar to US rules for regulated industries.  We examine why this is actually a penalty and not a benefit for the individual and offer some insight into the actual wording of the GDPR regulation which also offers some relief.


GDPR is now just about 60 days away and there’s plenty to pay attention to especially in getting and maintaining permission to use a subscriber’s data.  It’s a tough problem.

If you’re an existing social media platform like Facebook that’s a huge change to your basic architecture.  If you’re just starting out in the EU there are some new third party offerings that promise to keep track of things for you (Integris, Kogni, and Waterline all emphasized this feature at the Strata Data San Jose conference this month).

The item that we keep coming back to however is the “right to explanation”.

In the US we’re not strangers to this requirement; companies in regulated industries like lending or insurance already bear this burden.  US regulators have been pretty specific that this means restricting machine learning techniques to the simplest regressions and decision trees, which have the characteristic of being easy to understand and are therefore judged to be ‘interpretable’.

Recently University of Washington professor and leading AI researcher Pedro Domingos created more than a little controversy when he tweeted “Starting May 25, the European Union will require algorithms to explain their output, making deep learning illegal”.  Can this be true?


Trading Accuracy for Interpretability Adds Cost

The most significant issue is that restricting ourselves to basic GLM and decision trees directly trades accuracy for interpretability.  As we all know, very small changes in model accuracy can leverage into much larger increases in the success of different types of campaigns and decision criteria.  Intentionally forgoing the benefit of that incremental accuracy imposes a cost on all of society.

We just barely dodged the bullet of ‘disparate impact’ that was to have been a cornerstone of new regulation proposed by the last administration.  Fortunately those proposed regs were abandoned as profoundly unscientific.

Still the costs of dumbing down analytics keep coming.  Basel II and Dodd-Frank place a great deal of emphasis on financial institutions constantly evaluating and adjusting their capital requirements for risk of all sorts. 

This has become so important that larger institutions have had to establish independent Model Validation Groups (MVGs) separate from their operational predictive analytics operation whose sole role is to constantly challenge whether the models in use are consistent with regulations.  That’s a significant cost of compliance.


The Paradox:  Increasing Interpretability Can Reduce Individual Opportunity

Here’s the real paradox.  As we use less accurate techniques to model, that inaccuracy actually excludes some individuals who would have been eligible for credit, insurance, a loan, or other regulated item, and includes some other individuals whose risk should have disqualified them from selection.  The latter increases the rate of bad debt and other costs of bad decisions, which gets reflected in everyone’s rates.

At the beginning of 2017, Equifax, the credit rating company, quantified this opportunity/cost imbalance.  Comparing the mandated simple models to modern deep learning techniques, they reexamined the last 72 months of their data and decisions.

Peter Maynard, Senior Vice President of Global Analytics at Equifax says the experiment improved model accuracy 15% and reduced manual data science time by 20%. 


The ‘Right to Explanation’ is Really No Benefit to the Individual

Regulators apparently think that rejected consumers should be consoled by this proof that the decision was fair and objective.

However, if you think through to the next step, what is the individual’s recourse?  The factors in any model are objective, not subjective.  It’s your credit rating, your income, your driving history, all facts that you cannot change immediately in order to qualify.

So the individual who has exercised this right gained nothing in terms of immediate access to the product they desired, and quite possibly lost out on qualifying had a more accurate modeling technique been used.

Peter Maynard of Equifax goes on to say that after reviewing the last two years of data in light of the new model they found many declined loans that could have been made safely. 


Are We Stuck With the Simplest Models?

Data scientists in regulated industries have been working this issue hard.  There are some specialized regression techniques like Penalized Regression, Generalized Additive Models, and Quantile Regression all of which yield somewhat better and still interpretable results.
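Penalized regression is the easiest of these to demonstrate.  In this sketch (a generic illustration on simulated data), an L1 penalty shrinks the coefficients of irrelevant variables to exactly zero, leaving a model that is both competitive and easy to read:

```python
# Sketch of penalized (lasso) regression: the L1 penalty zeroes out weak
# coefficients, keeping the fitted model compact and interpretable.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)  # two real drivers

model = Lasso(alpha=0.1).fit(X, y)
kept = np.flatnonzero(model.coef_)                    # features the penalty retained
print(kept)
```

Only the genuinely predictive variables survive, which is exactly the property a regulator asking for an explanation wants to see.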

This last summer, a relatively new technique called RuleFit Ensemble Models was gaining prominence and also promised improvement.

Those same data scientists have also been clever about using black box deep neural nets first to model the data, achieving the most accurate models, and then using those scores and insights to refine and train simpler techniques.

Finally, that same Equifax study quoted above also resulted in a proprietary technique to make deep neural nets explainable.  Apparently Equifax has persuaded some of their regulators to accept this new technique but are so far keeping it to themselves.  Perhaps they’ll share.


The GDPR “Right to Explanation” Loophole

Thanks to an excellent blog by Sandra Wachter, an EU lawyer and Research Fellow at Oxford, we discover that the “right to explanation” may not be all that it seems.

It seems that a legal interpretation of “right to explanation” is not the same as in the US.  In fact, per Wachter’s blog, “the GDPR is likely to only grant individuals information about the existence of automated decision-making and about “system functionality”, but no explanation about the rationale of a decision.”

Wachter goes on to point out that “right to explanation” was written into a section called Recital 71 which is important because that section is meant as guidance but carries no legal basis to establish stand-alone rights.  Wachter observes that this placement appears intentional indicating that legislators did not want to make this a right on the same level as other elements of the GDPR.


Are We Off the Hook for Individual Level Explanations?

At least in the EU, the legal reading of “right to explanation” seems to give us a clear pass.  Will that hold?  That’s completely up to the discretion of the EU legislators, but at least for now, as written, “right to explanation” should not be a major barrier to operations in the post GDPR world.




What Comes After Deep Learning

Summary: We’re stuck.  There hasn’t been a major breakthrough in algorithms in the last year.  Here’s a survey of the leading contenders for that next major advancement.


We’re stuck.  Or at least we’re plateaued.  Can anyone remember the last time a year went by without a major notable advance in algorithms, chips, or data handling?  It was so unusual to go to the Strata San Jose conference a few weeks ago and see no new eye catching developments.

As I reported earlier, it seems we’ve hit maturity and now our major efforts are aimed at either making sure all our powerful new techniques work well together (converged platforms) or making a buck from those massive VC investments in same.

I’m not the only one who noticed.  Several attendees and exhibitors said very similar things to me.  And just the other day I had a note from a team of well-regarded researchers who had been evaluating the relative merits of different advanced analytic platforms and concluded there weren’t any differences worth reporting.


Why and Where are We Stuck?

Where we are right now is actually not such a bad place.  Our advances over the last two or three years have all been in the realm of deep learning and reinforcement learning.  Deep learning has brought us terrific capabilities in processing speech, text, image, and video.  Add reinforcement learning and we get big advances in game play, autonomous vehicles, robotics and the like.

We’re in the earliest stages of a commercial explosion based on these: the huge savings from customer interactions through chatbots; new personal convenience apps like personal assistants and Alexa; and level 2 automation in our personal cars like adaptive cruise control, accident avoidance braking, and lane maintenance.

Tensorflow, Keras, and the other deep learning platforms are more accessible than ever, and thanks to GPUs, more efficient than ever.

However, the known list of deficiencies hasn’t moved at all.

  1. The need for too much labeled training data.
  2. Models that take either too long or too many expensive resources to train and that still may fail to train at all.
  3. Hyperparameters especially around nodes and layers that are still mysterious. Automation or even well accepted rules of thumb are still out of reach.
  4. Transfer learning that means only going from the complex to the simple, not from one logical system to another.

I’m sure we could make a longer list.  It’s in solving these major shortcomings where we’ve become stuck.


What’s Stopping Us

In the case of deep neural nets the conventional wisdom right now is that if we just keep pushing, just keep investing, then these shortfalls will be overcome.  For example, from the 80’s through the 00’s we knew how to make DNNs work, we just didn’t have the hardware.  Once that caught up then DNNs combined with the new open source ethos broke open this new field.

All types of research have their own momentum.  Especially once you’ve invested huge amounts of time and money in a particular direction you keep heading in that direction.  If you’ve invested years in developing expertise in these skills you’re not inclined to jump ship. 


Change Direction Even If You’re Not Entirely Sure What Direction that Should Be

Sometimes we need to change direction, even if we don’t know exactly what that new direction might be.  Recently leading Canadian and US AI researchers did just that.  They decided they were misdirected and needed to essentially start over.

This insight was verbalized last fall by Geoffrey Hinton who gets much of the credit for starting the DNN thrust in the late 80s.  Hinton, who is now a professor emeritus at the University of Toronto and a Google researcher, said he is now deeply suspicious of back propagation, the core method that underlies DNNs.  Observing that the human brain doesn’t need all that labeled data to reach a conclusion, Hinton says “My view is throw it all away and start again”.

So with this in mind, here’s a short survey of new directions that fall somewhere between solid probabilities and moon shots, but are not incremental improvements to deep neural nets as we know them.

These descriptions are intentionally short and will undoubtedly lead you to further reading to fully understand them.


Things that Look Like DNNs but are Not

There is a line of research closely hewing to Hinton’s shot at back propagation that believes that the fundamental structure of nodes and layers is useful but the methods of connection and calculation need to be dramatically revised.


Capsule Networks (CapsNet)

It’s only fitting that we start with Hinton’s own current new direction in research, CapsNet.  This relates to image classification with CNNs and the problem, simply stated, is that CNNs are insensitive to the pose of the object.  That is, if the same object is to be recognized with differences in position, size, orientation, deformation, velocity, albedo, hue, texture etc. then training data must be added for each of these cases.

In CNNs this is handled with massive increases in training data and/or increases in max pooling layers that can generalize, but only by losing actual information.

The following description comes from one of many good technical descriptions of CapsNets, this one from Hackernoon.

Capsule is a nested set of neural layers. So in a regular neural network you keep on adding more layers. In CapsNet you would add more layers inside a single layer. Or in other words nest a neural layer inside another. The state of the neurons inside a capsule capture the above properties of one entity inside an image. A capsule outputs a vector to represent the existence of the entity. The orientation of the vector represents the properties of the entity. The vector is sent to all possible parents in the neural network. Prediction vector is calculated based on multiplying its own weight and a weight matrix. Whichever parent has the largest scalar prediction vector product, increases the capsule bond. Rest of the parents decrease their bond. This routing by agreement method is superior to the current mechanism like max-pooling.

CapsNet dramatically reduces the required training set and shows superior performance in image classification in early tests.
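The routing-by-agreement step described above can be sketched in plain NumPy.  This toy follows the quoted description only, not Hinton's full CapsNet architecture, and the capsule counts and dimensions are arbitrary:

```python
# Toy sketch of "routing by agreement" between one layer of capsules and
# its candidate parents: agreement between a capsule's prediction vector
# and a parent's output strengthens that bond.
import numpy as np

def squash(s):
    # Nonlinearity that keeps vector lengths in (0, 1) while preserving direction.
    norm2 = np.sum(s**2, axis=-1, keepdims=True)
    return (norm2 / (1 + norm2)) * s / np.sqrt(norm2 + 1e-9)

rng = np.random.default_rng(0)
n_in, n_out, dim = 8, 3, 4
u_hat = rng.normal(size=(n_in, n_out, dim))           # prediction vectors (W_ij @ u_i)

b = np.zeros((n_in, n_out))                           # routing logits
for _ in range(3):                                    # routing iterations
    c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # coupling coefficients
    s = np.einsum("ij,ijd->jd", c, u_hat)             # weighted sum per parent
    v = squash(s)                                     # parent capsule outputs
    b += np.einsum("ijd,jd->ij", u_hat, v)            # agreement strengthens the bond

print(v.shape)                                        # one output vector per parent
```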



gcForest

In February we featured research by Zhi-Hua Zhou and Ji Feng of the National Key Lab for Novel Software Technology, Nanjing University, describing a technique they call gcForest.  Their research paper shows that gcForest regularly beats CNNs and RNNs at both text and image classification.  The benefits are quite significant.

  • Requires only a fraction of the training data.
  • Runs on your desktop CPU device without need for GPUs.
  • Trains just as rapidly and in many cases even more rapidly and lends itself to distributed processing.
  • Has far fewer hyperparameters and performs well on the default settings.
  • Relies on easily understood random forests instead of completely opaque deep neural nets.

In brief, gcForest (multi-Grained Cascade Forest) is a decision tree ensemble approach in which the cascade structure of deep nets is retained but where the opaque edges and node neurons are replaced by groups of random forests paired with completely-random tree forests.  Read more about gcForest in our original article.
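A toy cascade in scikit-learn conveys the structure.  Note that the real gcForest uses cross-validated class probabilities and multi-grained scanning, both of which this sketch omits:

```python
# Toy sketch of a gcForest-style cascade: each level's forests append their
# class-probability outputs as new features for the next level.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

level_tr, level_te = X_tr, X_te
for _ in range(2):                                    # two cascade levels
    rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(level_tr, y_tr)
    et = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(level_tr, y_tr)
    # Augment the original features with each forest's probability vector.
    level_te = np.hstack([X_te, rf.predict_proba(level_te), et.predict_proba(level_te)])
    level_tr = np.hstack([X_tr, rf.predict_proba(level_tr), et.predict_proba(level_tr)])

final = RandomForestClassifier(n_estimators=50, random_state=0).fit(level_tr, y_tr)
print(round(final.score(level_te, y_te), 3))
```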


Pyro and Edward

Pyro and Edward are two new programming languages that merge deep learning frameworks with probabilistic programming.  Pyro is the work of Uber and Google, while Edward comes out of Columbia University with funding from DARPA.  The result is a framework that allows deep learning systems to measure their confidence in a prediction or decision.

In classic predictive analytics we might approach this by using log loss as the fitness function, penalizing confident but wrong predictions (false positives).  So far there’s been no corollary for deep learning.
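For reference, log loss itself is simple to compute, and the asymmetry is the whole point: a confident wrong prediction is penalized far more heavily than a hesitant one:

```python
# Log loss penalizes confident-but-wrong predictions far more than hesitant
# ones -- the classic predictive-analytics proxy for model "confidence".
import math

def log_loss(y_true, p):
    # Clip to avoid log(0) for pathologically confident predictions.
    p = min(max(p, 1e-15), 1 - 1e-15)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

confident_wrong = log_loss(1, 0.01)   # very sure, and wrong
hesitant_wrong = log_loss(1, 0.40)    # unsure, and wrong
print(round(confident_wrong, 2), round(hesitant_wrong, 2))
```

Probabilistic programming extends this idea from a training penalty to a full posterior distribution over the prediction itself.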

Where this promises to be of use, for example, is in self-driving cars or aircraft, allowing the control system to have some sense of confidence or doubt before making a critical or potentially catastrophic decision.  That’s certainly something you’d like your autonomous Uber to know before you get on board.

Both Pyro and Edward are in the early stages of development.


Approaches that Don’t Look Like Deep Nets

I regularly run across small companies who have very unusual algorithms at the core of their platforms.  In most of the cases that I’ve pursued they’ve been unwilling to provide sufficient detail to allow me to even describe for you what’s going on in there.  This secrecy doesn’t invalidate their utility but until they provide some benchmarking and some detail, I can’t really tell you what’s going on inside.  Think of these as our bench for the future when they do finally lift the veil.

For now, the most advanced non-DNN algorithm and platform I’ve investigated is this:


Hierarchical Temporal Memory (HTM)

Hierarchical Temporal Memory (HTM) uses Sparse Distributed Representations (SDRs) to model the neurons in the brain and to perform calculations that outperform CNNs and RNNs at scalar predictions (future values of things like commodity, energy, or stock prices) and at anomaly detection.

This is the devotional work of Jeff Hawkins of Palm Pilot fame in his company Numenta.  Hawkins has pursued a strong AI model based on fundamental research into brain function that is not structured with layers and nodes as in DNNs.

HTM has the characteristic that it discovers patterns very rapidly, with as few as roughly 1,000 observations.  This compares with the hundreds of thousands or millions of observations necessary to train CNNs or RNNs.

Also the pattern recognition is unsupervised and can recognize and generalize about changes in the pattern based on changing inputs as soon as they occur.  This results in a system that not only trains remarkably quickly but also is self-learning, adaptive, and not confused by changes in the data or by noise.
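The SDR idea itself is simple to sketch in NumPy: patterns are long, mostly zero bit vectors, and similarity is just the count of overlapping active bits, which makes matching extremely noise-tolerant.  The vector size and sparsity below follow typical HTM settings but are otherwise arbitrary:

```python
# Toy sketch of the Sparse Distributed Representations behind HTM.
import numpy as np

rng = np.random.default_rng(0)
N, ACTIVE = 2048, 40                                  # ~2% sparsity, as in HTM

def sdr(rng):
    v = np.zeros(N, dtype=np.int8)
    v[rng.choice(N, size=ACTIVE, replace=False)] = 1
    return v

a = sdr(rng)
noisy_a = a.copy()
noisy_a[np.flatnonzero(a)[:5]] = 0                    # corrupt 5 of a's active bits
b = sdr(rng)                                          # unrelated random pattern

overlap = lambda x, y: int(np.sum(x & y))
print(overlap(a, noisy_a), overlap(a, b))             # noisy copy still overlaps strongly
```

Because two random SDRs almost never share many active bits, a partial match is strong evidence of the same underlying pattern, which is why HTM tolerates noise so well.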

We featured HTM and Numenta in our February article and we recommend you read more about it there.


Some Incremental Improvements of Note

We set out to focus on true game changers but there are at least two examples of incremental improvement that are worthy of mention.  These are clearly still classical CNNs and RNNs with elements of back prop but they work better.


Network Pruning with Google Cloud AutoML

Google and Nvidia researchers use a process called network pruning to make a neural network smaller and more efficient to run by removing the neurons that do not contribute directly to output. This advancement was rolled out recently as a major improvement in the performance of Google’s new AutoML platform.
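Magnitude-based weight pruning, a simpler cousin of the neuron-level pruning described above, can be sketched in a few lines; this is a generic illustration, not Google's actual procedure:

```python
# Sketch of magnitude-based pruning: zero out the weights that contribute
# least, shrinking the network with little effect on its output.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(100, 100))                       # one dense layer's weights

def prune(W, fraction):
    threshold = np.quantile(np.abs(W), fraction)
    return np.where(np.abs(W) < threshold, 0.0, W)

W_pruned = prune(W, 0.5)                              # drop the smallest 50%
sparsity = float(np.mean(W_pruned == 0))
print(round(sparsity, 2))
```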



Transformer Networks

Transformer is a novel approach, useful initially in language processing such as language-to-language translation, which has been the domain of CNNs, RNNs, and LSTMs.  Released late last summer by researchers at Google Brain and the University of Toronto, it has demonstrated significant accuracy improvements in a variety of tests, including an English/German translation test. 

The sequential nature of RNNs makes it more difficult to fully take advantage of modern fast computing devices such as GPUs, which excel at parallel and not sequential processing.  CNNs are much less sequential than RNNs, but in CNN architectures the number of steps required to combine information from distant parts of the input still grows with increasing distance.

The accuracy breakthrough comes from the development of a ‘self-attention function’ that significantly reduces steps to a small, constant number of steps. In each step, it applies a self-attention mechanism which directly models relationships between all words in a sentence, regardless of their respective position.
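The mechanism reduces to a few matrix operations.  Here is a NumPy sketch of single-head scaled dot-product self-attention with arbitrary random weights:

```python
# Sketch of scaled dot-product self-attention: every word attends to every
# other word in a single step, regardless of their distance in the sentence.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # all word pairs at once
    return softmax(scores) @ V                        # attention-weighted values

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))                          # 6 "words", 16-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)
```

Because the score matrix covers all word pairs in one matrix multiply, the number of sequential steps stays constant as sentence length grows, which is what makes the approach so GPU-friendly.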

Read the original research paper here.


A Closing Thought

If you haven’t thought about it, you should be concerned at the massive investment China is making in AI and its stated goal to overtake the US as the AI leader within a very few years. 

In an article, Steve LeVine, who is Future Editor at Axios and teaches at Georgetown University, makes the case that China may be a fast follower but will probably never catch up.  The reason: US and Canadian researchers are free to pivot and start over anytime they wish; the institutionally guided Chinese could never do that.  This quote is from LeVine’s article:

“In China, that would be unthinkable,” said Manny Medina, a CEO in Seattle.  AI stars like Facebook’s Yann LeCun and the Vector Institute’s Geoff Hinton in Toronto, he said, “don’t have to ask permission. They can start research and move the ball forward.”

As the VCs say, maybe it’s time to pivot.


About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist since 2001.



Make Your Chatbot Smarter by Talking to It


Summary:  A major problem with chatbots is that they can only provide information from what’s in their knowledge base.  Here’s a new approach that makes your chatbot smarter with every question it can’t answer, making it a self-learning lifelong learner.


If you’ve been keeping up with the explosive growth in chatbots you probably already know that there are two basic architectures:

Rules-based Chatbots representing over 90% of what’s currently available.  They are relatively simple and fast to build, with decision-tree or waterfall-like logic structures of predefined queries and responses. 

AI Chatbots use deep learning engines to formulate responses.  They do not have rigidly defined structures and are able to learn conversational responses after some initial training.

And while the application of NLU in both cases is exceedingly clever, they both have an Achilles heel – they don’t know what they don’t know.  That is to say that a chatbot can only respond based on what’s in its knowledge base (KB).


Many chatbot applications can be made very effective with limited content.  If you’re building a recommender it can access your full library of the products you offer along with their prices and specifications.  If you’re reserving an airline flight, there are a finite number of offerings available between any two cities.  But as you expand outward from these limited applications to very sophisticated question answering machines (like Alexa and Siri may one day become) then your knowledge base becomes extremely large, perhaps even encompassing the entire web.  Right now that doesn’t work.  However, that’s where we want to go.

This isn’t simple search, where your chatbot can respond with a long list of things you may have meant.  Like IBM’s Watson, it needs to respond with the single most correct answer.  And that answer had better make sense or your users won’t come back.


How Your Chatbot Finds the Right Answer

Matching your user’s query with facts in the knowledge base is complex but reasonably well developed.  The procedure is formally known as ‘Context Aware Path Ranking’ or C-PR for short. 

That procedure creates both a knowledge graph and a knowledge matrix linking a ‘source entity’ (what your user asked about) with the ‘target entity’ (what’s in the knowledge base) through a logical relationship (like ‘is found in’ or ‘is a member of’ or ‘is commonly used together’).  In fact it is expressed as a ‘query triplet’ shown as (s, r, t) where s is the source entity, r the relationship, and t the target entity.
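A toy illustration of the query-triplet idea, with the knowledge base held as a set of (s, r, t) tuples. The entities and relations here are invented examples, not from the paper.

```python
# A tiny knowledge base of (source, relation, target) triplets.
kb = {
    ("python", "is_a", "programming language"),
    ("pandas", "is_used_with", "python"),
    ("pandas", "is_a", "library"),
}

def answer(source: str, relation: str) -> list:
    """Return every target t such that (source, relation, t) is in the KB."""
    return sorted(t for (s, r, t) in kb if s == source and r == relation)

print(answer("pandas", "is_a"))        # ['library']
print(answer("python", "is_a"))        # ['programming language']
```

Real systems replace this exact-match lookup with path ranking over the knowledge graph, but the triplet is still the unit of storage and query.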

You can see intuitively that if your version of Alexa is meant to be able to answer every question that could possibly come to mind that the knowledge graph and matrix would be impossibly large.  So just as Alexa and Siri are intentionally limited in their domain knowledge, the question remains how to push outward into larger and larger areas of knowledge.


What Do Your Users Really Want to Know

This question determines how big your knowledge base should be.  It’s unlikely that your users will want to ask absolutely anything, based on how you’ve positioned your chatbot.  One approach might be to try to define this in advance, that is, during development, to figure out just how big that knowledge base needs to be.  But history shows that however carefully we plan, we always miss something: we’ll include data that’s not important and exclude data our users want to know.

A pretty obvious solution is to take note of what users ask that we can’t answer and add that to the knowledge base.  In research this is called KBC, Knowledge Base Completion.  And these techniques work but only if all three elements of the query triplet ((s, r, t) where s is the source entity, r the relationship, and t the target entity) already exist in your knowledge base but simply haven’t been mapped yet.  Map them on the knowledge graph and the knowledge matrix and they’re available to your users.
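That precondition of classic KBC can be sketched as follows, assuming the same set-of-triplets KB. The data is invented for illustration; the point is that completion only works when s, r, and t all already appear somewhere in the KB.

```python
# Knowledge Base Completion sketch: the triplet (s, r, t) is missing, but
# each of its three elements already appears somewhere in the KB.
kb = {("obama", "citizen_of", "usa"), ("trudeau", "born_in", "canada")}

def try_complete(s: str, r: str, t: str) -> bool:
    """Map (s, r, t) only if all three elements are already known to the KB."""
    entities = {x for (a, _, b) in kb for x in (a, b)}
    relations = {rel for (_, rel, _) in kb}
    if s in entities and r in relations and t in entities:
        kb.add((s, r, t))
        return True
    return False          # an unknown element: plain KBC can't help here

print(try_complete("obama", "born_in", "usa"))     # True: all elements known
print(try_complete("obama", "advised_by", "usa"))  # False: unknown relation
```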


Here’s What’s Really Needed

If your user asks a question or makes a statement in which any of the s, r, t elements are not present, the chatbot can’t respond.  What’s needed is a system where the chatbot can continuously learn from new questions and incorporate them automatically into the knowledge base.

In other words, anything you say to or ask of your chatbot should make it smarter just by talking to it.


The Breakthrough

Fortunately, three researchers from the University of Illinois at Chicago, Sahisnu Mazumder, Nianzu Ma, and Bing Liu, have just published the results of work that opens up this possibility. 

The study “Towards an Engine for Lifelong Interactive Knowledge Learning in Human-Machine Conversations” is all about a new technique called lifelong interactive learning and inference (LiLi), imitating how humans acquire knowledge and perform inference during an interactive conversation. 

Here the ‘s’ (what the user wants to know – an entity) is captured from the input conversation, and the ‘r’ (relationship) and the ‘t’ (target entity) are discovered in real time by a clever combination of reinforcement learning and LSTM deep learning models.

LiLi becomes the lifelong learning component that adds to your chatbot’s knowledge base every time a user makes a statement or asks a question that’s not currently in the KB.


How It Works

The problem divides logically into two parts.  If the user makes a statement (e.g. Obama was born in the USA) the system makes a query of the KB (Obama, BornIn, USA) to determine whether this ‘true fact’ is present.  If not it is added. 

Suppose however that the triplet that is already in the KB is (Obama, CitizenOf, USA) but not the triplet containing the ‘BornIn’ relationship.  Then the RL/LSTM program will, over time, cause the system to recognize the high likelihood that ‘CitizenOf’ and ‘BornIn’ are logical equivalents in this context.
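One hedged way to illustrate that equivalence intuition is simple pair overlap: two relations that frequently connect the same (source, target) pairs are likely interchangeable. The real LiLi system learns this with RL and LSTM models; this sketch (with invented data) only shows the co-occurrence signal.

```python
# Toy triplet store: 'citizen_of' and 'born_in' connect mostly the same pairs.
triplets = [
    ("obama", "citizen_of", "usa"), ("obama", "born_in", "usa"),
    ("merkel", "citizen_of", "germany"), ("merkel", "born_in", "germany"),
    ("turing", "citizen_of", "uk"),
]

def relation_overlap(r1: str, r2: str) -> float:
    """Jaccard similarity of the (source, target) pairs two relations link."""
    pairs1 = {(s, t) for (s, r, t) in triplets if r == r1}
    pairs2 = {(s, t) for (s, r, t) in triplets if r == r2}
    return len(pairs1 & pairs2) / len(pairs1 | pairs2)

print(relation_overlap("citizen_of", "born_in"))   # 2 shared of 3 pairs -> ~0.67
```

A high overlap score would let the system treat an assertion with one relation as evidence for the other.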

The methods by which the entities and relationships are extracted from the conversation are a separate NLU process you can review here.

The second case however is the more difficult.  This occurs when the input is a query where either or both the relationship or the entities are not in the KB.  For example, if the user asks “was Obama born in the USA?” how do we proceed if ‘Obama’ or ‘born in’ or ‘USA’ are not already in the KB?

Upon receiving the query with unknown components LiLi executes a series of interleaved steps involving:

  • An Inference Strategy: Can a chain of inference be established from existing KB triplets that have a high likelihood of matching?  For example, discovering that ‘CitizenOf’ and ‘BornIn’ have a high likelihood of equivalence.
  • Asking Questions for Clarification: Like human beings in conversation, if we don’t clearly understand our partner’s question we’ll ask clarifying questions.  After each clarifying question LiLi, like humans, will once again loop through the inference strategy and decide if additional clarifying questions are necessary.

Needless to say, deciding which clarifying questions are appropriate and how many times you can go back to the user to ask them are not trivial issues that are addressed by LiLi.

Here are the actions that are available in LiLi in the typical order in which they would occur.

  1. Search for the source (s) and target (t) entities and the query relation (r) in the KB.
  2. Ask the user to provide an example/clue for the query relation (r).
  3. Ask the user to provide the missing link for the path completion feature.
  4. Ask the user to provide the connecting link for augmenting a new entity in the KB.
  5. Extract path features between the source (s) and target (t) entities using C-PR.
  6. Store the query data instance in the data buffer and invoke the prediction model for inference.
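The action sequence above can be sketched as a simple interleaved loop. This is a hypothetical illustration, not the authors' code: the `ask_user` callback, the question budget, and the final membership check (standing in for the prediction model) are all invented.

```python
def lili_loop(s, r, t, kb, ask_user, max_questions=2):
    """Sketch of the interleaved strategy: search the KB, ask the user
    clarifying questions for missing pieces, then hand off to inference."""
    questions = 0
    while (s, r, t) not in kb and questions < max_questions:
        clue = ask_user(f"Can you give an example or clue for '{r}'?")
        if clue:
            kb.add(clue)          # augment the KB with the user's clue
        questions += 1
    return (s, r, t) in kb        # placeholder for the prediction model

kb = {("obama", "citizen_of", "usa")}
answers = iter([("obama", "born_in", "usa")])   # the user's one clue
print(lili_loop("obama", "born_in", "usa", kb, lambda q: next(answers)))  # True
```

The real system's hard part, deciding *which* clarifying question to ask and when to stop, is exactly what the RL policy learns.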

An exchange using the Obama example might occur as follows (the annotation “SFn” indicates a clue necessary for the inference strategy). 

The Data Science

In brief, the RL model, which uses Q-learning, has the goal of formulating a strategy that makes the inference task possible.  LiLi’s strategy formulation is modeled as a Markov Decision Process.  LiLi also uses an LSTM to create a vector representation of each feature.

The system also contains a routine for ‘guessing’, used when the user is not able to offer a clue or example, which is reported to be significantly better than a coin toss.
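For readers unfamiliar with Q-learning, here is the core update rule in tabular form. The toy state and action space is illustrative only, not LiLi's actual MDP.

```python
import numpy as np

# Minimal tabular Q-learning: learn action values from rewards.
n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9            # learning rate, discount factor

def q_update(s, a, reward, s_next):
    """Move Q[s, a] toward the observed reward plus discounted future value."""
    target = reward + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

q_update(s=0, a=1, reward=1.0, s_next=1)
print(Q[0, 1])                     # 0.5 after one step toward target 1.0
```

In LiLi's setting, states encode what is known about the query so far, actions are the six operations listed earlier, and rewards reflect whether the eventual inference succeeded.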

See the original study here.

LiLi represents a significant step forward in making our chatbots smarter without artificially inflating the knowledge base beyond what our users really want to know.


Other articles in this series:

Beginners Guide to Chatbots

Under the Hood With Chatbots

Chatbot Best Practices – Making Sure Your Bot Plays Well With Users


About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist since 2001.




Advanced Analytics Platforms – Big Changes in the Leaderboard


Summary: The Magic Quadrant for Advanced Analytic and ML Platforms is just out and there are some big changes in the leaderboard.  Not only are there some surprising upgrades but some equally notable long falls.


The Gartner Magic Quadrant for Advanced Analytic and ML Platforms came out on February 22nd and there are some big changes in the leaderboard.  Not only are there some surprising upgrades (Alteryx, KNIME) but some equally notable long falls for traditional players (IBM, Dataiku, and Teradata).

Blue dots are 2018, gray dots 2017.

For those of you keeping track, Gartner split this field in 2017 so that “Advanced Analytic & Machine Learning Platforms” (machine learning added just this year) is separate from “Business Intelligence and Analytic Platforms”.  There’s significant overlap in the players appearing in both, including SAS, IBM, Microsoft, SAP, Alteryx, and a number of others.  Additionally, you’ll find all the data viz offerings there, like Tableau and Qlik.

From a data science perspective though, we want to keep our eye on the Advanced Analytics platforms.  In prior years the changes had been much more incremental.  This year’s big moves seem out of character, so we dove into the details to see whether the scoring had changed or the nature of the offerings was the key.


Has the Scoring Changed?

We read the 2018 and 2017 reports side-by-side looking for any major changes in scoring that might explain these moves.  We didn’t find any.  The scoring criteria and eligibility for each year remain essentially unchanged.

Of course raters are always influenced by changes in the market and such impacts can be subtle.  In the narrative explanation of markets and capabilities we found only a few hints at how scoring might have been impacted.


New Emphasis on Integrated Machine Learning

We all want our platforms and their components to operate seamlessly.  Last year the criteria were perhaps a bit looser, with Gartner looking for “reasonably interoperable” components.  This year there is much more emphasis on a fully integrated pipeline from accessing and analyzing data through to operationalizing models and managing content.


Machine Learning is a Key Component – AI Gets Noticed but Not Scored

It was important that ML capability be either included in the platform or easily accessed through open source libraries.  To their credit, Gartner does not fall into the linguistic trap of conflating Machine Learning with AI.  They define the capabilities they are looking for as including “support for modern machine-learning approaches like ensemble techniques (boosting, bagging and random forests) and deep learning”.

They acknowledge the hype around AI but draw a relatively firm boundary between AI and ML, with ML as an enabler of AI.  Note for example that deep learning was included above.  I’m sure we’re only a year or two away from seeing more specific requirements around AI finding their way into this score.


Automated ML

Gartner is looking for features that automate some portion of the process, like feature generation or hyperparameter tuning.  Many packages contain limited forms of these.

While some of the majors like SAS and SPSS have introduced more and more automation into their platforms, none of the pure-play AML platforms are yet included.  DataRobot gets honorable mention as does Amazon (presumably referring to their new SageMaker offering).  I expect within one or two years at least one pure play AML platform will make this list.


Acquisition and Consolidations

Particularly among challengers, adding capability through acquisition continues to be a key strategy, though none of these seemed to move the needle much this year.

Notable acquisitions called out by Gartner for this review include DataRobot’s acquisition of Nutonian, Progress’ acquisition of DataRPM, and TIBCO Software’s acquisition of Statistica (from Quest Software) and Alpine Data.

Several of these consolidations had the impact of taking previously ranked players off the table presumably providing room for new competitors to be ranked.


Big Winners and Losers

So if the difference is not in the scoring, it must be in the detail of the offerings.  The three that really caught our eye were the rise of Alteryx and H2O.ai into the Leaders box and the rapid descent of IBM out of it.



From its roots in data blending and prep, Alteryx has continuously added to its on-board modeling and machine learning capabilities.  In 2017 it acquired Yhat, which rounded out its capabilities in model deployment and management.  Then, thanks to the capital infusion from its 2017 IPO, it upped its game in customer acquisition.

Alteryx’ vision has always been an easy to use platform allowing LOB managers and citizen data scientists to participate.  Gartner also reports very high customer satisfaction. 

Last year Gartner described this as a ‘land and expand’ strategy, moving customers from self-service data acquisition all the way over to predictive analytics.  This win has been more like a Super Bowl competitor working their way up field with good offense and defense.  Now they’ve made the jump into the ranks of top contenders.

H2O.ai moved from Visionary to Leader based on improvements in its offering and execution.  This is a coder’s platform and won’t appeal to LOB or CDS audiences.  However, thanks to being mostly open source, H2O.ai has acquired a reputation as a thought leader and innovator in the ML and deep learning communities.

While a code-centric platform does not immediately bring to mind ease of use, the Deep Water module offers a deep learning front end that abstracts away the back end details of TensorFlow and other DL packages.  They also may be on the verge of creating a truly easy to use DL front end which they’ve dubbed ‘Driverless AI’.  They also boast a best-in-class Spark integration.

Their open source business model, in which practically everything can be downloaded for free, has historically crimped their revenue, which continues to rely largely on subscription-based technical support.  However, Gartner reports H2O.ai has 100,000 data scientist users and a strong partner group including Angoss, IBM, Intel, RapidMiner, and TIBCO, which along with its strong technical position makes its revenue prospects stronger.



Just last year IBM pulled ahead of SAS as the leader among all the vendors in this segment.  This year saw a pretty remarkable downgrade on ability to execute and also on completeness of vision.  Even so, with its huge built-in customer base, IBM continues to command 9.5% of the market in this segment.

Comparing last year’s analysis with this year’s, it seems that IBM has just gotten ahead of itself with too many new offerings.  The core Gartner rating remains based on the solid SPSS product but notes that it seems dated and expensive to some customers.  Going back to 2015, IBM had expanded the Watson brand, which used to be exclusive to its famous question-answering machine, to cover, confusingly, a greatly expanded group of analytic products.  Then in 2016 IBM doubled down on the confusion by introducing its DSX (Data Science Experience) platform as a separate offering aimed primarily at open source coders.

The customers that Gartner surveyed in 2017 for this year’s rating just couldn’t figure it out.  Too many offerings got in the way of support and customer satisfaction, though Gartner looked past the lack of integration to give some extra points for vision.

IBM could easily bounce back if they clear up this multi-headed approach, show us how it’s supposed to be integrated into one offering, and then provide the support to make the transition.  Better luck next year.



About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001.



Off the Beaten Path – HTM-based Strong AI Beats RNNs and CNNs at Prediction and Anomaly Detection


Summary: This is the second in our “Off the Beaten Path” series looking at innovators in machine learning who have elected strategies and methods outside of the mainstream.  In this article we look at Numenta’s unique approach to scalar prediction and anomaly detection based on their own brain research.


Numenta, the machine intelligence company founded in 2005 by Jeff Hawkins of Palm Pilot fame, might well be the poster child for ‘off the beaten path’.  More a research laboratory than a commercial venture, Numenta under Hawkins has been pursuing a strong-AI model of computation intended at once to directly model the human brain and, as a result, to be a general-purpose solution to all types of machine learning problems.

After swimming against the tide of the ‘narrow’ or ‘weak’ AI approaches represented by deep learning’s CNNs and RNN/LSTMs his bet is starting to pay off.  There are now benchmarked studies showing that Numenta’s strong AI computational approach can outperform CNN/RNN based deep learning (which Hawkins characterizes as ‘classic’ AI) at scalar predictions (future values of things like commodity, energy, or stock prices) and at anomaly detection.


How Is It Different from Current Deep Learning

The ‘strong AI’ approach pursued by Numenta relies on computational models drawn directly from the brain’s own architecture. 

‘Weak AI’ by contrast, represented by the full panoply of deep neural nets, acknowledges that it is only suggestive of true brain function, but it gets results. 

We are all aware of the successes in image, video, text, and speech analysis that CNNs and RNN/LSTMs have achieved, and that is their primary defense.  They work.  They give good commercial results.  But we are also beginning to recognize their weaknesses: large training sets, susceptibility to noise, long training times, complex setup, inability to adapt to changing data, and time invariance.  These weaknesses begin to show us where the limits of their development will lead us.

Numenta’s computational approach has a few similarities to these and many unique contributions that require those of us involved in deep learning to consider a wholly different computational paradigm.


Hierarchical Temporal Memory (HTM)

It would take several articles of this length to do justice to the methods introduced by Numenta.  Here are the highlights.

HTM and Time:  Hawkins uses the term Hierarchical Temporal Memory (HTM) to describe his overall approach.  The first key to understanding is that HTM relies on data that streams over time.  Quoting from previous interviews, Hawkins says,

“The brain does two things: it does inference, which is recognizing patterns, and it does behavior, which is generating patterns or generating motor behavior.  Ninety-nine percent of inference is time-based – language, audition, touch – it’s all time-based. You can’t understand touch without moving your hand. The order in which patterns occur is very important.”

So the ‘memory’ element of HTM is how the brain differently interprets or relates each sequential input, also called ‘sequence memory’.

By contrast, conventional deep learning uses static data and is therefore time invariant.  Even RNN/LSTMs that process speech, which is time based, actually do so on static datasets.


Sparse Distributed Representations (SDRs):  Each momentary input to the brain, for example from the eye or from touch is understood to be some subset of all the available neurons in that sensor (eye, ear, finger) firing and forwarding that signal upward to other neurons for processing.

Since not all the available neurons fire for each input, the signal sent forward can be seen as a Sparse Distributed Representation (SDR) of those that have fired (the 1s in the binary representation) versus the hundreds or thousands that have not (the 0s in the binary representation).  We know from research that on average only about 2% of neurons fire with any given event giving meaning to the term ‘sparse’.

SDRs lend themselves to vector arrays and because they are sparse have the interesting characteristics that they can be extensively compressed without losing meaning and are very resistant to noise and false positives.
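The noise resistance can be illustrated with a toy SDR, assuming the roughly 2% sparsity cited above (2,048 bits with 40 active). The sizes are conventional HTM examples, not from any specific Numenta benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sdr(n=2048, active=40):
    """A binary SDR: ~2% of bits active, mirroring cortical firing sparsity."""
    sdr = np.zeros(n, dtype=np.int8)
    sdr[rng.choice(n, size=active, replace=False)] = 1
    return sdr

a = make_sdr()
noisy = a.copy()
drop = rng.choice(np.flatnonzero(a), size=10, replace=False)
noisy[drop] = 0                      # noise wipes out 25% of the active bits
overlap = int(a @ noisy)             # shared active bits still identify 'a'
print(overlap)                       # 30 of the original 40 bits still match
```

Because a random unrelated SDR would share almost no active bits with `a`, even a heavily corrupted representation remains unambiguous, which is the property that makes SDRs compressible and robust to false positives.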

By comparison, deep neural nets fire all neurons in each layer, or at least those that have reached the impulse threshold.  Current researchers acknowledge this as a drawback in moving DNNs much beyond where they are today.


Learning is Continuous and Unsupervised:  Like CNNs and RNNs this is a feature generating system that learns from unlabeled data.

When we look at diagrams of CNNs and RNNs, they are typically shown as multiple (deep) layers of neurons that decrease in pyramidal fashion as the signal progresses, presumably discovering features in this self-constricting architecture down to the final classifying layer.

HTM architecture by contrast is simply columnar with columns of computational neurons passing the information on to upward layers in which pattern discovery and recognition occurs organically by the comparison of one SDR (a single time signal) to the others in the signal train. 

HTM has the characteristic that it discovers these patterns very rapidly, with as few as roughly 1,000 SDR observations.  This compares with the hundreds of thousands or millions of observations necessary to train CNNs or RNNs.

Also the pattern recognition is unsupervised and can recognize and generalize about changes in the pattern based on changing inputs as soon as they occur.  This results in a system that not only trains remarkably quickly but also is self-learning, adaptive, and not confused by changes in the data or by noise.

Numenta offers a deep library of explanatory papers and YouTube videos for those wanting to experiment hands-on.


Where Does HTM Excel

HTM has for many years been a work in progress.  That has recently changed.  Numenta has published several peer reviewed performance papers and established benchmarking in areas of its strength that highlight its superiority over traditional DNNs and other ML methods on particular types of problems.

In general, Numenta says that the current state of its technology represented by its open source project NuPIC (Numenta Platform for Intelligent Computing) currently excels in three areas:

Anomaly Detection in streaming data.  For example:

  • Highlighting anomalies in the behavior of moving objects, such as tracking a fleet’s movements on a truck by truck basis using geospatial data.
  • Understanding if human behavior is normal or abnormal on a securities trading floor.
  • Predicting failure in a complex machine based on data from many sensors.

Scalar Predictions, for example:

  • Predicting energy usage for a utility on a customer by customer basis.
  • Predicting New York City taxi passenger demand 2 ½ hours in advance based on a public data stream provided by the New York City Transportation Authority.

Highly Accurate Semantic Search on static and streaming data (these examples are from Cortical.io, a Numenta commercial partner using the SDR concept but not NuPIC).

  • Automate extraction of key information from contracts and other legal documents.
  • Quickly find similar cases to efficiently solve support requests.
  • Extract topics from different data sources (e.g. emails, social media) and determine customers’ intent.
  • Terrorism Prevention: Monitor all social media messages alluding to terrorist activity even if they don’t use known keywords.
  • Reputation Management: Track all social media posts mentioning a business area or product type without having to type hundreds of keywords.


Two Specific Examples of Performance Better Than DNNs

Taxi Demand Forecast

In this project, the objective was to predict the demand for New York City taxi service 2 ½ hours in advance based on a public data stream provided by the New York City Transportation Authority.  This was based on historical streaming data at 30-minute intervals, using the previous 1,000, 3,000, or 6,000 observations as the basis for projecting 5 periods (2 ½ hours) ahead.  The study (which you can see here) compared ARIMA, TDNN, and LSTM to the HTM method, and HTM demonstrated the lowest error rate.



Machine Failure Prediction (Anomaly)

The objective of this test was to compare two of the most popular anomaly detection routines (Twitter ADVec and Etsy Skyline) against HTM in a machine failure scenario.  In this type of IoT application it’s important that the analytics detect all the anomalies present, detect them as far before occurrence as possible, trigger no false alarms (false positives), and work with real time data.  A full description of the study can be found here.

The results showed that the Numenta HTM outperformed the other methods by a significant margin. 

Even more significantly, the Numenta HTM method identified the potential failure a full 3 hours before the other techniques.


You can find other benchmark studies on the Numenta site.


The Path Forward

Several things are worthy of note here since as we mentioned earlier the Numenta HTM platform is still a work in progress.



Numenta’s business model currently calls for it to be the center of a commercial ecosystem while retaining its primary research focus.  Currently Numenta has two commercially licensed partners: Cortical.io, which focuses on streaming text and semantic interpretation, and Grok, which has adapted the NuPIC core platform for anomaly detection in all types of IT operational scenarios.  The core NuPIC platform is open source if you’re motivated to experiment with potential commercial applications.


Image and Text Classification

A notable absence from the current list of capabilities is image and text classification.  There are no current plans for Numenta to develop image classification from static data since that is not on the critical path defined by streaming data.  It’s worth noting that others have demonstrated the use of HTM as a superior technique for image classification without using the NuPIC platform.


Near Term Innovation

In my conversation with Christy Maver, Numenta’s VP of Marketing, she said they are confident they will have a fairly complete framework for how the neocortex works within perhaps a year.  This last push is in the area of sensorimotor integration, which would be the core concept in applying the HTM architecture to robotics.

For commercial development, the focus will be on partners licensing the IP.  Even IBM established a Cortical Research Center a few years back, staffed with about 100 researchers, to examine the Numenta HTM approach.  Like so many others now trying to advance AI by more closely modeling the brain, IBM, along with Intel and others, has moved off in the direction of specialty chips in the category of neuromorphic or spiking chips.  Brainchip, out of Irvine, already has a spiking neuromorphic chip in commercial use.  As Maver alluded, there may be a silicon representation of Numenta’s HTM in the future.



Other articles in this series:

Off the Beaten path – Using Deep Forests to Outperform CNNs and RNNs



About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist since 2001.
