13 Great Articles from AnalyticBridge

AnalyticBridge is one of Data Science Central's channels. Below is a selection of popular articles posted a while back:

Enjoy the reading!

Quantum Computing and Deep Learning. How Soon? How Fast?

Summary:  Quantum computing is now a commercial reality.  Here’s the story of the companies that are currently using it in operations and how this will soon disrupt artificial intelligence and deep learning.


Like a magician distracting us with one hand while pulling a fast one with the other, Quantum computing has crossed over from research to commercialization almost without us noticing.

Is that right?  Has the dream of Quantum computing actually stepped out of the lab and into the world of actual application?  Well, Lockheed Martin has been using it for seven years.  Temporal Defense Systems, a leading-edge cyber security firm, is using one.  So are two Australian banks, Westpac and Commonwealth, and Telstra, an Australian mobile telecom company.  Add to that the new IBM Q program, which offers commercial quantum compute time via API; IBM says, “To date users have run more than 300,000 quantum experiments on the IBM Cloud”.

Of all the exotic developments to come out of the lab and into the world of data science, none can beat Quantum computing.  Using the state or spin of photons or electrons (qubits) and exploiting ‘spooky entanglement at a distance’, these futuristic computers look and program like nothing we have seen before.  And they claim to know all the possible solutions to any complex problem instantaneously.


Has the Commercialization Barrier Really Been Breached?

When we look back at other technologies that have moved out of the lab and entered commercialization it’s usually possible to spot at least one key element of the technology that allowed for explosive takeoff. 

In genomics it was Illumina’s low cost gene sequencing devices.  In IoT I would point at Spark.  And in deep learning I would look at the use of GPUs and FPGAs to dramatically accelerate the speed of processing in deep nets.

In Quantum it’s not yet so clear what that single technology driver might be, except that within a very short period of time we have gone from one- and two-Qubit machines to 2,000-Qubit machines that are being sold commercially by D-Wave.  Incidentally, D-Wave is on track to double that every year, with a 4,000-Qubit machine expected next year.  The research firm MarketsandMarkets reports that the Chinese have developed a 24,000-Qubit machine, but it’s not available commercially.


What Commercial Applications are Being Adopted?

It appears that while adoption is going to expand rapidly, current commercial applications are fairly narrow.

Lockheed Martin: For all its diversity, Lockheed Martin spends most of its time writing code for pretty exotic applications like missile and fire control and space systems.  Yet, on average, half the cost of creating these programs goes to verification and validation (V and V).

In 2010 Lockheed became D-Wave’s first commercial customer after testing whether (now seven-year-old) Quantum computers could spot errors in complex code.  Even that far back, D-Wave’s earliest machine found the errors in six weeks, compared to the many man-months Lockheed Martin’s best engineers had required.

Today, after having upgraded twice to D-Wave’s newest, largest machines, Lockheed has several applications, but chief among them is instantly debugging millions of lines of code.

Temporal Defense Systems (TDS):  TDS is using the latest D-Wave 2000Q to build its advanced cyber security system, the Quantum Security Model.  According to James Burrell, TDS Chief Technology Officer and former FBI Deputy Assistant Director, the new system will deliver a wholly new level of protection, with real-time security level rating, device-to-device authentication, identification of long-term persistent threats, and detection and prevention of insider threats before network compromise and data theft occur.

Westpac, Commonwealth, and Telstra:  While the Australians are committed to getting out ahead, their approach has been a little different.  Commonwealth recently announced a large investment in a Quantum simulator, while Westpac and Telstra have made sizable ownership investments in Quantum computing companies focused on cyber security.  The Commonwealth simulator is meant to give them a head start on understanding how the technology can be applied to banking, which is expected to include at least these areas: 

  • Ultra-strong encryption of confidential data.
  • Continuously run Monte Carlo simulations which provide a bank with a picture of its financial position; currently these are run twice a day.
  • Move anti-money-laundering compliance checks to real time.
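The continuously-run Monte Carlo idea above can be sketched in a few lines. The lognormal random-walk model, the parameters, and the Value-at-Risk reading of the output below are illustrative assumptions, not any bank's actual risk engine:

```python
import numpy as np

def monte_carlo_position(current_value, daily_mu, daily_sigma,
                         horizon_days, n_paths=10_000, seed=0):
    """Simulate possible portfolio values over a horizon using a simple
    lognormal random-walk model (an illustrative assumption)."""
    rng = np.random.default_rng(seed)
    # Draw daily log-returns for every path, then compound them.
    log_returns = rng.normal(daily_mu, daily_sigma,
                             size=(n_paths, horizon_days))
    return current_value * np.exp(log_returns.sum(axis=1))

paths = monte_carlo_position(1_000_000, daily_mu=0.0002,
                             daily_sigma=0.01, horizon_days=10)
# The 5th percentile gives a simple Value-at-Risk style lower bound.
var_5 = np.percentile(paths, 5)
```

Running such a simulation continuously, instead of twice a day, is mostly a matter of compute budget, which is exactly where a faster machine changes the picture.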

The Australian efforts are all focused on a home-grown system being developed at the University of New South Wales, which expects to have a commercially available 10-Qubit machine by 2020, about three years from now.

QuantumX: There is now even an incubator focusing solely on Quantum computing applications called QuantumX with offices in Cambridge and San Francisco.

As for operational business uses, these applications are not overwhelmingly diverse, but this harkens back to about 2005, when Google was using the first NoSQL DB to improve its internal search algorithms.  Only two years later the world had Hadoop.


Quantum Computing and Deep Learning

Here’s where it gets interesting.  All these anomaly-detecting cyber security, IV&V, and Monte Carlo simulation applications are indeed part of data science, but what about deep learning?  Can Quantum computing be repurposed to dramatically speed up Convolutional and Recurrent Neural Nets, and Adversarial and Reinforcement Learning, with their multitudes of hidden layers that slow everything down?  As it turns out, yes it can.  And the results are quite amazing.

The standard description of how Quantum computers can be used falls in these three categories:

  1. Simulation
  2. Optimization
  3. Sampling

This might throw you off the track a bit until you realize that pretty much any gradient descent problem can be seen as either an optimization or a sampling problem.  So while the press over the last year has been feeding us examples about solving complex traveling-salesman optimizations, the most important action for data science has been in plotting how Quantum will disrupt the way we approach deep learning.
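To see why gradient descent reads as an optimization problem, here is the classical loop in miniature; quantum annealers aim to find such minima by sampling low-energy states rather than iterating step by step. The function being minimized is a toy example:

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Plain gradient descent: repeatedly step against the gradient
    until the iterate settles near a minimum."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Every iteration of this loop is classical compute; the promise discussed above is collapsing many such iterations into a single sampling operation.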

The exact scope of the improvement, and the methods being employed, will have to wait for the next article in this series.


Does Size Count?

You may have read that IBM Q’s Quantum machine, available in the cloud via API, is 17 Qubits while D-Wave’s is now 2,000 Qubits.  Does this mean IBM’s is tiny by comparison?  Actually, no.  IBM and D-Wave use two completely different architectures in their machines, so their compute capability is roughly equal.

D-Wave’s system is based on the concept of quantum annealing and uses a magnetic field to perform qubit operations.

IBM’s system is based on a ‘gate model’, which is considered both more advanced and more complicated.  So when IBM moves from 16 qubits to 17 qubits, its computational ability doubles.
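The doubling follows from the state space of a gate-model register: n qubits span 2^n basis states, so each extra qubit doubles the space a classical simulator would have to track. A trivial sketch:

```python
def state_space_size(n_qubits):
    """A gate-model register over n qubits spans 2**n basis states,
    which is why adding one qubit doubles the classical memory needed
    to represent its full state vector."""
    return 2 ** n_qubits

# Going from 16 to 17 qubits doubles the state space.
assert state_space_size(17) == 2 * state_space_size(16)
```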

There are still other architectures in development.  The Australians, for example, made an early commitment to do this all in silicon.  Microsoft is believed to be using a new topology and a particle that has yet to be discovered.

Then there’s the issue of reliability.  Factors such as the number and quality of the qubits, circuit connectivity, and error rates all come into play.  IBM has proposed a new metric, which it calls ‘Quantum Volume’, to measure the computational power of Quantum systems while taking all these factors into account.

So there’s plenty of room for the research to run before we converge on a common sense of what’s best.  That single view of what’s best may be several years off, but like Hadoop in 2007, you can buy or rent one today and get ahead of the crowd.

Next time, more specifics on how this could disrupt our path to artificial intelligence and deep learning.



About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001.  He can be reached at:


Parameter Selection in Classification for Financial Market

In practice, we often have to make parameterization choices for a given classifier in order to achieve optimal classification performance; to name a few examples:

  • Neural Network: e.g., the optimal choice of Activation Functions, # of hidden units
  • Support Vector Machine: e.g., the optimal choice of Kernel Functions
  • Ensemble: e.g., the number of Learning Cycles for Bagging.
  • Discriminant Analysis: e.g., Linear/Quadratic; regularization choices for covariance matrix.
  • Naïve Bayes: e.g., Kernel choices; bandwidth selections.
  • K-nearest Neighbours: e.g., Distance metrics; k in kNN.
  • Decision Tree: e.g., Impurity measure choices; Tree Size Constraint.
  • Logistic Regression

In the following paper, we discuss in detail how these parameterization choices are made in the context of the financial market, and how the parameters are tuned to achieve optimal performance for each classifier mentioned above. The paper is available at https://ssrn.com/abstract=2967184 ; the presentation slides give a summary: https://ssrn.com/abstract=2973065
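As a generic illustration of this kind of tuning (not the paper's actual procedure, data, or parameter grids), a cross-validated grid search over k-nearest-neighbours parameters with scikit-learn might look like this:

```python
# Illustrative grid search: the dataset is synthetic and the parameter
# grid is an assumption, not the grid used in the paper.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 11],          # k in kNN
                "metric": ["euclidean", "manhattan"]},  # distance metric
    cv=5,                    # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)
best_params = search.best_params_   # the tuned k and distance metric
```

The same pattern applies to the other classifiers listed above: swap in the estimator and its parameter grid (kernel functions for an SVM, tree-size constraints for a decision tree, and so on).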

How to Solve the New $1 Million Kaggle Problem – Home Value Estimates

A new competition has been posted on Kaggle, and the prize is $1.2 million. Here we provide some help with solving this new problem: improving home value estimates, sponsored by Zillow. 

We have published in the past about home value forecasting; see here, here, and here.  In this article, I provide specific advice related to this new competition, for anyone interested in competing or curious about home value forecasting. Additional advice can be obtained by contacting me. More specifically, I provide high-level advice here, rather than the selection of specific statistical models or algorithms, though I also discuss algorithm selection in the last section. I believe that designing sound (usually compound) metrics, assessing data quality and taking it into account, and finding external data sources to gain a competitive edge and support cross-validation are even more important than the purely statistical modeling aspects.

1. First, let’s discuss the data 

Where I live (see picture below), all neighboring houses are valued very close to $1.0 million. There are, however, significant differences: some lots are much bigger, home sizes vary from 2,600 to 3,100 square feet, and some houses have a great view. These differences are not well reflected in the home values, even though Zillow has access to some of this data.

Regarding my vacation home (picture below), there are huge variations (due mostly to home size and view), and the true value ranges from $500k to $1.5 million. The market is much less fluid. But some spots are erroneously listed at $124k: if you look at the aerial picture below, some lots appear not to have a home, while in reality the house was built and sold two years ago. This might affect the estimated value of neighboring houses if Zillow does not discriminate between lots (not built) and homes: you would think that the main factor in Zillow’s model is the value of neighboring homes with known value (e.g., following a recent sale).

So the first questions are:

  • How do I get more accurate data?
  • How can I rely on Zillow data to further improve Zillow estimates?

We answer these questions in the next section.

2. Better leveraging available data, and getting additional data 

It is possible that Zillow is currently conservative in its home value estimates, putting too much emphasis on the average value in the close neighborhood of the target home, and not enough on the home features, as in the top figure. If this is the case, an easy improvement consists of increasing value differences between adjacent homes by boosting the importance of lot area and square footage in locations that have very homogeneous Zillow value estimates.  

Getting competitor data about home values, for instance from Trulia, and blending it with Zillow data, could help improve predictions. Such data can be obtained with a web crawler. Indeed, with distributed crawling, one could easily extract data for more than 100 million homes, covering most of the US market. Other data sources to consider include:

  • Demographics, education level, unemployment and household income data per zipcode
  • Foreclosure reports
  • Interest rates if historical data is of any importance (unlikely to be the case here)
  • Crime data and school ratings
  • Weather data, correlated with home values
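A minimal sketch of the crawling idea, with a hypothetical page format and price regex (real sites differ, and their terms of service must be respected):

```python
# Hypothetical extraction of a home-value figure from a listing page.
import re
import urllib.request

PRICE_RE = re.compile(r"\$([\d,]+)")

def parse_estimate(html):
    """Pull the first dollar figure out of a listing page's HTML."""
    match = PRICE_RE.search(html)
    return int(match.group(1).replace(",", "")) if match else None

def crawl(url):
    """Fetch one listing page and parse its estimate."""
    with urllib.request.urlopen(url) as resp:
        return parse_estimate(resp.read().decode("utf-8", errors="ignore"))

# Parsing works on any HTML snippet:
parse_estimate('<span class="estimate">$1,050,000</span>')  # 1050000
```

Distributed crawling would shard a URL list across workers running this loop, with rate limiting per site.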

3. Potential metrics to consider

Many statisticians are just happy to work with the metrics found in the data. However, deriving more complex metrics from the initial features (not to mention obtaining external data sources and thus additional features, or ‘control’ features) can prove very beneficial. The process of deriving complex metrics from base metrics is like building complex molecules out of basic atoms.

In this case, I suggest computing home values at some aggregated level, called a bin or bucket. Here a bin is possibly a zipcode, as a lot of data is available at the zipcode level. Then, for each individual home, compute an estimate based on the bin average and other metrics, such as recent sales prices for neighboring homes, a trend indicator for the bin in question (using time series analysis), and home features such as school rating, square footage, lot area, view or no view, and when the home was built. Crime stats, household income, and demographics are already factored in at the bin level. 
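The bin-then-refine idea can be sketched as follows; the field names, the nominal 2,800 sq ft baseline, and the per-square-foot adjustment weight are illustrative assumptions:

```python
# Aggregate known sale prices by zipcode (the 'bin'), then adjust each
# home's bin average by its own features.
from collections import defaultdict

def bin_averages(sales):
    """Average recent sale price per zipcode."""
    totals = defaultdict(lambda: [0.0, 0])
    for home in sales:
        t = totals[home["zip"]]
        t[0] += home["price"]
        t[1] += 1
    return {z: s / n for z, (s, n) in totals.items()}

def estimate(home, bin_avg, sqft_weight=100.0):
    """Bin average, nudged by how the home's square footage deviates
    from a nominal 2,800 sq ft (both numbers are assumptions)."""
    return bin_avg[home["zip"]] + sqft_weight * (home["sqft"] - 2800)

sales = [{"zip": "98004", "price": 950_000, "sqft": 2600},
         {"zip": "98004", "price": 1_050_000, "sqft": 3100}]
avg = bin_averages(sales)                            # {'98004': 1000000.0}
est = estimate({"zip": "98004", "sqft": 3000}, avg)  # 1020000.0
```

A real model would fold in the other metrics listed above (trend indicator, school rating, view) as additional adjustment terms.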

Some decay function should be used to lower the impact of sales prices older than a few months, especially in hot markets. If possible, having an idea of the home mix in the neighborhood in question (number of investment properties, family homes, vacation homes, turnover, rentals) can help further refine the predictions. Seasonality is also an important part of the mix. If possible, include property tax data in your model. Differences between listed price and actual price when sold (if available) can help you compute trends at the zipcode level. The same goes for increases or decreases in ‘time on market’ (time elapsed between being listed and being sold or pulled from the market).
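A simple choice of decay function is exponential, halving a comparable sale's weight every few months; the three-month half-life below is an illustrative assumption, and a hot market would call for a shorter one:

```python
def sale_weight(age_months, half_life_months=3.0):
    """Exponential decay: a comparable sale loses half its weight
    every half_life_months."""
    return 0.5 ** (age_months / half_life_months)

# A sale from today counts fully, a 3-month-old sale counts half,
# a 6-month-old sale counts a quarter.
weights = [sale_weight(m) for m in (0, 3, 6)]  # [1.0, 0.5, 0.25]
```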

4. Model selection and performance

With just a few (properly binned) features, a simple predictive algorithm such as HDT (Hidden Decision Trees – a combination of multiple decision trees and special regression) can work well for homes in zipcodes (or buckets of zipcodes) with 200+ homes with recent historical sales prices. This should cover most urban areas. For smaller zipcodes, you might consider bundling them by county. The strength of HDT is its robustness and, if well executed, its ability to work over a long time period with little maintenance. HDTs that work with many features are described here. HDT also allows you to easily compute CI (confidence intervals) for your estimates, based on bin (zipcode) values.

However, chances are that performance, used to assess the winner among all competitors, will be based on immediate, present data, just like with any POC (proof of concept). If that is the case, a more unstable model might work well enough to eventually collect the $1.2 million prize. It is critical to know how performance will be assessed, and to do proper cross-validation no matter what model you use. Cross-validation consists of estimating the value of homes with known (recent) sales prices that are not in your training set, or even better, located in a zipcode outside your training set. It would be a good idea to use at least 50% of all zipcodes in your training set for cross-validation purposes, assuming you have access to this relatively ‘big’ data. And having a substantial proportion of zipcodes with a full five years’ worth of historical data (not just sampled homes) would be great, as it would help you assess how well you can make local predictions based on a sample rather than on comprehensive data. If you only have access to a sample, make sure that it is not biased, and discuss the sampling procedure with the data provider. 
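Holding out entire zipcodes, rather than random rows, can be sketched like this (the record layout is a placeholder assumption):

```python
# Zipcode-level cross-validation split: roughly holdout_frac of the
# *zipcodes* (not rows) land in the validation set, so the model is
# scored on areas it never saw during training.
import random

def split_by_zipcode(homes, holdout_frac=0.5, seed=42):
    zips = sorted({h["zip"] for h in homes})
    rng = random.Random(seed)
    rng.shuffle(zips)
    cut = int(len(zips) * (1 - holdout_frac))
    train_zips = set(zips[:cut])
    train = [h for h in homes if h["zip"] in train_zips]
    valid = [h for h in homes if h["zip"] not in train_zips]
    return train, valid
```

A row-level random split would leak neighborhood information between train and validation sets; splitting by zipcode is what tests generalization to unseen areas.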

It is important to know how the performance metric (used to determine the winner) handles outlier data or volatile zipcodes. If it is just a straight average of squared errors, you might need a bit of luck to win the competition, in addition to having a strong methodology. Regardless, I would definitely stay away from classic linear models, unless you make them more robust by putting constraints on the model parameters (as in Lasso or Jackknife regression). 
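The constraint idea can be illustrated with scikit-learn's Lasso, whose L1 penalty shrinks unstable coefficients toward zero; the synthetic data below is only for demonstration:

```python
# Lasso demo on synthetic data: 10 features, only 2 carry signal.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# True model uses only features 0 and 1; the other 8 are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.1).fit(X, y)
# The L1 penalty drives most noise-feature coefficients to exactly zero.
n_used = int(np.sum(np.abs(model.coef_) > 1e-6))
```

This is the robustness argument in miniature: an unconstrained least-squares fit would assign small nonzero weights to every noise feature, and those weights drift as the data changes.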

Finally, as you can judge from this article, it helps to have domain expertise to win such competitions, or at least to build scalable solutions that will work for a long time. Hopefully, I have shared useful expertise with you in this article. 
