The DeepMind Strategy – How AI is Revolutionizing Business Models

This article was written by Francesco Corea.

Image Credit: Sergey Tarasov/Shutterstock

I. Overview:

AI is introducing radical innovation even in the way we think about business, and the aim of this section is to categorize different AI companies and business models.

It is possible to look at the AI sector as really similar, in terms of business models, to the biopharma industry: expensive and long R&D; long investment cycles; low-probability, enormous returns; concentration of funding in specific phases of development. There are, however, two differences between the two fields: the experimentation phase, which is much faster and less painful for AI, and the (absent) patenting period, which forces AI companies to evolve continuously and to use alternative revenue models (e.g., the freemium model).

II. The DeepMind Strategy and the Open Source Model:

If we look from the incumbents’ side, we might notice two different nuances in the evolution of their business models. First, the growth model is changing: instead of competing with emerging startups, the biggest incumbents are pursuing an aggressive acquisition strategy.

I named this new expansion strategy the “DeepMind strategy” because it has become extremely common since Google’s acquisition of DeepMind.

The companies are purchased when they are still early stage, in their first one to three years of life, when the focus is more on people and pure technological advancement than on revenues (AI is the only sector in which the pure team value exceeds the business value). They maintain elements of their original brand and retain the entire existing team (“acqui-hire”). The companies keep full independence, both physically (they often keep their original headquarters) and operationally. This independence is broad enough to allow them to pursue acquisition strategies in turn (DeepMind bought Dark Blue Labs and Vision Factory in 2014). The parent company uses the subsidiary’s services and integrates, rather than replaces, the existing business (e.g., Google Brain and DeepMind).

It seems, then, that the acquisition cost is much lower than the opportunity cost of leaving so many brains on the market, and that it works better to (over)pay for a company today than to be cut out a few years later. In this sense, these acquisitions are pure real-option tools: they represent possible future revenues and possible future underlying layers that incumbents might end up building on top of.

The second nuance to point out is the emergence of the open-source model in the AI sector, which is quite difficult to reconcile with the traditional SaaS model. Many cutting-edge technologies and algorithms are provided for free and can be easily downloaded. So why are incumbents paying huge sums, and startups working so hard, only to give it all away for free?

Well, there are a series of considerations to be made here. First, AI companies and departments are driven by scientists and academics, and their mindset encourages sharing and publicly presenting their findings. Second, open sourcing raises the bar of the current state of the art for potential competitors in the field: if it is publicly known what can be built with TensorFlow, another company that wants to challenge Google must publicly prove it can provide at least what TensorFlow allows. It also fosters use cases that were not envisioned at all by the providing company, and it establishes those tools as the underlying technology on top of which everything should be built.

III. Implications of Open-Source:

Releasing free software that does not require the presence of high-end hardware is also a great way for six things to happen, as discussed below.

  1. Lowering the barrier to adoption and gaining traction for products that would not have it otherwise.
  2. Troubleshooting, because many heads are more efficient at finding and fixing bugs, as well as at looking at things from different perspectives.
  3. (Crowd) validating, because the mechanisms, rationales, and implications are often not completely clear.
  4. Shortening the product cycle, because from the moment a technical paper is published or software is released, it takes only weeks for augmentations of that product to appear.
  5. Creating a competitive advantage in data creation/collection, in attracting talent, and in building additive products on top of that underlying technology.
  6. Most importantly, creating a data network effect, i.e., a situation in which more (final or intermediate) users create more data by using the software, which in turn makes the algorithms smarter, the product better, and eventually attracts more users.

To read the full original article click here. For more AI related articles on DSC click here.

Limits of Linear Models for Forecasting

This article was written by Blaine Bateman

In this post, I will demonstrate the use of nonlinear models for time series analysis and contrast them with linear models. I will use a (simulated) noisy, nonlinear time series of sales data, use multiple linear regression and a small neural network to fit training data, then predict 90 days forward. I implemented all of this in R, although it could be done in a number of coding environments. (Specifically, I used R 3.4.2 in RStudio 1.1.183 on Windows 10.)

It is worth noting that much of what is presented in the literature and trade media regarding neural networks concerns classification problems. Classification means there are a finite number of correct answers given a set of inputs. In image recognition, an application well served by neural networks, classification would include dog/not dog. This is a simplistic example and such methods can predict a very large number of classes, such as reading addresses on mail with machine vision and automatically sorting for delivery. In this post, I am exploring models that produce continuous outputs instead of a finite number of discrete outputs. Neural networks and other methods are very applicable to continuous prediction.

Another point to note is that there are many empirical methods available for time series analysis. For example, ARIMA (auto regressive integrative moving average) and related methods use a combination of time lagged data to predict the future. Often these approaches are used for relatively short-term prediction. In this post, I want to use business knowledge and data in a model to predict future sales. My view is that such models are more likely to behave well over time, and can be adapted for business changes by adding or removing factors deemed newly important or found unimportant.

The linear regression method is available in base R using the lm() function. For the neural network, I used the RPROP Algorithm, published by Martin Riedmiller and Heinrich Braun of the University of Karlsruhe. RPROP is a useful variation of neural network modeling that, in some forms, automatically finds the appropriate learning rate. 

For my purposes, I mainly use the rprop+ version of the algorithm, very nicely implemented by Stefan Fritsch & Frauke Guenther with contributors Marc Suling & Sebastian M. Mueller. The implementation is available as a library as well as source code. rprop+ appears to be quite resilient in that it easily converges without a lot of hyperparameter tuning. This is important to my point here which is implementation of nonlinear modeling isn’t necessarily more difficult than linear models.

The data are shown here:

Figure 1. Data used in this analysis. Data are synthesized time series data representing sales. Periodic spikes are noted along with long term nonlinear behavior.

The data are a simulated time series of sales data, which has spikes at quarterly and smaller periods, as well as longer term variations. There are about 3¼ years of data at daily granularity, and I want to test the potential to use the first 3 years as training data, then predict another 90 days into the future. The business case is that various factors are believed to influence sales, some internal to our business and some external. We have a set of 8 factors, one of which is past sales, the rest being market factors (such as GDP, economic activity, etc.) and internal data (such as the sales pipeline, sales incentive programs, new product introductions (NPI), etc.). The past sales are used with a phasing of one year, arrived at by noting that there are annual business cycles. (Note: there are many more rigorous ways to determine phasing; I’ll address that in another post.) These factors are labeled a, c, f, g, h, i, j, and k in what follows. The sales values are labeled Y. For each model, then, the 1210 daily values of the 8 factors are provided, plus the actual sales results, and a model is built that fits the historical data as well as possible.

Linear regression

Using the lm() function in R, I fit a linear model to the data. Linear means that each factor is multiplied by a coefficient (determined by the fit process) and these are simply added together to estimate the resulting sales. The equation looks as follows:

Y = C1*a + C2*c + C3*f + C4*g + C5*h + C6*i + C7*j + C8*k + C0

where, as noted, Y is the sales. Note that C0 is a constant value that is also determined by the regression modeling. Once I have such a model, sales are predicted by simply multiplying an instance of the factors by the coefficients and summing to get a prediction for sales. To predict future sales, values of the factors are needed. If the factors are not time lagged from the sales, then, for example, a forecast for GDP or the future NPI plan would be needed. Depending on the specific case, all the factors might be time lagged values and future sales can be predicted from known data. In some cases, a forecast is needed for some of the factors. These details are not important for the evaluation here.
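As a sketch of the idea (in Python rather than R, with synthetic stand-ins for the eight factors, since the actual data is not shown here), fitting and applying such a linear model amounts to an ordinary least-squares solve:

```python
import numpy as np

# Hypothetical stand-ins for the eight factors: 1210 daily observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(1210, 8))
true_coefs = np.array([2.0, -1.0, 0.5, 3.0, 0.0, 1.5, -0.5, 1.0])
Y = X @ true_coefs + 4.0 + rng.normal(scale=0.1, size=1210)  # C0 = 4.0 plus noise

# Fit Y = C1*a + ... + C8*k + C0 by ordinary least squares
# (the NumPy analogue of R's lm()).
A = np.column_stack([X, np.ones(len(X))])      # extra column for the constant C0
coefs, *_ = np.linalg.lstsq(A, Y, rcond=None)

# Predict by multiplying an instance of the factors by the coefficients and summing.
x_new = np.ones(8)
y_pred = x_new @ coefs[:8] + coefs[8]
```

On this synthetic data the recovered coefficients land very close to the ones used to generate it, which is all a linear fit can promise.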

Neural network tests

As a first step, I will use a simple neural network that has 8 input nodes (one for each factor) plus the “bias” node. (The bias node is motivated by understanding the behavior of a single unit, also known as a perceptron. Including a bias lets a single perceptron implement linearly separable logical operators such as AND and OR, though not XOR, which requires a hidden layer, and thus it is usually included in the network architecture.) These 9 nodes are fed into a single hidden layer of 3 nodes, which, along with a bias node, are fed into the output node. The network is shown here:

Figure 2. Simple neural network with one hidden layer comprising three nodes. Values for the eight predictors are presented at the left, the output of those nodes are multiplied by weights (determined by the modeling process) and sent to the hidden layer. The values shown are the weights determined using the training data (see below).

There are (at least) two ways that a neural network representation can model nonlinear behavior. First, every node from a given layer is connected to every node of the next layer. These connections are multiplied by weights before feeding into the next node, where the weights are determined in the modeling process. These cross connections can model interactions between factors that the linear model does not. In addition, a typical neural network node uses a nonlinear function to determine the node output from the node inputs. These functions are often called activation functions, another reference to organic neuron behavior. A common activation function is the sigmoid or logistic function. 
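As an illustration (a minimal Python sketch with made-up weights, not the article's model), a single node computes a weighted sum of its inputs plus the bias and passes it through the activation function:

```python
import numpy as np

def sigmoid(z):
    """Logistic activation: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A single node: weighted sum of inputs plus bias, then the nonlinearity.
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.4, 0.3, -0.2])   # determined by training in practice
bias = 0.1
output = sigmoid(inputs @ weights + bias)
```

It is this nonlinearity at every node, combined with the cross connections between layers, that lets the network model behavior a linear regression cannot.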


The two cases were run as indicated, with the results summarized in the following charts. In each case, the model was trained using the training data, excluding the last 90 days to be used as test data. Predictions were then made out to 90 days in the future from the end of the training data.

Figure 3. Training and test data overlaid with model predictions. The linear model exhibits short term features not representative of the data. The nonlinear model appears to perform better.

Figure 4. The same data only over the test/prediction range.

The linear model produces a stair-stepped output. The spikes are exaggerated relative to the original data. The nonlinear model appears to do a better job. On the right, the chart is zoomed in on only the prediction period, and the differences in the model performance are clearer.

To read the full original article click here. For more linear model related articles on DSC click here.

PCA with Rubner-Tavan Networks

This article is written by Giuseppe Bonaccorso.

One of the most interesting effects of PCA (Principal Component Analysis) is to decorrelate the input covariance matrix C by computing the eigenvectors and operating a base change using a matrix V: Cpca = V^T C V.

The eigenvectors are sorted in descending order of the corresponding eigenvalues, so Cpca is a diagonal matrix whose non-null elements are λ1 >= λ2 >= … >= λn. By selecting the top p eigenvalues, it’s possible to operate a dimensionality reduction by projecting the samples into the new sub-space determined by the top p eigenvectors (Gram-Schmidt orthonormalization can be used if they don’t have unit length). The standard PCA procedure works with a bottom-up approach, obtaining the decorrelation of C as a final effect; however, it’s possible to employ neural networks, imposing this condition as an optimization step. One of the most effective models has been proposed by Rubner and Tavan (and is named after them). Its generic structure is:
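The decorrelation effect described above can be verified in a few lines of NumPy (a generic sketch on synthetic data, not the author's code): after the base change, the covariance of the projected samples is diagonal, with the eigenvalues on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated 2-D data (mixing matrix introduces correlation between columns).
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
X -= X.mean(axis=0)

C = np.cov(X, rowvar=False)            # input covariance matrix
eigvals, V = np.linalg.eigh(C)         # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]      # sort descending by eigenvalue
eigvals, V = eigvals[order], V[:, order]

X_pca = X @ V                          # base change
C_pca = np.cov(X_pca, rowvar=False)    # decorrelated: (near-)diagonal
```

Keeping only the first column of `V` would give the dimensionality reduction mentioned above.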

where we suppose that the output dimensionality is much smaller than the input dimensionality (the network performs a dimensionality reduction). The output of the network can be computed as y(t+1) = Wx + Vy(t):

where V (n × n) is a lower-triangular matrix with all diagonal elements set to 0 and W has shape (n × m). Moreover, it’s necessary to store y(t) in order to compute y(t+1). This procedure must be repeated until the output vector stabilizes. In general, after k < 10 iterations the modifications fall under a threshold of 0.0001; however, it’s important to check this value in every real application.

The training process is managed with two update rules:

The first rule is Hebbian, based on Oja’s rule, while the second is anti-Hebbian, because its purpose is to reduce the correlation between output units. In fact, without the normalization factor, the lateral update has the form Δv(i,k) = -αy(i)y(k), which reduces the synaptic weight when two output units are correlated (same sign).
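Since the original update equations are rendered as images, the following NumPy sketch reconstructs, as an assumption, the commonly cited form of the Rubner-Tavan rules: an Oja-like Hebbian update for the feed-forward weights W and an anti-Hebbian update for the lateral, strictly lower-triangular matrix V.

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_out = 4, 2          # 4 inputs, 2 outputs: a dimensionality reduction
W = rng.normal(scale=0.1, size=(n_out, n_in))   # feed-forward weights
V = np.zeros((n_out, n_out))                    # lateral weights, strictly lower-triangular
alpha = 0.01

def output(x, W, V, n_iter=10):
    # Iterate y(t+1) = W x + V y(t) until (approximately) stable.
    y = np.zeros(n_out)
    for _ in range(n_iter):
        y = W @ x + V @ y
    return y

X = rng.normal(size=(2000, n_in))
for x in X:
    y = output(x, W, V)
    # Hebbian (Oja-like) rule for W, anti-Hebbian rule for V.
    W += alpha * (np.outer(y, x) - (y ** 2)[:, None] * W)
    V += -alpha * np.outer(y, y)
    V = np.tril(V, k=-1)    # keep V strictly lower-triangular, zero diagonal
```

This is a didactic reconstruction under the stated assumptions, not the author's implementation; learning-rate schedules and convergence checks are omitted for brevity.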

To read the full original article, with source code, and access other articles by the same author, click here. For more PCA related articles on DSC click here.

Time Series Analysis with Generalized Additive Models

This article is by  

Whenever you spot a trend plotted against time, you are looking at a time series. The de facto choice for studying financial market performance and weather forecasts, time series are among the most pervasive analysis techniques because of their inextricable relation to time—we are always interested in foretelling the future.

Temporal Dependent Models

One intuitive way to make forecasts would be to refer to recent time points. Today’s stock prices would likely be more similar to yesterday’s prices than those from five years ago. Hence, we would give more weight to recent than to older prices in predicting today’s price. These correlations between past and present values demonstrate temporal dependence, which forms the basis of a popular time series analysis technique called ARIMA (Autoregressive Integrated Moving Average). ARIMA accounts for both seasonal variability and one-off ‘shocks’ in the past to make future predictions.

However, ARIMA makes rigid assumptions. To use ARIMA, trends should have regular periods, as well as constant mean and variance. If, for instance, we would like to analyze an increasing trend, we have to first apply a transformation to the trend so that it is no longer increasing but stationary. Moreover, ARIMA cannot work if we have missing data.

To avoid having to squeeze our data into a mould, we could consider an alternative such as neural networks. Long short-term memory (LSTM) networks are a type of neural network that builds models based on temporal dependence. While highly accurate, neural networks suffer from a lack of interpretability—it is difficult to identify the model components that lead to specific predictions.

Generalized Additive Models

Besides using correlations between values from similar time points, we could take a step back and model overall trends. A time series can be seen as a summation of individual trends. Take, for instance, Google search trends for persimmons, a type of fruit.

From Figure 1, we can infer that persimmons are probably seasonal. With supply peaking in November, grocery shoppers might be prompted to google nutrition facts or recipes for persimmons.

Figure 1. Seasonal trend in Google searches for ‘persimmon’

Moreover, Google searches for persimmons have also grown more frequent over the years.

Figure 2. Overall growth trend in Google searches for ‘persimmon’

Therefore, Google search trends for persimmons could well be modeled by adding a seasonal trend to an increasing growth trend, in what is called a generalized additive model (GAM).

The principle behind GAMs is similar to that of regression, except that instead of summing effects of individual predictors, GAMs are a sum of smooth functions. Functions allow us to model more complex patterns, and they can be averaged to obtain smoothed curves that are more generalizable.

Because GAMs are based on functions rather than variables, they are not restricted by the linearity assumption in regression that requires predictor and outcome variables to move in a straight line. Furthermore, unlike in neural networks, we can isolate and study effects of individual functions in a GAM on resulting predictions.
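As a toy illustration of the additive idea (a NumPy sketch on synthetic data, using a hand-picked linear-plus-sinusoidal basis rather than the smooth functions a real GAM library would estimate):

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.arange(240)  # e.g. months of search-interest data

# Synthetic series: growth trend + annual seasonal component + noise.
series = 0.05 * t + 2.0 * np.sin(2 * np.pi * t / 12) \
         + rng.normal(scale=0.3, size=t.size)

# Additive fit: model the series as trend(t) + season(t), here with a
# linear trend basis and sine/cosine seasonal basis, solved by least squares.
basis = np.column_stack([
    np.ones_like(t, dtype=float),   # intercept
    t.astype(float),                # growth trend
    np.sin(2 * np.pi * t / 12),     # seasonal terms
    np.cos(2 * np.pi * t / 12),
])
coefs, *_ = np.linalg.lstsq(basis, series, rcond=None)
trend = coefs[0] + coefs[1] * t
season = basis[:, 2:] @ coefs[2:]
```

A real GAM replaces the fixed sinusoidal columns with flexible smoothers (e.g. splines) fitted by backfitting, but the decomposition into separately inspectable components is the same.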

In this tutorial, we will:

  1. See an example of how GAM is used: saving daylight time.
  2. Learn how functions in a GAM are identified through backfitting.
  3. Learn how to validate a time series model.

To read the full original article click here. For more time series related articles on DSC click here.

Implementing a Neural Network from Scratch in Python – An Introduction

This article was written by Denny Britz.

In this post we will implement a simple 3-layer neural network from scratch. We won’t derive all the math that’s required, but I will try to give an intuitive explanation of what we are doing. I will also point to resources for you to read up on the details.

Here I’m assuming that you are familiar with basic calculus and machine learning concepts, e.g. you know what classification and regularization are. Ideally you also know a bit about how optimization techniques like gradient descent work. But even if you’re not familiar with any of the above, this post could still turn out to be interesting 😉

But why implement a Neural Network from scratch at all? Even if you plan on using Neural Network libraries like PyBrain in the future, implementing a network from scratch at least once is an extremely valuable exercise. It helps you gain an understanding of how neural networks work, and that is essential for designing effective models.

One thing to note is that the code examples here aren’t terribly efficient. They are meant to be easy to understand. In an upcoming post I will explore how to write an efficient Neural Network implementation using Theano.

Generating a dataset

Let’s start by generating a dataset we can play with. Fortunately, scikit-learn has some useful dataset generators, so we don’t need to write the code ourselves. We will go with the make_moons function.

# Generate a dataset and plot it
import sklearn.datasets
import matplotlib.pyplot as plt

X, y = sklearn.datasets.make_moons(200, noise=0.20)
plt.scatter(X[:,0], X[:,1], s=40, c=y, cmap=plt.cm.Spectral)

The dataset we generated has two classes, plotted as red and blue points. You can think of the blue dots as male patients and the red dots as female patients, with the x- and y- axis being medical measurements.

Our goal is to train a machine learning classifier that predicts the correct class (male or female) given the x- and y-coordinates. Note that the data is not linearly separable: we can’t draw a straight line that separates the two classes. This means that linear classifiers, such as logistic regression, won’t be able to fit the data unless you hand-engineer non-linear features (such as polynomials) that work well for the given dataset.

In fact, that’s one of the major advantages of Neural Networks. You don’t need to worry about feature engineering. The hidden layer of a neural network will learn features for you.

Logistic Regression:

To demonstrate the point, let’s train a logistic regression classifier. Its input will be the x- and y-values and its output the predicted class (0 or 1). To make our life easy we use the logistic regression class from scikit-learn.

# Train the logistic regression classifier
clf = sklearn.linear_model.LogisticRegressionCV()
clf.fit(X, y)
# Plot the decision boundary
plot_decision_boundary(lambda x: clf.predict(x))
plt.title("Logistic Regression")

The graph shows the decision boundary learned by our logistic regression classifier. It separates the data as well as it can using a straight line, but it’s unable to capture the “moon shape” of our data.

Training a Neural Network:

Let’s now build a 3-layer neural network with one input layer, one hidden layer, and one output layer. The number of nodes in the input layer is determined by the dimensionality of our data, 2. Similarly, the number of nodes in the output layer is determined by the number of classes we have, also 2. (Because we only have 2 classes we could actually get away with only one output node predicting 0 or 1, but having 2 makes it easier to extend the network to more classes later on). The input to the network will be x- and y- coordinates and its output will be two probabilities, one for class 0 (“female”) and one for class 1 (“male”). It looks something like this:

We can choose the dimensionality (the number of nodes) of the hidden layer. The more nodes we put into the hidden layer, the more complex the functions we will be able to fit. But higher dimensionality comes at a cost. First, more computation is required to make predictions and learn the network parameters. A bigger number of parameters also means we become more prone to overfitting our data.

How do we choose the size of the hidden layer? While there are some general guidelines and recommendations, it always depends on your specific problem and is more of an art than a science. We will play with the number of nodes in the hidden layer later on and see how it affects our output.

We also need to pick an activation function for our hidden layer. The activation function transforms the inputs of the layer into its outputs. A nonlinear activation function is what allows us to fit nonlinear hypotheses. Common choices for activation functions are tanh, the sigmoid function, or ReLUs. We will use tanh, which performs quite well in many scenarios. A nice property of these functions is that their derivative can be computed from the original function value. For example, the derivative of tanh x is 1 - tanh² x. This is useful because it allows us to compute tanh x once and re-use its value later on to get the derivative.
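The derivative property is easy to check numerically (a small sketch, not part of the original post):

```python
import numpy as np

def tanh_prime(z):
    # Derivative of tanh expressed via the function value itself:
    # d/dz tanh(z) = 1 - tanh(z)**2, so tanh(z) can be cached and reused.
    t = np.tanh(z)
    return 1.0 - t ** 2

# Compare against a central finite-difference approximation.
z = np.linspace(-3, 3, 7)
h = 1e-6
numeric = (np.tanh(z + h) - np.tanh(z - h)) / (2 * h)
```

This caching trick is exactly what the backward pass exploits later on.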

Because we want our network to output probabilities the activation function for the output layer will be the softmax, which is simply a way to convert raw scores to probabilities. If you’re familiar with the logistic function you can think of softmax as its generalization to multiple classes.
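A minimal softmax implementation (a standard sketch, with the usual max-subtraction trick for numerical stability) looks like this:

```python
import numpy as np

def softmax(scores):
    # Shift by the max for numerical stability, then exponentiate and normalize
    # so the outputs are positive and sum to 1: a probability distribution.
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

raw_scores = np.array([2.0, 1.0, 0.1])
probs = softmax(raw_scores)
```

Note that softmax preserves the ordering of the raw scores: the largest score gets the largest probability.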

How our network makes predictions:

Our network makes predictions using forward propagation, which is just a bunch of matrix multiplications and the application of the activation function(s) we defined above. If x is the 2-dimensional input to our network then we calculate our prediction $\hat{y}$ (also two-dimensional) as follows:

$$\begin{aligned} z_1 &= x W_1 + b_1 \\ a_1 &= \tanh(z_1) \\ z_2 &= a_1 W_2 + b_2 \\ a_2 &= \hat{y} = \mathrm{softmax}(z_2) \end{aligned}$$

$z_i$ is the input of layer $i$ and $a_i$ is the output of layer $i$ after applying the activation function. $W_1, b_1, W_2, b_2$ are parameters of our network, which we need to learn from our training data. You can think of them as matrices transforming data between layers of the network. Looking at the matrix multiplications above we can figure out the dimensionality of these matrices. If we use 500 nodes for our hidden layer then $W_1 \in \mathbb{R}^{2\times 500}$, $b_1 \in \mathbb{R}^{500}$, $W_2 \in \mathbb{R}^{500\times 2}$, $b_2 \in \mathbb{R}^{2}$. Now you see why we have more parameters if we increase the size of the hidden layer.
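Putting the four equations together, a forward pass can be sketched in NumPy as follows (randomly initialized parameters, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n_input, n_hidden, n_output = 2, 500, 2

# Randomly initialized parameters (learned from data in practice).
W1 = rng.normal(scale=0.1, size=(n_input, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_output))
b2 = np.zeros(n_output)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(x):
    z1 = x @ W1 + b1          # input to the hidden layer
    a1 = np.tanh(z1)          # hidden-layer output
    z2 = a1 @ W2 + b2         # input to the output layer
    return softmax(z2)        # a2 = y_hat, two class probabilities

y_hat = forward(np.array([0.5, -1.0]))
```

With untrained weights the two probabilities are near 0.5 each; training is what pushes them toward the correct class.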

Learning the Parameters

Learning the parameters for our network means finding parameters ($W_1, b_1, W_2, b_2$) that minimize the error on our training data. But how do we define the error? We call the function that measures our error the loss function. A common choice with the softmax output is the categorical cross-entropy loss (also known as negative log likelihood). If we have N training examples and C classes then the loss for our prediction $\hat{y}$ with respect to the true labels $y$ is given by:

$$L(y,\hat{y}) = -\frac{1}{N} \sum_{n \in N} \sum_{i \in C} y_{n,i} \log \hat{y}_{n,i}$$

The formula looks complicated, but all it really does is sum over our training examples and add to the loss if we predicted the incorrect class. The further away the two probability distributions $y$ (the correct labels) and $\hat{y}$ (our predictions) are, the greater our loss will be. By finding parameters that minimize the loss we maximize the likelihood of our training data.
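The loss can be written in a few lines of NumPy (a sketch with a tiny hand-made example, not the article's code):

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    # Mean negative log likelihood over N examples; y_true is one-hot (N x C),
    # y_pred holds predicted class probabilities (N x C).
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

y_true = np.array([[1, 0], [0, 1]])        # correct labels, one-hot encoded
good = np.array([[0.9, 0.1], [0.2, 0.8]])  # confident, mostly right predictions
bad = np.array([[0.3, 0.7], [0.6, 0.4]])   # mostly wrong predictions
```

As the text says: the further the predicted distribution is from the labels, the larger the loss, so `good` scores lower than `bad`.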

We can use gradient descent to find the minimum and I will implement the most vanilla version of gradient descent, also called batch gradient descent with a fixed learning rate. Variations such as SGD (stochastic gradient descent) or minibatch gradient descent typically perform better in practice. So if you are serious you’ll want to use one of these, and ideally you would also decay the learning rate over time. 

As an input, gradient descent needs the gradients (vectors of derivatives) of the loss function with respect to our parameters: $\frac{\partial L}{\partial W_1}$, $\frac{\partial L}{\partial b_1}$, $\frac{\partial L}{\partial W_2}$, $\frac{\partial L}{\partial b_2}$. To calculate these gradients we use the famous backpropagation algorithm, which is a way to efficiently calculate the gradients starting from the output. I won’t go into detail about how backpropagation works, but there are many excellent explanations floating around the web.
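Without deriving it, here is a compact NumPy sketch of what backpropagation computes for this architecture (the standard softmax-plus-cross-entropy gradients, on tiny random data for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
N, n_input, n_hidden, n_output = 4, 2, 3, 2
X = rng.normal(size=(N, n_input))
y = np.array([0, 1, 1, 0])            # integer class labels

W1 = rng.normal(scale=0.1, size=(n_input, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_output)); b2 = np.zeros(n_output)

# Forward pass, keeping intermediates for the backward pass.
a1 = np.tanh(X @ W1 + b1)
z2 = a1 @ W2 + b2
e = np.exp(z2 - z2.max(axis=1, keepdims=True))
probs = e / e.sum(axis=1, keepdims=True)

# Backward pass: the gradient of softmax + cross-entropy at the output
# is (probs - one_hot(y)); chain it backwards through each layer.
delta2 = probs.copy()
delta2[np.arange(N), y] -= 1
delta2 /= N
dW2 = a1.T @ delta2
db2 = delta2.sum(axis=0)
delta1 = (delta2 @ W2.T) * (1 - a1 ** 2)   # tanh'(z) = 1 - tanh(z)^2
dW1 = X.T @ delta1
db1 = delta1.sum(axis=0)
```

A gradient descent step then subtracts a learning rate times each of these gradients from the corresponding parameter.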

To read the full original article with source code, click here. For more neural network related articles on DSC click here.

Nuts and Bolts of Building Deep Learning Applications: Ng @ NIPS2016

This article was written by .

You might go to a cutting-edge machine learning research conference like NIPS hoping to find some mathematical insight that will help you take your deep learning system’s performance to the next level. Unfortunately, as Andrew Ng reiterated to a live crowd of 1,000+ attendees this past Monday, there is no secret AI equation that will let you escape your machine learning woes. All you need is some rigor, and much of what Ng covered in his remarkable NIPS 2016 presentation, titled “The Nuts and Bolts of Building Applications using Deep Learning,” is not rocket science. Today we’ll dissect the lecture and Ng’s key takeaways. Let’s begin.

Figure 1. Andrew Ng delivers a powerful message at NIPS 2016.
Andrew Ng and the Lecture:
Andrew Ng’s lecture at NIPS 2016 in Barcelona was phenomenal — truly one of the best presentations I have seen in a long time. In a juxtaposition of two influential presentation styles, the CEO-style and the Professor-style, Andrew Ng mesmerized the audience for two hours. Andrew Ng’s wisdom from managing large scale AI projects at Baidu, Google, and Stanford really shows. In his talk, Ng spoke to the audience and discussed one of the key challenges facing most of the NIPS audience — how do you make your deep learning systems better? Rather than showing off new research findings from his cutting-edge projects, Andrew Ng presented a simple recipe for analyzing and debugging today’s large scale systems. With no need for equations, a handful of diagrams, and several checklists, Andrew Ng delivered a two-whiteboards-in-front-of-a-video-camera lecture, something you would expect at a group research meeting. However, Ng made sure not to delve into Research-y areas, which are likely to make your brain fire on all cylinders but to make you and your company very few dollars in the foreseeable future.
Money-making deep learning vs Idea-generating deep learning:
Andrew Ng highlighted the fact that while NIPS is a research conference, many of the newly generated ideas are simply ideas, not yet battle-tested vehicles for converting mathematical acumen into dollars. The bread and butter of money-making deep learning is supervised learning with recurrent neural networks such as LSTMs in second place. Research areas such as Generative Adversarial Networks (GANs), Deep Reinforcement Learning (Deep RL), and just about anything branding itself as unsupervised learning, are simply Research, with a capital R. These ideas are likely to influence the next 10 years of Deep Learning research, so it is wise to focus on publishing and tinkering if you really love such open-ended Research endeavours. Applied deep learning research is much more about taming your problem (understanding the inputs and outputs), casting the problem as a supervised learning problem, and hammering it with ample data and ample experiments.

“It takes surprisingly long time to grok bias and variance deeply, but people that understand bias and variance deeply are often able to drive very rapid progress.” 

–Andrew Ng 

The 5-step method of building better systems:
Most issues in applied deep learning come from a training-data/testing-data mismatch. In some scenarios this issue just doesn’t come up, but you’d be surprised how often applied machine learning projects use training data (which is easy to collect and annotate) that is different from the target application. Andrew Ng’s discussion is centered around the basic idea of the bias-variance tradeoff. You want a classifier with a good ability to fit the data (low bias is good) that also generalizes to unseen examples (low variance is good). Too often, applied machine learning projects running at scale forget this critical dichotomy. Here are the four numbers you should always report:
  • Training set error
  • Testing set error
  • Dev (aka Validation) set error
  • Train-Dev (aka Train-Val) set error
Andrew Ng suggests the following recipe:
Figure 2. Andrew Ng’s “Applied Bias-Variance for Deep Learning Flowchart” for building better deep learning systems.
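As a hypothetical illustration only (the function name, thresholds, and messages are invented here, not Ng's), the flowchart logic can be sketched as a comparison of the four error numbers, each gap pointing to a different remedy:

```python
def diagnose(train_err, train_dev_err, dev_err, test_err,
             target_err=0.01, tol=0.01):
    """Hypothetical sketch of a bias/variance flowchart: compare the four
    reported error numbers and suggest the next step."""
    if train_err - target_err > tol:
        return "high bias: try a bigger model or train longer"
    if train_dev_err - train_err > tol:
        return "high variance: get more data or add regularization"
    if dev_err - train_dev_err > tol:
        return "train/test mismatch: make training data more like the target"
    if test_err - dev_err > tol:
        return "dev set overfitting: get a bigger dev set"
    return "done: errors are close to the target"
```

For example, a model with 15% training error against a 1% target would be flagged as high bias before any variance question is even asked, which is the ordering the flowchart imposes.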
