ETL vs ELT: Considering the Advancement of Data Warehouses


ETL stands for Extract, Transform, Load. It has been the traditional way to manage analytics pipelines for decades. With the advent of modern cloud-based data warehouses, such as BigQuery and Redshift, the traditional concept of ETL is shifting toward ELT, where transformations run right inside the data warehouse. Let’s see why this is happening, what ETL vs ELT means in practice, and what we can expect in the future.

ETL is hard and outdated

ETL arose to solve the problem of providing businesses with clean, ready-to-analyze data: we remove dirty and irrelevant data and transform, enrich, and reshape the rest. An example is sessionization, the process of creating sessions out of raw pageviews and user events.

ETL is complicated, especially the transformation part. It takes at least several months for a small company (fewer than 500 employees) to get up and running. And once the initial transform jobs are implemented, never-ending changes and updates begin, because data always evolves with the business.

The other problem with ETL is that during transformation we reshape data into some specific form. This form usually reduces the data’s resolution and omits data that seems useless at the time or for that particular task. And often, “useless” data later becomes useful. For example, if business users request daily data instead of weekly, you will have to fix your transformation process, reshape the data, and reload it. That would take a few more weeks.

The more data you have, the longer the transformation process is.

The transformation rules are complex, and even if you have only a few terabytes of data, transforming and loading can take hours. Given the time it takes to process a big dataset, your end users will almost never see fresh data: “today” will mean a week ago, and “yesterday” a week and a day ago. Sometimes it takes several weeks or months to ship an update to a rollup.

To summarize some of the cons of ETL:

  • ETL is expensive to implement, especially for small and medium businesses.
  • ETL is expensive to maintain.
  • ETL eliminates access to raw data.
  • ETL is time consuming. Users have to wait for transformation to be finished.

Why have we been doing it for decades?

Prior generations of data warehouses weren’t able to handle the size and complexity of raw data, so we had to transform data before loading and querying it.

The latest advances in database technology have made warehouses much faster and cheaper. Storage keeps getting more affordable, and some data warehouses even price storage and computation separately. BigQuery storage, for example, is cheap enough that you can simply keep all your raw data there.

The most important recent changes in data warehouses, which make the ETL vs ELT comparison possible, are:

  • Optimized for analytical operations. Modern analytical warehouses tend to be columnar and optimized for aggregating and processing huge datasets.
  • Cheap storage. No worries about what to store, you can just dump all your raw data into a warehouse.
  • Cloud based. It scales infinitely and on-demand, so you can get the performance you need the moment you need it.

ETL vs ELT: running transformations in a data warehouse

What exactly happens when we switch the “L” and the “T”? With new, fast data warehouses, some of the transformation can be done at query time. But there are still many cases where heavy calculations would take too long at query time. Instead, you can perform these transformations in the warehouse, but in the background, after loading the data.

Once raw data is loaded into the warehouse, heavy transformations can be performed there. It makes sense to have both real-time and background transformations in a BI platform: users consume and operate at the level of business definitions when querying data, and BI either performs the transformation on the fly or queries data already transformed in the background.

This approach gives flexibility and agility for development of a transformation layer.

Software engineers nowadays deploy several times a day and praise continuous delivery. The same principle should be applied to transformations: if a metric definition changes or new data is required, one can easily make these changes in hours, not weeks or months. This is especially valuable for fast-growing startups, where changes happen daily and data teams have to stay flexible and agile to keep up with product development and business needs.

As data warehouses continue to advance, I’m sure query-time transformations will eventually replace background transformations entirely. Until then, we can run some transformations in the background with ELT. Since they are already SQL-based and run in the data warehouse, the final switch will be easy and painless.


Event Analytics: How to Define User Sessions with SQL


Quite recently we built event analytics for our team and thought we’d share the experience in this post.

Many “out-of-the-box” analytics solutions come with automatically defined user sessions. That’s good to start with, but as your company grows, you’ll want your own session definitions based on your event data. Analyzing user sessions with SQL gives you flexibility and full control over how metrics are defined for your unique business.

What is a session and why should I care?

A session is usually defined as a group of interactions one user performs within a given time frame in your app; that time frame commonly defaults to 30 minutes. One session includes all the events a user completes in your app before leaving, for example, visiting pages, downloading materials, and performing actions.


Having sessions in place, we’ll be able to answer questions like:

  • How much time did users spend on the app?
  • What is the bounce rate?
  • What are the most used areas of the app?
  • Where are the users who make target actions coming from?
  • Where are users spending most of their time?

Defining user sessions right

To define a user session, we need an events table in our database with at least user_id and timestamp columns.

Usually, you will have a lot of additional information in the event tables, such as event type, name, device info, referrer, and much more. All these properties are going to be very useful to give more context to our sessions and to build things, such as an attribution model.

Note: In this post we’re going to use window functions, so the following examples will not work with MySQL. All queries are written for Amazon Redshift.

Step 1

The first step in defining user sessions with SQL is to find the difference between the timestamp of the current row and the previous one, per user_id. We use the LAG window function to accomplish this, which gives us the inactivity time between events.

SELECT user_id
, timestamp
, DATEDIFF(minutes, LAG(timestamp) OVER (PARTITION BY user_id ORDER BY timestamp), timestamp) AS inactivity_time
FROM events

inactivity_time for a user’s first event is NULL, since there is no previous event to compare it to.

Step 2

We can use inactivity_time to group events into sessions based on 30-minute intervals of inactivity. First, we’ll select all events where inactivity_time is either NULL or more than 30 minutes, which means the event is the first in its session.

Based on this first event, we define session_start_at, which is the timestamp of the first event. We use the ROW_NUMBER function to calculate the session sequence number, which is used in session_id.


SELECT
  event.user_id || '-' || ROW_NUMBER() OVER (PARTITION BY event.user_id ORDER BY event.timestamp) AS session_id
, event.user_id
, event.timestamp AS session_start_at
, LEAD(event.timestamp) OVER (PARTITION BY event.user_id ORDER BY event.timestamp) AS next_session_start_at
FROM (
  SELECT e.user_id
  , e.timestamp
  , DATEDIFF(minutes, LAG(e.timestamp) OVER (PARTITION BY e.user_id ORDER BY e.timestamp), e.timestamp) AS inactivity_time
  FROM events AS e
) AS event
WHERE (event.inactivity_time > 30 OR event.inactivity_time IS NULL)

We can save this table as a sessions data mart to use in our future queries.

Once we have this table, it’s easy to answer the user analytics questions we outlined at the beginning. For example, to calculate the average session duration we can use the following SQL.

SELECT COUNT(*) AS sessions_count
, AVG(duration) AS average_session_duration
FROM (
  SELECT sessions.session_id
  , DATEDIFF(minutes, MIN(events.timestamp), MAX(events.timestamp)) AS duration
  FROM sessions
  LEFT JOIN events ON events.user_id = sessions.user_id
    AND events.timestamp >= sessions.session_start_at
    AND (events.timestamp < sessions.next_session_start_at OR sessions.next_session_start_at IS NULL)
  GROUP BY sessions.session_id
) AS session_durations

As you see, we join the events table to sessions to map every event to its session. This allows us to get the end of each session, which is the max timestamp of the events within the given session. More complex calculations of session duration could optionally include the window of inactivity as well.
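For intuition, the same 30-minute gap rule can be sketched outside the warehouse. Here is a minimal Python version mirroring the logic of the queries above; the `sessionize` helper and the toy event tuples are mine, not part of the original post:

```python
from datetime import datetime, timedelta

def sessionize(events, gap_minutes=30):
    """Group (user_id, timestamp) events into sessions: a new session
    starts whenever a user is inactive for more than gap_minutes."""
    sessions = {}  # user_id -> list of sessions, each a list of timestamps
    for user_id, ts in sorted(events):
        user_sessions = sessions.setdefault(user_id, [])
        if user_sessions and ts - user_sessions[-1][-1] <= timedelta(minutes=gap_minutes):
            user_sessions[-1].append(ts)   # within the gap: same session
        else:
            user_sessions.append([ts])     # inactivity gap exceeded: new session
    return sessions

# Toy events for one user, with a 45-minute gap in the middle
events = [
    ("u1", datetime(2023, 1, 1, 10, 0)),
    ("u1", datetime(2023, 1, 1, 10, 10)),
    ("u1", datetime(2023, 1, 1, 10, 55)),  # 45 minutes after the previous event
]
print(len(sessionize(events)["u1"]))  # 2 sessions
```

The 45-minute gap exceeds the threshold, so the third event opens a second session, exactly as the `inactivity_time > 30` filter does in SQL.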



Using SQL to Estimate Customer Lifetime Value (LTV) without Machine Learning


The Statsbot team estimated LTV 592 times for different clients and business models. 

Customer lifetime value, or LTV, is the amount of money that a customer will spend with your business in their “lifetime,” or at least in the portion of it they spend in a relationship with you. It’s an important indicator of how much you can spend on acquiring new customers. For example, if your customer acquisition cost (CAC) is $150 and LTV is $600, you can afford to increase the budget, acquire more people, and grow your business. The balance between CAC and LTV is a basic check of any business’s market survival.

LTV is a predictive metric: it estimates future purchases based on past patterns and lets you see how much risk you are exposed to as a business and how much you can afford to spend to acquire new clients. At an individual level, it also enables you to figure out who your highest-value customers are likely to be, not just now, but also in the future.

In order to understand how to estimate LTV, it is useful to first think about evaluating a customer’s lifetime value at the end of their relationship with us. So, say a customer stays with us for 12 months, and spends $50 per month.

The revenue that they generated for our business over their lifetime is then $50*12 = $600.  Simple! We can consider the basic definition of LTV as a sum of payments from a specific user.

This same principle applies to the group. If we want to see the average LTV for the group we can look at total spend divided by the number of customers. When we’re talking about estimating LTV for the group, or predictive LTV, we need to take into account how long customers stay with us. To get that, we actually look at it “backwards”: we look at how many customers we lose over time, or the  churn rate.

How do you calculate LTV for SaaS?

At a group level, the basic formula for estimating LTV is this:

LTV = ARPU / churn rate

Where ARPU is the average monthly recurring revenue per user and the churn rate is the rate at which we are losing customers (the inverse of retention).

This basic formula follows from the assumption:

Next Month Revenue = (Current Month Revenue) * (1 – Churn Rate)

Note: When we’re estimating customer lifetime value for SaaS we can neglect Gross Margin, because costs are minor and don’t affect the accuracy of a result. But when we calculate predictive LTV for ecommerce later in this article, we’ll include COGS in our formula.

The main limitation of the LTV formula above is that it assumes churn is constant over time, as in: we are as likely to lose a customer between the first and second months of membership as we are to lose them much later on. Going deeper into the nature of predictive LTV, it is really the sum of a geometric series, and the real retention curve is not the straight line shown in many articles about LTV.

In fact, we know that constant churn is usually not the case.

In a flexible subscription model, we lose many people at the very beginning, when they are “testing out” a service, but once they have been with us for a long time, they are less likely to leave.

Ultimately, it depends on the type of contract that exists between customers and the business: for example, annual renewals, where churn is more linear, will result in LTV that is very close to the formula above.

Services which do not have any contracts may lose a high percentage of their new customers, but then churn may slow down.

We can think of this concept graphically:

If the LTV of the group is the area under the line, we can very clearly see that the rate at which we lose customers will impact our LTV estimates very significantly. So we will need to take this into account when we are making our calculations. For a first estimate of LTV, however, it makes sense to go with the simplest formula. After that, we will add levels of complexity.

Extracting ARPU and churn using SQL

In order to make the most basic estimate of LTV, we need to look at our transaction history. Then, we can establish the average revenue per customer as well as the churn rate over the period that we are looking at. For simplicity, I’m going to look at the last year.

You can calculate ARPU in two steps: first aggregate monthly revenue per customer, then average it per month:

WITH monthly_revenue AS (
  SELECT cust_id,
    DATEDIFF(month, '2010-01-01', transaction_date) AS visit_month,
    SUM(transaction_size) AS revenue
  FROM transactions
  WHERE transaction_date > DATEADD('year', -1, CURRENT_DATE)
  GROUP BY 1, 2
),

month_ARPU AS (
  SELECT visit_month,
    AVG(revenue) AS ARPU
  FROM monthly_revenue
  GROUP BY visit_month
)

SELECT *
FROM month_ARPU
ORDER BY visit_month

In the case above, the query gives us an average monthly spend (ARPU) of $987.33.

Calculating churn rate is a bit more complicated, as we need the percentage of people not returning from one month to the next, taking each group of customers according to the month of their first visit, and then checking if they came back or not in the following month.

The problem is always that, in a transactional database, we have customers’ visits on separate lines, rather than all on the same line.

The way to fix that problem is to join the transactional database to itself, so that we can see a customer’s behavior on one single line.

In order to isolate those who churned, we take the visits from month 1, and left join the visits from month 2 on the cust_id. The lines where visits from month 2 have a cust_id that is null are the ones where the customer has not returned.

WITH monthly_visits AS (
  SELECT DISTINCT
    DATEDIFF(month, '2010-01-01', transaction_date) AS visit_month,
    cust_id
  FROM transactions
  WHERE transaction_date > DATEADD('year', -1, CURRENT_DATE)
),

cust_status AS (
  SELECT
    past_month.visit_month + 1 AS current_month,
    past_month.cust_id,
    CASE
      WHEN this_month.cust_id IS NULL THEN 'churn'
      ELSE 'retained'
    END AS cust_type
  FROM monthly_visits past_month
  LEFT JOIN monthly_visits this_month
    ON this_month.cust_id = past_month.cust_id
    AND this_month.visit_month = past_month.visit_month + 1
)

SELECT
  current_month,
  COUNT(CASE WHEN cust_type = 'churn' THEN 1 END)::float
    / COUNT(cust_id) AS churn_rate
FROM cust_status
GROUP BY current_month

Say this gives us a churn rate of 0.1, just for simplicity.

It is then a simple calculation to estimate LTV: we have monthly ARPU and monthly churn, so we just divide one by the other:

$987.33 / 0.1 = $9,873.30
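Since the formula is the closed form of a geometric series, a quick Python sketch with the numbers above can sanity-check the division:

```python
arpu = 987.33   # average monthly revenue per user, from the query above
churn = 0.1     # monthly churn rate

# Each month the surviving fraction of the cohort shrinks by (1 - churn),
# so expected lifetime revenue is arpu * sum_{t>=0} (1 - churn)^t = arpu / churn
ltv_series = sum(arpu * (1 - churn) ** t for t in range(10_000))
ltv_formula = arpu / churn

print(round(ltv_series, 2), round(ltv_formula, 2))  # both ≈ 9873.3
```

Summing the series term by term converges to the same value the closed-form division gives.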

As stated earlier, there are limits to this formula, mostly because it makes a series of assumptions that may not hold in the real world. The main one is that retention and churn rates are stable both across cohorts and across time.

Stability across cohorts implies that early adopters of your service act in similar ways to late adopters, while stability across time implies that customers’ likelihood of churning out is the same at the beginning of their relationship with you as it is, for example, 2 years in. Depending on how close to the truth these assumptions are, you may need to revise your LTV estimate downwards.


Machine Learning Algorithms: Which One to Choose for Your Problem


When I was beginning my way in data science, I often faced the problem of choosing the most appropriate algorithm for my specific problem. If you’re like me, when you open some article about machine learning algorithms, you see dozens of detailed descriptions. The paradox is that they don’t ease the choice.

In this article, I will try to explain basic concepts and give some intuition of using different kinds of machine learning algorithms in different tasks. At the end of the article, you’ll find the structured overview of the main features of described algorithms.

First of all, you should distinguish 4 types of Machine Learning tasks:

  • Supervised learning
  • Unsupervised learning
  • Semi-supervised learning
  • Reinforcement learning

Supervised learning

Supervised learning is the task of inferring a function from labeled training data. By fitting to the labeled training set, we want to find optimal model parameters to predict unknown labels on other objects (the test set). If the label is a real number, we call the task regression. If the label comes from a limited set of unordered values, it’s classification.


Unsupervised learning

In unsupervised learning we have less information about objects; in particular, the training set is unlabeled. What is our goal now? It’s possible to observe similarities between groups of objects and assign them to appropriate clusters. Some objects can differ hugely from all clusters, in which case we treat them as anomalies.


Semi-supervised learning

Semi-supervised learning tasks include both problems we described earlier: they use labeled and unlabeled data together. That is a great opportunity for those who can’t afford to label all of their data. The method allows us to significantly improve accuracy, because we can use unlabeled data in the training set alongside a small amount of labeled data.


Reinforcement learning

Reinforcement learning is not like any of our previous tasks because we don’t have labeled or unlabeled datasets here. RL is an area of machine learning concerned with how software agents ought to take actions in some environment to maximize some notion of cumulative reward.


Imagine you’re a robot in some strange place: you can perform actions and get rewards from the environment for them. After each action your behavior gets more complex and clever, so you train yourself to behave in the most effective way at each step. In biology, this is called adaptation to the natural environment.

Commonly used Machine Learning algorithms

Now that we have some intuition about types of machine learning tasks, let’s explore the most popular algorithms with their applications in real life.

Linear Regression and Linear Classifier

These are probably the simplest algorithms in machine learning. You have features x1, …, xn of objects (matrix A) and labels (vector b). Your goal is to find the optimal weights w1, …, wn and bias for these features according to some loss function, for example, MSE or MAE for a regression problem. In the case of MSE there is a closed-form solution from the least squares method:

w = (A^T A)^(-1) A^T b

In practice, it’s easier to optimize with gradient descent, which is much more computationally efficient. Despite the simplicity of this algorithm, it works pretty well when you have thousands of features, for example, bag of words or n-grams in text analysis. More complex algorithms suffer from overfitting when there are many features and the dataset isn’t huge, while linear regression provides decent quality.
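To make the gradient-descent idea concrete, here is a minimal sketch on a toy one-feature dataset (the variable names like `grad_w` are mine, not from the article) that fits y = 2x + 1 by minimizing MSE:

```python
# Toy data generated from y = 2*x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b, lr = 0.0, 0.0, 0.02   # initial weight, bias, and learning rate
n = len(xs)
for _ in range(5000):
    # MSE gradients: dL/dw = (2/n) * sum((w*x + b - y) * x),
    #                dL/db = (2/n) * sum(w*x + b - y)
    grad_w = 2 / n * sum((w * x + b - y) * x for x, y in zip(xs, ys))
    grad_b = 2 / n * sum(w * x + b - y for x, y in zip(xs, ys))
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # recovers roughly 2.0 and 1.0
```

Each step moves the weights a little way down the loss surface; with a small enough learning rate the loop converges to the least-squares solution.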


To prevent overfitting we often use regularization techniques like lasso and ridge. The idea is to add the sum of absolute values of the weights or the sum of squares of the weights, respectively, to our loss function. Read the great tutorial on these algorithms at the end of the article.

Logistic regression

Don’t confuse this classification algorithm with regression methods just because “regression” appears in its name. Logistic regression performs binary classification, so the label outputs are binary. Let’s define P(y=1|x) as the conditional probability that the output y is 1 given the input feature vector x. The coefficients w are the weights the model wants to learn.

Since this algorithm calculates the probability of belonging to each class, you should take into account how far the probability is from 0 or 1, and average it over all objects as we did with linear regression. The resulting loss function is the average of cross-entropies:

L = -(1/N) * sum( y * log(y_pred) + (1 - y) * log(1 - y_pred) )

Don’t panic, I’ll make it easy for you. Let y be the right answer, 0 or 1, and y_pred the predicted probability. If y equals 0, the first term under the sum vanishes, and the second term is smaller the closer y_pred is to 0, by the properties of the logarithm. The case where y equals 1 is symmetric.

What is great about logistic regression? It takes a linear combination of features and applies a non-linear function (the sigmoid) to it, so it’s the smallest possible instance of a neural network!
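As a rough sketch of this view, the following toy Python code (the helper names are mine) defines the sigmoid and the average cross-entropy, and shows that confident wrong predictions are penalized far more than confident correct ones:

```python
import math

def sigmoid(z):
    # Squashes a linear combination of features into a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y_true, y_pred):
    # Average cross-entropy: -(1/N) * sum(y*log(p) + (1-y)*log(1-p))
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / len(y_true)

print(sigmoid(0.0))  # 0.5: a zero score means "no idea"

# Confident, correct predictions give a small loss...
good = log_loss([1, 0], [0.99, 0.01])
# ...while confident, wrong predictions are penalized heavily
bad = log_loss([1, 0], [0.01, 0.99])
print(good < bad)  # True
```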

Decision trees

Another popular and easy-to-understand algorithm is the decision tree. Its tree structure mirrors a human decision process, which makes it easy to visualize and interpret.

The idea of this algorithm is quite simple. In every node we choose the best split among all features and all possible split points. Each split is selected so as to maximize some criterion. In classification trees we use cross-entropy and the Gini index. In regression trees we minimize the sum of squared errors between the target values of the points that fall into a region and the value we assign to that region.


We apply this procedure recursively for each node and finish when we meet a stopping criterion. Criteria can vary from the minimum number of objects in a leaf to tree height. Single trees are used very rarely, but in composition with many others they build very efficient algorithms, such as Random Forest or Gradient Tree Boosting.
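As a small illustration of the split criterion, here is a toy sketch (the helper names are mine) of the Gini impurity of a node and the weighted impurity of a candidate split:

```python
def gini(labels):
    # Gini impurity of a node: 1 - sum_k p_k^2
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_gini(values, labels, threshold):
    # Weighted impurity of the two children after splitting a feature on threshold
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

values = [1, 2, 3, 10, 11, 12]
labels = [0, 0, 0, 1, 1, 1]
print(round(split_gini(values, labels, 3), 3))   # 0.0 -- a perfect split between the classes
print(round(split_gini(values, labels, 10), 3))  # 0.25 -- a worse split mixes the classes
```

The tree builder simply evaluates this score for every candidate threshold and keeps the one with the lowest weighted impurity.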


K-means

Sometimes you don’t know any labels and your goal is to assign labels according to the features of objects. This is called a clustering task.

Suppose you want to divide all data objects into k clusters. You select k random points from your data and name them cluster centers. The clusters of the other objects are defined by the closest cluster center. Then the centers are recomputed, and the process repeats until convergence.
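A minimal sketch of this loop on toy one-dimensional data (the `kmeans` helper is mine, not a library function):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal 1-D k-means: random initial centers, then alternate
    assignment and center-update steps for a fixed number of iterations."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assignment step: each point goes to its closest center
            i = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[i].append(p)
        # Update step: each center becomes the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two well-separated blobs around 0 and 10
points = [0.0, 0.1, 0.2, 9.9, 10.0, 10.1]
print(kmeans(points, k=2))  # centers near 0.1 and 10.0
```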

This is the clearest clustering technique, but it still has some disadvantages. First, you have to know the number of clusters in advance, which is often impossible. Second, the result depends on the points randomly chosen at the beginning, and the algorithm doesn’t guarantee that we’ll reach the global minimum of the objective.

There is a range of clustering methods with different advantages and disadvantages, which you can learn about in the recommended reading.

Principal component analysis (PCA)

Have you ever prepared for a difficult exam during the last night, or even the last few hours? You have no chance of remembering all the information, so you maximize what you can remember in the time available, for example by learning first the theorems that occur in the most exam questions.

Principal component analysis is based on the same idea. This algorithm provides dimensionality reduction. Sometimes you have a wide range of features, probably highly correlated with each other, and models can easily overfit on such a feature set. Then you can apply PCA.

The goal is to find a small number of orthogonal directions that capture as much of the data’s variance as possible. Surprisingly, these vectors are eigenvectors of the correlation matrix of the features in the dataset.


The algorithm now is clear:

1. We calculate the correlation matrix of feature columns and find eigenvectors of this matrix.

2. We take these multidimensional vectors and calculate the projection of all features on them.

The new features are the coordinates of this projection, and their number depends on how many eigenvectors you project onto.
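As a toy illustration, for two-dimensional data the leading eigenvector of the 2x2 covariance matrix can be found in closed form (the `principal_axis` helper is mine; the article speaks of the correlation matrix, which gives the same axes once features are standardized):

```python
import math

def principal_axis(points):
    """First principal component of 2-D points: the eigenvector of the
    covariance matrix with the largest eigenvalue (closed form for 2x2)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Covariance matrix entries [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Largest eigenvalue via the quadratic formula on the characteristic polynomial
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    lam = tr / 2 + math.sqrt(tr ** 2 / 4 - det)
    # Corresponding (unnormalized) eigenvector: (sxy, lam - sxx)
    vx, vy = sxy, lam - sxx
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

# Points lying exactly on the line y = x: the principal axis is (1, 1) normalized
axis = principal_axis([(0, 0), (1, 1), (2, 2), (3, 3)])
print(axis)  # ≈ (0.707, 0.707)
```

Projecting each point onto this axis yields the first new feature; the second axis is simply the orthogonal direction.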

Neural networks

I already mentioned neural networks when we talked about logistic regression. There are many different architectures, each valuable for specific tasks. Most often, a network is a sequence of layers or components with linear connections among them, followed by nonlinearities.

If you’re working with images, convolutional neural networks show great results. Such networks stack convolutional and pooling layers, which are capable of capturing the characteristic features of images.


For working with texts and sequences, you’d better choose recurrent neural networks. RNNs contain LSTM or GRU modules and can work with sequences whose length isn’t known in advance. Perhaps one of the best-known applications of RNNs is machine translation.


I hope I’ve given you a general picture of the most used machine learning algorithms and some intuition on how to choose one for your specific problem. To make things easier for you, I’ve prepared a structured overview of their main features.

Linear regression and Linear classifier. Despite their apparent simplicity, they are very useful on a huge number of features, where more complex algorithms suffer from overfitting.

Logistic regression is the simplest non-linear classifier, combining a linear combination of parameters with a nonlinear function (the sigmoid) for binary classification.

Decision trees often resemble a human decision process and are easy to interpret. But they are most often used in compositions such as Random Forest or Gradient Boosting.

K-means is a more primitive, but very easy-to-understand, algorithm that can serve as a perfect baseline in a variety of problems.

PCA is a great choice to reduce dimensionality of your feature space with minimum loss of information.

Neural networks are a new era of machine learning algorithms and can be applied to many tasks, but their training requires huge computational resources.



Bayesian Nonparametric Models


Bayesian Nonparametrics is a class of models with a potentially infinite number of parameters. High flexibility and expressive power of this approach enables better data modelling compared to parametric methods.

Bayesian Nonparametrics is used in problems where a dimension of interest grows with data, for example, in problems where the number of features is not fixed but allowed to vary as we observe more data. Another example is clustering where the number of clusters is automatically inferred from data.

Our team asked a data scientist, Vadim Smolyakov, to introduce us to Bayesian Nonparametric models. In this article, he describes the Dirichlet process along with associated models and links to their implementations.

Introduction: Dirichlet process K-means

Bayesian Nonparametrics are a class of models for which the number of parameters grows with data. A simple example is non-parametric K-means clustering [1]. Instead of fixing the number of clusters K, we let data determine the best number of clusters. By letting the number of model parameters (cluster means and covariances) grow with data, we are better able to describe the data as well as generate new data given our model.

Of course, to avoid over-fitting, we penalize the number of clusters K via a regularization parameter lambda, which controls the rate at which new clusters are created. Thus, our new K-means objective becomes:

minimize sum_{k=1..K} sum_{x_i in cluster k} ||x_i - mu_k||^2 + lambda * K

In the figure above, we can see non-parametric clustering, aka Dirichlet-Process (DP) K-means, applied to the Iris dataset. The strength of the regularization parameter lambda (right) controls the number of clusters created. Algorithmically, we create a new cluster every time we discover that a point x_i is sufficiently far away from all the existing cluster means:

The resulting update is an extension of the K-means assignment step: we reassign a point to the cluster with the closest mean, or we start a new cluster if the squared Euclidean distance to every existing mean is greater than lambda. By creating new clusters for data points that are sufficiently far away from the existing ones, we eliminate the need to specify the number of clusters K ahead of time.

Dirichlet process K-means eliminates the need for expensive cross-validation in which we sweep a range of values for K in order to find the optimum point in the objective function. For an implementation of the Dirichlet process K-means algorithm see the following github repo.
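As a rough sketch of the assignment rule described above (my own minimal one-dimensional version, not the repo's implementation):

```python
def dp_means(points, lam, iters=10):
    """DP-means sketch (1-D): assign each point to its nearest center,
    but open a new cluster when the squared distance exceeds lambda."""
    centers = [points[0]]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            d2, i = min(((p - c) ** 2, j) for j, c in enumerate(centers))
            if d2 > lam:
                centers.append(p)    # far from every center -> new cluster
                clusters.append([p])
            else:
                clusters[i].append(p)
        # Update step: each center becomes the mean of its cluster
        centers = [sum(c) / len(c) for c in clusters if c]
    return centers

# Three well-separated groups; lambda small enough to keep them apart
points = [0.0, 0.2, 5.0, 5.2, 10.0, 10.2]
print(len(dp_means(points, lam=4.0)))  # 3 clusters discovered
```

Note how K never appears as an input: the number of clusters emerges from the data and the single parameter lambda.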

Dirichlet process

The Dirichlet process (DP) is a stochastic process used in Bayesian nonparametric models [2]. Each draw from a Dirichlet process is a discrete distribution. For a random distribution G to be distributed according to a DP, its finite dimensional marginal distributions have to be Dirichlet distributed. Let H be a distribution over theta and alpha be a positive real number.

We say that G is a Dirichlet process with base distribution H and concentration parameter alpha if for every finite measurable partition A1, …, Ar of theta we have:

(G(A1), …, G(Ar)) ~ Dir(alpha * H(A1), …, alpha * H(Ar))

Where Dir is the Dirichlet distribution, a distribution over the probability simplex with density proportional to the product of x_i^(a_i - 1).

The Dirichlet distribution can be visualized over a probability simplex as in the figure below. The arguments to the Dirichlet distribution (x1, x2, x3) can be interpreted as pseudo-counts.

For example, in the case of (x1, x2, x3) = (2, 2, 2) the Dirichlet distribution (left) has high probability near the middle, in comparison to the (2, 2, 10) case where it concentrates around one of the corners. In the case of (10, 10, 10) we have more observations, and the Dirichlet distribution concentrates more in the middle (since equal number of counts are observed in this case).
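This pseudo-count intuition is easy to check by sampling. A Dirichlet draw can be generated by normalizing independent Gamma variates, which Python's standard library supports (the `dirichlet_sample` helper is mine):

```python
import random

def dirichlet_sample(alphas, rng):
    # A Dirichlet draw is a vector of Gamma(alpha_i, 1) variates normalized to sum to 1
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(0)
n = 20000

# Pseudo-counts (2, 2, 10): mass concentrates near the third corner
mean = [0.0, 0.0, 0.0]
for _ in range(n):
    draw = dirichlet_sample([2, 2, 10], rng)
    mean = [m + d / n for m, d in zip(mean, draw)]

print([round(m, 2) for m in mean])  # close to [2/14, 2/14, 10/14] ≈ [0.14, 0.14, 0.71]
```

The empirical mean matches the theoretical mean alpha_i / sum(alpha), confirming that larger pseudo-counts pull mass toward their corner of the simplex.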

The base distribution H is the mean of the DP: E[G(A)] = H(A), whereas the concentration parameter is the inverse variance: VAR[G(A)] = H(A)[1-H(A)] / (1+alpha). Thus, the larger the alpha, the smaller the variance and the DP will concentrate more of its mass around the mean as shown in the figure below [3].

Stick-Breaking Construction

We have seen the utility of Bayesian Nonparametric models is in having a potentially infinite number of parameters. We also had a brief encounter with the Dirichlet process that exhibits a clustering property that makes it useful in mixture modeling where the number of components grows with data.

But how do we generate a mixture model with an infinite number of components?

The answer is a stick-breaking construction [4] that represents draws G from DP(alpha, H) as a weighted sum of atoms (or point masses). It is defined as follows:

beta_k ~ Beta(1, alpha)
pi_k = beta_k * prod_{l < k} (1 - beta_l)
theta_k ~ H
G = sum_k pi_k * delta(theta_k)

The mixture model G consists of an infinite number of weights (pi_k) and mixture parameters (theta_k). The weights are generated by first sampling beta_k from a Beta(1, alpha) distribution, where alpha is the concentration parameter, and then computing pi_k as in the expression above, while the mixture parameters theta_k are sampled from the base distribution H. We can visualize the stick-breaking construction as in the figure below:

Notice that we start with a stick of unit length (left) and in each iteration we break off a piece of length pi_k. The length of the piece that we break off is determined by the concentration parameter alpha. For alpha=5 (middle) the stick lengths are longer and as a result there are fewer significant mixture weights. For alpha=10 (right) the stick lengths are shorter and therefore we have more significant components.

Thus, alpha determines the rate of cluster growth in a nonparametric model. In fact, the expected number of clusters grows in proportion to alpha x log(N), where N is the number of data points.
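We can see the effect of alpha directly by simulating the stick-breaking process. The following is a minimal sketch using only the standard library; the truncation level, threshold, and function name are illustrative choices:

```python
import random

def stick_breaking_weights(alpha, truncation=1000, seed=0):
    """Sample mixture weights pi_k via the stick-breaking construction.

    Repeatedly break off a Beta(1, alpha)-distributed fraction of the
    remaining stick; the pieces are the mixture weights of a draw from
    DP(alpha, H).
    """
    rng = random.Random(seed)
    remaining = 1.0
    weights = []
    for _ in range(truncation):
        beta_k = rng.betavariate(1.0, alpha)  # fraction to break off
        weights.append(remaining * beta_k)
        remaining *= 1.0 - beta_k
    return weights

# Larger alpha -> shorter pieces broken off -> more significant components.
for alpha in (1.0, 5.0, 10.0):
    w = stick_breaking_weights(alpha)
    significant = sum(1 for x in w if x > 0.01)
    print(f"alpha={alpha:>4}: {significant} weights above 0.01")
```

Running the sketch shows the count of non-negligible weights growing with alpha, mirroring the figure.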

Dirichlet process mixture model (DPMM)

A Dirichlet process mixture model (DPMM) belongs to a class of infinite mixture models in which we do not impose any prior knowledge on the number of clusters K. DPMM models learn the number of clusters from the data using a nonparametric prior based on the Dirichlet process (DP). Automatic model selection avoids the computational cost of cross-validating the model for multiple values of K. Two equivalent graphical models for a DPMM are shown below:

Here, x_i are observed data points and with each x_i we associate a label z_i that assigns x_i to one of the K clusters. In the left model, the cluster parameters are represented by pi (mixture proportions) and theta (cluster means and covariances) with associated uninformative priors (alpha and lambda).

For ease of computation, conjugate priors are used such as a Dirichlet prior for mixture weights and Normal-Inverse-Wishart prior for a Gaussian component. In the right model, we have a DP representation of DPMM where the mixture distribution G is sampled from a DP (alpha, H) with concentration parameter alpha and base distribution H.

There are many algorithms for learning the Dirichlet process mixture models based on sampling or variational inference. 
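One readily available option is scikit-learn's BayesianGaussianMixture, which approximates a DPMM via variational inference over a truncated stick-breaking prior. A minimal sketch, assuming scikit-learn is installed; the truncation level of 10, the toy data, and the 0.05 weight threshold are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Toy data: three well-separated 2-D Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(100, 2)),
    rng.normal(loc=(4, 4), scale=0.3, size=(100, 2)),
    rng.normal(loc=(0, 4), scale=0.3, size=(100, 2)),
])

# Truncated variational DPMM: up to 10 components, concentration alpha = 1.
dpmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,
    random_state=0,
).fit(X)

# Superfluous components are driven toward near-zero weight,
# so only a few effective clusters remain.
effective = int((dpmm.weights_ > 0.05).sum())
print("effective components:", effective)
```

The key point is that we over-specify the truncation (10 components) and let the DP prior prune the excess, rather than cross-validating over K.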

The figure above shows DPMM clustering results for a Gaussian distribution (left) and a categorical distribution (right). On the left, we can see the ellipses (samples from the posterior mixture distribution) of the DPMM after 100 Gibbs sampling iterations. The DPMM, initialized with 2 clusters and a concentration parameter alpha of 1, learned the true number of clusters, K=5, and concentrated around the cluster centers.

On the right, we can see the clustering results for categorical data; in this case, a DPMM was applied to a collection of NIPS articles. It was initialized with 2 clusters and a concentration parameter alpha of 10. After several Gibbs sampling iterations, it discovered over 20 clusters, the first 4 of which are shown in the figure. The words within each cluster share a similar semantic meaning, and the cluster topics differ across clusters.

Hierarchical Dirichlet process (HDP)

The hierarchical Dirichlet process (HDP) is an extension of DP that models problems involving groups of data especially when there are shared features among the groups. The power of hierarchical models comes from an assumption that the features among groups are drawn from a shared distribution rather than being completely independent. Thus, with hierarchical models we can learn features that are common to all groups in addition to the individual group parameters.

In HDP, each observation within a group is a draw from a mixture model and mixture components are shared between groups. In each group, the number of components is learned from data using a DP prior. The HDP graphical model is summarized in the figure below [5]:

Focusing on the HDP formulation in the figure on the right, we can see that we have J groups, where each group is sampled from a DP: Gj ~ DP(alpha, G0). G0 represents the parameters shared across all groups and is itself modeled as a DP: G0 ~ DP(gamma, H). Thus, we have a hierarchical structure for describing our data.

There exist many ways of inferring the parameters of hierarchical Dirichlet processes. One popular approach that works well in practice and is widely used in the topic modeling community is an online variational inference algorithm [6] implemented in gensim.

The figure above shows the first four topics (as word clouds) for an online variational HDP algorithm used to fit a topic model to the 20newsgroups dataset. The dataset consists of 11,314 documents and over 100K unique tokens. Standard text pre-processing was used, including tokenization, stop-word removal, and stemming. A compressed dictionary of 4K words was constructed by filtering out tokens that appear in fewer than 5 documents or in more than 50% of the corpus.

The top-level truncation was set to T=20 topics and the second level truncation was set to K=8 topics. The concentration parameters were chosen as gamma=1.0 at the top-level and alpha=0.1 at the group level to yield a broad range of shared topics that are concentrated at the group level. We can find topics about autos, politics, and for sale items that correspond to the target labels of the 20newsgroups dataset.

HDP hidden Markov models

The hierarchical Dirichlet process (HDP) can be used to define a prior distribution on transition matrices over countably infinite state spaces. The HDP-HMM is known as an infinite hidden Markov model where the number of states is inferred automatically. The graphical model for HDP-HMM is shown below:

In a nonparametric extension of HMM, we consider a set of DPs, one for each value of the current state. In addition, the DPs must be linked because we want the same set of next states to be reachable from each of the current states. This relates directly to HDP, where the atoms associated with state-conditional DPs are shared.

The HDP-HMM parameters can be described as follows:
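In standard notation (following [5], since the original equation appears as an image), the generative process is:

```latex
\beta \sim \mathrm{GEM}(\gamma), \qquad
\pi_j \mid \beta \sim \mathrm{DP}(\alpha, \beta), \qquad
\theta_j \sim H, \qquad
z_t \mid z_{t-1} \sim \pi_{z_{t-1}}, \qquad
x_t \mid z_t \sim F(\theta_{z_t})
```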

Here, the GEM notation represents stick-breaking. One popular algorithm for computing the posterior distribution of infinite HMMs, called beam sampling, is described in [7].

Dependent Dirichlet process (DDP)

In many applications, we are interested in modelling distributions that evolve over time as seen in temporal and spatial processes. The Dirichlet process assumes that observations are exchangeable and therefore the data points have no inherent ordering that influences their labelling. This assumption is invalid for modelling temporal and spatial processes in which the order of data points plays a critical role in creating meaningful clusters.

The dependent Dirichlet process (DDP), originally formulated by MacEachern, provides a nonparametric prior over evolving mixture models. A construction of the DDP built on the Poisson process [8] led to the development of the DDP mixture model as shown below:

In the graphical model above we see a temporal extension of the DP process in which a DP at time t depends on the DP at time t-1. This time-varying DP prior is capable of describing and generating dynamic clusters with means and covariances changing over time.


In Bayesian Nonparametric models the number of parameters grows with data. This flexibility enables better modeling and generation of data. We focused on the Dirichlet process (DP) and key applications such as DP K-means (DP-means), Dirichlet process mixture models (DPMMs), hierarchical Dirichlet processes (HDPs) applied to topic models and HMMs, and dependent Dirichlet processes (DDPs) applied to time-varying mixtures.

We looked at how to construct nonparametric models using stick-breaking and examined some of the experimental results. To better understand the Bayesian Nonparametric model, I encourage you to read the literature mentioned in the references and experiment with the code linked throughout the article on challenging datasets!


[1] B. Kulis and M. Jordan, "Revisiting k-means: New Algorithms via Bayesian Nonparametrics", ICML, 2012.

[2] E. Sudderth, "Graphical Models for Visual Object Recognition and Tracking", PhD thesis (Ch. 2.5), 2006.

[3] A. Rochford, "Dirichlet Process Mixture Model in PyMC3".

[4] J. Sethuraman, "A Constructive Definition of Dirichlet Priors", Statistica Sinica, 1994.

[5] Y. Teh, M. Jordan, M. Beal, and D. Blei, "Hierarchical Dirichlet Processes", JASA, 2006.

[6] C. Wang, J. Paisley, and D. Blei, "Online Variational Inference for the Hierarchical Dirichlet Process", JMLR, 2011.

[7] J. Van Gael, Y. Saatci, Y. Teh, and Z. Ghahramani, "Beam Sampling for the Infinite Hidden Markov Model", ICML, 2008.

[8] D. Lin, W. Grimson, and J. W. Fisher III, "Construction of Dependent Dirichlet Processes Based on Compound Poisson Processes", NIPS, 2010.


DevOps Pipeline for a Machine Learning Project


Machine learning is getting more and more popular in applications and software products, from accounting to hot dog recognition apps. When you add machine learning techniques to existing projects, you need to be ready for a number of difficulties. The Statsbot team asked Boris Tvaroska to tell us how to prepare a DevOps pipeline for an ML-based project.

There is no shortage of tutorials and beginner training for data science. Most of them focus on "report" data science — a one-time activity in which you dig into a data set, clean it, process it, optimize hyperparameters, call the .fit() method, and present the results. On the other hand, SaaS or mobile apps are never finished; they are always-changing, constantly upgraded complex sets of algorithms, data, and configuration.

These are two very different types of applications.

Machine learning often expands the functionality of existing applications — recommendations in a web shop, utterance classification in a chat bot, etc. This means it will be part of a bigger cycle of adding new features, fixing bugs, and other frequent changes to the overall code.

One of my last projects was to build a productivity bot that helps knowledge workers in a large company find and execute the right process. It involved a lot of software engineering — integration, monitoring, forwarding the chat to a person — and several machine learning components. The main one was intent recognition, where we decided to build our own instead of using online services, because we were using a person's role, location, and more as additional features. As in any agile project, we were aiming to ship new functionality on a fast and regular basis.

A standard application in this fast-moving environment is managed through a pipeline of version control, test, build, and deployment tools (the CI/CD cycle). In the best case, this means a fully automated cycle: from a software engineer submitting code to central version control, through building and testing, to deployment in the production environment.

Adding machine learning into this life cycle brings new challenges and changes in a DevOps pipeline.

Testing of statistical results

Traditional unit and integration tests run on a small set of inputs and are expected to produce stable results: a test either passes or fails. In machine learning, part of the application produces statistical results — some of them will be as expected, some not.

The most important step in applying machine learning to DevOps is to select a metric (accuracy, F1, or another), define the expected target, and track its evolution.

We started with 80% accuracy and set a stretch target of moving 1 point per week for 10 weeks. The important decision here is what an end user can still accept as help, or as an improvement over the application without ML. Our targets were too strict, and we were "failing" too many builds that the client wanted to see in production.
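In practice, such a gate can be expressed as an ordinary test that asserts on the metric rather than on exact outputs. A minimal sketch; the threshold, the helper function, and the hard-coded labels are illustrative, not our actual pipeline code:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

ACCURACY_TARGET = 0.80  # the target agreed with the business up front

def test_classifier_meets_target():
    # In a real pipeline these would come from the held-out test set
    # and the freshly trained classifier.
    labels      = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
    predictions = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
    assert accuracy(predictions, labels) >= ACCURACY_TARGET

test_classifier_meets_target()
```

The build then fails only when the metric drops below the agreed target, not whenever any single prediction changes.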

The second important area in ML testing is testing the data set. There is only one simple rule — more is better. We started training our bot with a small labeled set and used selected utterances from this data set as testing scenarios. As the data set grew, we got better and better test results, but not so much in production. We changed the testing set to 25% of the overall data set and now keep a strict separation of training and testing data across all cycles.

Change in results without changing the code

The traditional software engineering approach is to run tests every time there is a change in the codebase. But when applying machine learning to DevOps, you need to remember that the quality of a machine learning model can change with no changes in the code.

The amount and quality of training data are critical for the results.

A testing cycle is usually triggered by a change in the codebase: a push to GitHub triggers other tools to build, test, and so on. GitHub is a great tool for managing code, though not so great for managing data sets. It cannot even host realistically sized data sets — the limit on file size is 100 MB.

Our small data set was still able to fit into the GitHub repository, and since it was pure text, GitHub could even show the differences between versions. As it grew, we moved it into a separate storage account and used a set of Python scripts to clean it and split it into a train file and a test file. At the same time, we had to change the frequency of retraining the classifier, as training took too long.
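The splitting step itself can stay a small script. A hedged sketch of holding out 25% for testing with a fixed seed, so the split is reproducible across retraining cycles; the function name and the in-memory demo data are illustrative (in the pipeline, the examples would be read from the external storage account and written to train/test files):

```python
import random

def split_dataset(examples, test_fraction=0.25, seed=42):
    """Shuffle labeled examples and hold out a fixed fraction for testing."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # train, test

# Demo on an in-memory data set of labeled utterances.
examples = [f"utterance_{i}\tintent_{i % 3}" for i in range(100)]
train, test = split_dataset(examples)
print(len(train), len(test))  # 75 25
```

Fixing the seed keeps the train/test boundary stable between runs, which preserves the strict separation mentioned above even as the data set is re-split.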

Time to train the classifier

The third challenge every machine learning application faces in the CI/CD cycle is the time needed to train the classifier. An agile process should be fast and able to push changes to a production system as soon as possible, but training a machine learning classifier can easily take several hours or days. We had to move from retraining on each build to our current approach of retraining weekly or bi-weekly.

There are two basic architectural approaches to address these challenges with a lot of possible shades between them.

Include full processing into your CI/CD cycle

This is where we started, and it looks to be the easiest and most straightforward way for a small team and a small dataset.

Pros:

• Easy fit to the existing process
• Full visibility of changes, whether in code or in data

Cons:

• Slow process
• A lot of "failing" builds with no apparent reason or change in code
• Complex management of training and testing data sets

Carve out the machine learning part into a separate service

Pros:

• Clear separation of responsibilities
• Ability to use different languages/frameworks for each part (in our case, Python for machine learning and JavaScript for the bot)

Cons:

• Risk of premature definition of services

Final thoughts

Our experience shows that there is no one-size-fits-all approach to combining traditional software engineering and machine learning in one project.

At the early stages of the project, when our data structure and features were changing often, the simplest approach was the best. Quick and dirty is sometimes the best way.

It allowed us to start the feedback cycle and evolve the model. The important thing is not to forget the technical debt you are building into the software. As our data set grew, we had to start moving parts out of the core service.

The second step was to separate machine learning into independent services. By the time our build/test run took 6 hours, we had to move it out, even though the rest of the software was not ready to be separated into a microservice architecture. The main driver for separating machine learning is the size of the data set.

What is your experience with applying machine learning to DevOps?
