Data Science Simplified Part 9: Interactions and Limitations of Regression Models
In the last few blog posts of this series, we discussed regression models at length. Fernando has built a multivariate regression model. The model takes the following shape:
price = -55089.98 + 87.34 engineSize + 60.93 horse power + 770.42 width
The model predicts or estimates price (target) as a function of engine size, horse power, and width (predictors).
Recall that the multivariate regression model assumes independence between the predictors. It treats horsepower, engine size, and width as if they are unrelated.
In practice, variables are rarely independent.
What if there are relations between horsepower, engine size and width? Can these relationships be modeled?
This blog post will address this question. It will explain the concept of interactions.
Independence between predictors means that when one predictor changes, its impact on the target does not depend on the existence of, or changes in, the other predictors. The relationship between the target and the predictors is additive and linear.
Let us take an example to illustrate it. Fernando’s equation is:
price = -55089.98 + 87.34 engine size + 60.93 horse power + 770.42 width
It is interpreted as: a unit change in engine size changes the price by $87.34, holding the other predictors fixed.
This interpretation never takes into consideration that engine size may be related to the width of the car.
Couldn't it be the case that the wider the car, the bigger the engine?
A third predictor can capture the interaction between engine size and width. This third predictor is called the interaction term.
With the interaction term between engine size and the width, the regression model takes the following shape:
price = β0 + β1.engine size + β2.horse power + β3.width + β4.(engine size x width)
The terms β1.engine size and β3.width are called the main effects of the interacting variables.
The term engine size x width is the interaction term.
How does this term capture the relation between engine size and width? We can rearrange this equation as:
price = β0 + (β1 + β4.width).engine size + β2.horse power + β3.width
Now, β4 can be interpreted as the change in the effect of engine size on price for every 1-unit increase in width.
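To make the idea concrete, here is a minimal sketch of fitting such an interaction model with ordinary least squares. The data is synthetic and the coefficient values are illustrative assumptions, not Fernando's actual dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors (illustrative ranges, not the real car data)
engine_size = rng.uniform(60, 330, n)
horse_power = rng.uniform(48, 290, n)
width = rng.uniform(60, 73, n)

# Simulate a price that contains an engineSize x width interaction
price = (5000 - 1000 * engine_size + 45 * horse_power
         - 700 * width + 17 * engine_size * width
         + rng.normal(0, 500, n))

# Design matrix: intercept, main effects, and the interaction term
X = np.column_stack([np.ones(n), engine_size, horse_power,
                     width, engine_size * width])
b0, b1, b2, b3, b4 = np.linalg.lstsq(X, price, rcond=None)[0]

# The marginal effect of engine size now depends on width: b1 + b4 * width
```

With the interaction column in the design matrix, ordinary least squares recovers β4 just like any other coefficient.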
Fernando inputs these data into his statistical package. The package computes the parameters. The output is the following:
The equation becomes:
price = 51331.363 - 1099.953 x engineSize + 45.896 x horsePower - 744.953 x width + 17.257 x engineSize:width
price = 51331.363 - (1099.953 - 17.257 x width) x engineSize + 45.896 x horsePower - 744.953 x width
Let us interpret the coefficients:
- The engine size, horse power, and engineSize:width (the interaction term) are significant.
- The width of the car is not significant.
- The main-effect coefficient of engine size is -1099.953: on its own, a 1-unit increase in engine size would reduce the price by $1099.953.
- Increasing the horse power by 1 unit increases the price by $45.896.
- The interaction term is significant. This implies that the true relationship is not additive.
- Taking the interaction into account, a 1-unit increase in engine size changes the price by (-1099.953 + 17.257 x width) dollars; for sufficiently wide cars the net effect becomes positive.
- The adjusted r-squared on the training data is 0.8358 => the model explains 83.58% of the variation.
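The width-dependent effect of engine size can be checked with a quick back-of-the-envelope calculation, using the coefficients from the fitted model above:

```python
def engine_size_effect(width):
    """Change in price for a 1-unit increase in engine size,
    evaluated at a given car width (coefficients from the fitted model)."""
    return -1099.953 + 17.257 * width

# For a narrow car the effect is negative; for a wide car it turns positive
for w in (60, 64, 70):
    print(f"width={w}: price change per unit of engine size = {engine_size_effect(w):.2f}")
```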
Note that the width of the car is not significant. Does it then make sense to include it in the model?
Here a principle called the hierarchical principle comes in.
Hierarchical Principle: When an interaction term is included in a model, the main effects need to be included as well, even if the individual variables are not significant on their own.
Fernando now runs the model and tests the model performance on test data.
The model performs well on the testing data set. The adjusted r-squared on test data is 0.8175622 => the model explains 81.76% of the variation on unseen data.
Fernando now has an optimal model to predict the car price and buy a car.
Limitations of Regression Models
Regression models are the workhorses of data science, and an essential tool in a data scientist's toolkit. When employed effectively, they solve a lot of real-life data science problems. Yet they do have their limitations. Three limitations of regression models are explained briefly:
Linear regression models assume linearity between variables. If the relationship is not linear, then linear regression models may not perform as expected.
Practical Tip: Use transformations like the logarithm to turn a non-linear relationship into a linear one.
Collinearity refers to a situation where two predictor variables are correlated with each other. When there are many predictors and these predictors are correlated with one another, it is called multicollinearity. If the predictors are correlated with each other, then the impact of a specific predictor on the target is difficult to isolate.
Practical Tip: Make the model simpler by choosing predictors carefully. Avoid choosing too many correlated predictors. Alternatively, use techniques like principal component analysis that create new, uncorrelated variables.
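A quick way to spot collinear predictors before modeling is a correlation matrix. Here is a sketch with synthetic, deliberately correlated columns (the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

engine_size = rng.normal(150, 30, n)
# width is constructed to track engine size, so the two are collinear
width = 0.05 * engine_size + rng.normal(64, 0.5, n)
horse_power = rng.normal(100, 25, n)      # independent of both

X = np.column_stack([engine_size, width, horse_power])
corr = np.corrcoef(X, rowvar=False)

# corr[0, 1] is close to 1 -> engine_size and width are collinear candidates
```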
Impact of outliers:
An outlier is a point far from the value predicted by the model. If there are outliers in the target variable, the model stretches to accommodate them. Too much of the model's adjustment is spent on a few outlier points, skewing the model towards the outliers and away from a good fit for the majority of the data.
Practical Tip: Remove the outlier points before modeling. If there are too many outliers in the target, multiple models may be needed.
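To see how a single extreme target value drags a least-squares fit, consider this small synthetic sketch:

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 2 * x + 1                      # perfectly linear data, true slope 2

clean_slope, clean_intercept = np.polyfit(x, y, 1)

y_out = y.copy()
y_out[-1] = 200                    # corrupt one target value
out_slope, out_intercept = np.polyfit(x, y_out, 1)

# The single outlier pulls the fitted slope well away from the true value of 2
```

The fit bends towards the outlier even though nine of the ten points lie exactly on the original line.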
It has been quite a journey. In the last few blog posts, the simple linear regression model was explained. Then we dabbled in multivariate regression models. Model selection methods were discussed, as were the treatment of qualitative variables and interactions.
In the next post of this series, we will discuss another type of supervised learning model: Classification.
Data Science Simplified Part 7: Log-Log Regression Models
In the last few blog posts of this series, we discussed the simple linear regression model, the multivariate regression model, and methods for selecting the right model.
Fernando has now created a better model.
price = -55089.98 + 87.34 engineSize + 60.93 horse power + 770.42 width
Fernando contemplates the following:
- How can I estimate the price changes using a common unit of comparison?
- How elastic is the price with respect to engine size, horse power, and width?
This article will address these questions. It will elaborate on log-log regression models.
To explain the concept of the log-log regression model, we need to take two steps back. First, let us understand the concepts of derivatives, logarithms, and exponentials. Then we need to understand the concept of elasticity.
Let us go back to high school math and meet derivatives, one of the most fascinating concepts taught in high school math and physics.
A derivative is a way to represent change: the amount by which a function is changing at a given point.
A variable y is a function of x. Define y as:
y = f(x)
We apply derivative on y with respect to x and represent it as follows:
dy/dx = df(x)/dx = f’(x)
This means the following:
- The change in y with respect to change in x i.e. How much will y change if x changes?
Isn't this exactly what Fernando wants? He wants to know the change in price (y) with respect to changes in the other variables (engine size, horse power, and width).
Recall that the general form of a multivariate regression model is the following:
y = β0 + β1.x1 + β2.x2 + …. + βn.xn + ε
Let us say that Fernando builds the following model:
price = β0 + β1 . engine size i.e. expressing price as a function of engine size.
Fernando takes the derivative of price with respect to engine size. Shouldn’t he be able to express the change in price with respect to changes in engine size?
Alas, it is not that simple. The linear regression model assumes a linear relationship. The Linear relationship is defined as:
y = mx + c
If the derivative of y over x is computed, it gives the following:
dy/dx = m . dx/dx + dc/dx
- The change of something with respect to itself is always 1 i.e. dx/dx = 1
- The change of a constant with respect to anything is always 0. That is why it is a constant. It won’t change. i.e. dc/dx = 0
The equation now becomes:
dy/dx = m
Applying the derivative of price with respect to engine size will yield nothing but the coefficient of engine size.
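A quick numerical check confirms this: the derivative of a linear function is its slope m everywhere, and the constant contributes nothing (the numbers below are arbitrary):

```python
def f(x):
    return 3.5 * x + 10            # y = mx + c with m = 3.5, c = 10

def derivative(f, x, h=1e-6):
    """Central-difference approximation of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)

# The estimated slope is 3.5 at every point we try
slopes = [derivative(f, x) for x in (-5.0, 0.0, 7.0)]
```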
There has to be a way to transform it. Here come two more mathematical characters. Meet exponential and logarithms.
Now let us look at the exponential, again a common character in high school math. An exponential function has a base (b) and an exponent (n) and is written b^n. As a function of x, it takes the form:
f(x) = b^x
The base can be any positive number. Euler's number (e), introduced below, is a common base used in statistics.
Geometrically, an exponential relationship has following structure:
- An increase in x doesn’t yield a corresponding increase in y. Until a threshold is reached.
- After the threshold, the value of y shoots up rapidly for a small increase in x.
The logarithm is an interesting character. Let us understand only the aspects of its personality that apply to regression models. The fundamental property of a logarithm is its base. The typical bases of logarithms are 2, 10, and e.
Let us take an example:
- How many 2s do we multiply to get 8? 2 x 2 x 2 = 8 i.e. 3
- This can also be expressed as: log2(8) = 3
The logarithm of 8 with base 2 is 3
There is another common base for logarithms: Euler's number (e). Its approximate value is 2.71828, and it is widely used in statistics. The logarithm with base e is called the natural logarithm.
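These properties are easy to confirm with Python's standard library (`math.log` uses base e by default):

```python
import math

# "How many 2s do we multiply to get 8?" -> logarithm base 2
assert math.log2(8) == 3

# The natural logarithm has base e (Euler's number, ~2.71828)
assert abs(math.log(math.e) - 1) < 1e-12

# log(b**x) = x * log(b), in any base
b, x = 2.0, 5.0
assert abs(math.log(b ** x) - x * math.log(b)) < 1e-12
```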
The logarithm also has an interesting transformative capability: it turns an exponential relationship into a linear one. Let us look at an example:
The diagram below, shows an exponential relationship between y and x:
If logarithms are applied to both x and y, the relationship between log(x) and log(y) is linear. It looks something like this:
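This linearizing effect can be verified numerically: for y = α.x^β, plotting log(y) against log(x) gives a straight line with slope β and intercept log(α). A numpy sketch with arbitrary α and β:

```python
import numpy as np

alpha, beta = 2.0, 1.7
x = np.linspace(1, 100, 50)
y = alpha * x ** beta              # curved, power-law relationship

log_x, log_y = np.log(x), np.log(y)

# A straight-line fit on the log-log scale recovers beta and log(alpha)
slope, intercept = np.polyfit(log_x, log_y, 1)
```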
Elasticity is the measurement of how responsive an economic variable is to a change in another.
Say that we have a function: Q = f(P) then the elasticity of Q is defined as:
E = P/Q x dQ/dP
- dQ/dP is the rate of change of Q with respect to P.
Bringing it all together:
Now let us bring these three mathematical characters together. Derivatives, Logarithms and Exponential. Their rules of engagement are as follows:
- The natural logarithm of e is 1, i.e. log(e) = 1 (log here denotes the natural logarithm).
- The logarithm of an exponential is the exponent multiplied by the logarithm of the base: log(b^x) = x.log(b).
- The derivative of log(x) is 1/x.
Let us take an example. Imagine a function y expressed as follows:
- y = b^x.
- => log(y) = x log (b)
So what does this mean for linear regression models? Can we do some mathematical juggling to make use of derivatives, logarithms, and exponentials? Can we rewrite the linear model equation to find the rate of change of y with respect to a change in x?
First, let us define the relationship between y and x as a power relationship:
- y = α.x^β
- Let us first express this in log-log form: log(y) = log(α) + β.log(x)
- Doesn't this equation look similar to the regression model y = β0 + β1.x1, where β0 = log(α) and β1 = β? It can now be rewritten as: log(y) = β0 + β1.log(x)
- But how does this represent elasticity? Taking the derivative of log(y) with respect to x, we get the following:
- d log(y)/dx = β1 . d log(x)/dx => 1/y . dy/dx = β1 . 1/x => β1 = x/y . dy/dx
- This expression for β1 is exactly the elasticity defined above.
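The derivation can be sanity-checked numerically: for y = α.x^β1, the quantity (x/y).(dy/dx) comes out as β1 at every x, independent of α (the values below are arbitrary):

```python
import math

alpha, beta1 = 3.0, 0.47

def y(x):
    return alpha * x ** beta1

def elasticity(x, h=1e-6):
    """(x / y) * dy/dx, with dy/dx estimated by a central difference."""
    dydx = (y(x + h) - y(x - h)) / (2 * h)
    return (x / y(x)) * dydx

# The elasticity equals beta1 regardless of where it is evaluated
checks = [elasticity(x) for x in (0.5, 10.0, 200.0)]
```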
Now that we understand the concept, let us see how Fernando builds a model. He builds the following model:
log(price) = β0 + β1. log(engine size) + β2. log(horse power) + β3. log(width)
He wants to estimate the change in car price as a function of the change in engine size, horse power, and width.
Fernando trains the model in his statistical package and gets the following coefficients.
The equation of the model is:
log(price) = -21.6672 + 0.4702.log(engineSize) + 0.4621.log(horsePower) + 6.3564.log(width)
Following is the interpretation of the model:
- All coefficients are significant.
- Adjusted r-squared is 0.8276 => the model explains 82.76% of variation in data.
- If the engine size increases by 10%, then the price of the car increases by roughly 4.7%.
- If the horse power increases by 10%, then the price of the car increases by roughly 4.62%.
- If the width of the car increases by 1%, then the price of the car increases by roughly 6.36%.
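Elasticity coefficients act multiplicatively on percentage changes: a p% change in a predictor scales the price by (1 + p/100)^β. A quick check with Fernando's fitted coefficients:

```python
b_engine, b_hp, b_width = 0.4702, 0.4621, 6.3564   # fitted elasticities

def pct_price_change(b, pct_change):
    """Percent change in price when a predictor changes by pct_change percent."""
    return ((1 + pct_change / 100) ** b - 1) * 100

engine_effect = pct_price_change(b_engine, 10)   # ~4.6% for a 10% bigger engine
width_effect = pct_price_change(b_width, 1)      # ~6.5% for a 1% wider car
```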
Fernando has now built the log-log regression model. He evaluates the performance of the model on both training and test data.
Recall, that he had split the data into the training and the testing set. The training data is used to create the model. The testing data is the unseen data. Performance on testing data is the real test.
On the training data, the model performs quite well. The adjusted R-squared is 0.8276 => the model can explain 82.76% variation on training data. For the model to be acceptable, it also needs to perform well on testing data.
Fernando tests the model performance on the test data set. The model computes the adjusted r-squared as 0.8186 on the testing data. This is good. It means that the model can explain 81.86% of the variation even on unseen data.
Note that the model estimates log(price) and not the price of the car. To convert the estimated log(price) into a price, a back-transformation is needed.
The transformation treats the estimated log(price) as an exponent of the base e:
e^log(price) = price
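In code, the back-transformation is a single call to `exp` (the predicted value below is an arbitrary illustration, not a real model output):

```python
import math

log_price_hat = 9.62               # hypothetical log(price) predicted by the model
price_hat = math.exp(log_price_hat)

# Taking the log again recovers the model's prediction exactly
```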
The last few posts have been quite a journey. Statistical learning laid the foundations. Hypothesis testing introduced the concepts of the null and alternate hypotheses. Simple linear regression models made regression simple. We then progressed into the world of multivariate regression models, and discussed model selection methods. In this post, we discussed log-log regression models.
So far, the regression models built have had only numeric independent variables. In the next post, we will deal with the concepts of interactions and qualitative variables.
Data Science Simplified Part 2: Key Concepts of Statistical Learning
In the first article of this series, I touched upon the key concepts and processes of Data Science. In this article, I will dive in a bit deeper. First, I will define what statistical learning is. Then, we will dive into key concepts in statistical learning. Believe me; it is simple.
As per Wikipedia, Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis.
Machine learning is a manifestation of statistical learning techniques implemented through software applications.
What does this mean in practice? Statistical learning refers to tools and techniques that enable us to understand data better. Let’s take a step back here. What do we mean by understanding the data?
In the context of Statistical learning, there are two types of data:
- Data that can be controlled directly, a.k.a. independent variables.
- Data that cannot be controlled directly, a.k.a. dependent variables.
The data that can't be controlled, i.e. the dependent variables, needs to be predicted or estimated.
Understanding the data better is to figure out more about the dependent variable in terms of independent variables. Let me illustrate it with an example:
Say that I want to measure sales based on the advertising budget I allocate for TV, Radio, and Print. I can control the budget that I can assign to TV, Radio, and Print. What I cannot control is how they will impact the sales. I want to express data that I cannot control (sales) as a function of data that I can control (advertising budget). I want to uncover this hidden relationship.
Statistical learning reveals hidden data relationships. Relationships between the dependent and the independent data.
Parameters and Models:
One of the famous business models in operations management is the ITO model. It stands for Input-Transformation-Output model. It is simple. There are inputs. These inputs undergo some transformations. An output is created.
Statistical learning also applies a similar concept. There are input data. Input data is transformed. The output, something that needs to be predicted or estimated, is generated.
The transformation engine is called a model. Models are functions that estimate the output.
The transformation is mathematical: mathematical ingredients are combined with the input data to estimate the output. These ingredients are called the parameters.
Let’s walk through an example:
What determines someone's income? Say that income is determined by one's years of education and years of experience. A model that estimates income could be something like this:
income = c + β0 x education + β1 x experience
β0 and β1 are parameters that express income as a function of education and experience.
Education and experience are controllable variables. These controllable variables go by different names: they are called independent variables, and also features.
Income is the uncontrollable variable. Such variables are called targets; they are also known as dependent variables.
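As a toy illustration of parameters transforming inputs into an estimate, here is the income model with made-up parameter values (c, β0, and β1 below are assumptions for illustration only):

```python
# Hypothetical parameters: a base income, plus a return per year
# of education and per year of experience
c, b0, b1 = 20000, 3000, 1500

def predicted_income(education_years, experience_years):
    return c + b0 * education_years + b1 * experience_years

income = predicted_income(16, 5)   # 20000 + 3000*16 + 1500*5 = 75500
```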
Training and Testing:
What do we do when we have to prepare for an examination? Study. Learn. Imbibe. Take notes. Practice mock papers. These are the tools to learn and to prepare for the unseen test.
Machine Learning also uses a similar concept for learning. Data is finite. The available data needs to be used prudently. The model built needs to be validated. The way to validate it is the following:
- Split the data into two parts.
- Use one part for training. Let the model learn from it; let the model see the data. This dataset is called the training data.
- Use the other part for testing the model. Give the model "a test" on the unseen data. This dataset is called the testing data.
In a competitive examination, if the preparation is adequate and the learning is sound, then performance on the test is expected to be satisfactory as well. Similarly, in machine learning, if the model has learned well from the training data, it will perform well on the test data.
Once the model is run on the test dataset, its performance is evaluated by how close its estimates are to the actual values.
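The split can be sketched in a few lines of plain Python. The 80/20 ratio is a common convention, assumed here rather than prescribed by the article:

```python
import random

data = list(range(100))            # stand-in for a dataset of 100 rows

random.seed(42)                    # fixed seed for a reproducible split
random.shuffle(data)

split = int(0.8 * len(data))       # 80% training, 20% testing
train, test = data[:split], data[split:]

# The model only ever "sees" train; test stays unseen until evaluation
assert len(train) == 80 and len(test) == 20
assert set(train).isdisjoint(test)
```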
Variance and Bias:
George Box, a famous British statistician, once said:
“All models are wrong; some are useful.”
No model is 100% accurate. All models have errors. These errors emanate from two sources:
Let me attempt to explain this using an analogy.
Raj, a seven-year-old kid, was just introduced to the concept of multiplication. He had mastered the tables of 1 and 2. His next challenge was to learn the table of 3. He was very excited and started to practice the multiplication table of 3. His table is something like this:
- 3 x 1 = 4
- 3 x 2 = 7
- 3 x 3 = 10
- 3 x 4 = 13
- 3 x 5 = 16
Bob, Raj's classmate, was in the same boat. His table looked something like this:
- 3 x 1 = 5
- 3 x 2 = 9
- 3 x 3 = 18
- 3 x 4 = 24
- 3 x 5 = 30
Let us examine the multiplication models created by Bob and Raj from a machine learning perspective.
- Raj's model has an invalid assumption. It assumes that multiplication means adding one to the result. This assumption introduces a bias error. The assumption is consistent, i.e. always add 1 to the output, so the error is systematic rather than random.
- Raj's model results in output that is consistently 1 away from the actual value. This means that his model has a low variance.
- Bob's model's output is all over the place. His outputs deviate a lot from the actual values, with no consistent pattern of deviation. Bob's model has a high bias and a high variance.
The above example is a crude explanation of the important concept of variance and bias.
- Bias is the model's tendency to consistently learn the wrong thing by not taking into account all the information in the data.
- Variance is the model's tendency to learn random things irrespective of the real signal.
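Raj's and Bob's tables can be scored numerically, taking Raj's answers as the consistent add-one pattern described above: bias as the average error against the true table, variance as the spread of those errors:

```python
true = [3 * i for i in range(1, 6)]    # 3, 6, 9, 12, 15
raj  = [4, 7, 10, 13, 16]              # consistently 1 too high
bob  = [5, 9, 18, 24, 30]              # erratic

def bias(pred):
    errors = [p - t for p, t in zip(pred, true)]
    return sum(errors) / len(errors)

def variance(pred):
    errors = [p - t for p, t in zip(pred, true)]
    mean_err = sum(errors) / len(errors)
    return sum((e - mean_err) ** 2 for e in errors) / len(errors)

# Raj: a systematic error of exactly 1, with zero spread (zero variance)
# Bob: large errors that also vary wildly (high variance)
```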
I have a school friend who was an amazing student. He was relatively weaker in Mathematics. The way he used to study Mathematics was through rote learning. He learned and memorized the Mathematical problems. He could “recite” them very well.
The challenge was the following: The examination problems were not the same problem that he memorized. The problems were a generalized application of the Mathematical concept. Obviously, he had a tough time clearing the exams.
Machine learning problems follow the same pattern. If a model learns too much about a particular dataset and the same model is then applied to unseen data, it will have a high error. Learning too much from a given dataset is called overfitting: the learning has not been generalized so that it can be usefully applied to unseen data. On the other end of the spectrum, learning too little results in underfitting: the model is so poor that it can't even learn from the given data.
Albert Einstein summarizes this concept succinctly. He said:
“Everything should be made as simple as possible, but no simpler.”
A constant endeavor in machine learning problems is to strike the right balance. Create a model that is not too complex and not too simple. Create a generalized model. Create a model that is somewhat inaccurate but useful.
- A model that overfits is complex. It performs very well on training data but poorly on testing data.
- A model that underfits is too simplistic. It doesn't perform well on either training or testing data.
- A good model balances underfitting and overfitting. It generalizes well. It is as simple as possible but no simpler.
This balancing act is called bias-variance trade-off.
Statistical learning is the building block of complex machine learning applications. This article introduces some of the fundamental and essential concepts of Statistical learning. Top 5 takeaways from this article are:
- Statistical learning reveals hidden data relationships. Relationships between the dependent and the independent data.
- Model is the transformation engine. Parameters are the ingredients that enable the transformation.
- A model uses the training data to learn. A model uses the testing data to evaluate.
- All models are wrong; some are useful.
- Bias-variance trade-off is a balancing act. Balance to find optimal model. Balance to find the sweet spot.
We will delve deeper into specifics of machine learning models in subsequent articles of this series.
Data Science Simplified: Principles and Process
In 2006, Clive Humby, UK mathematician and architect of Tesco's Clubcard, coined the phrase "Data is the new oil." He said the following:
"Data is the new oil. It's valuable, but if unrefined it cannot be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so must data be broken down and analyzed for it to have value."
The iPhone revolution, the growth of the mobile economy, and advancements in Big Data technology have created a perfect storm. In 2012, HBR published an article that put Data Scientists on the radar.
The article Data Scientist: The Sexiest Job of the 21st Century labeled this “new breed” of people; a hybrid of data hacker, analyst, communicator, and trusted adviser.
Every organization is now making attempts to be more data-driven. Machine learning techniques have helped them in this endeavor. I realize that a lot of the material out there is too technical and difficult to understand. In this series of articles, my aim is to simplify Data Science. I will take a cue from the Stanford course/book (An Introduction to Statistical Learning). This attempt is to make Data Science easy to understand for everyone.
In this article, I will begin by covering fundamental principles, general process and types of problems in Data Science.
Data Science is a multi-disciplinary field. It is the intersection between the following domains:
- Business Knowledge
- Statistical Learning aka Machine Learning
- Computer Programming
The focus of this series will be to simplify the machine learning aspect of Data Science.
- Data is a strategic asset: This concept is an organizational mindset. The question to ask is: "Are we using all the data assets that we are collecting and storing? Are we able to extract meaningful insights from them?" I'm sure that the answer to these questions is often "No". Companies that are cloud-born are intrinsically data-driven; it is in their psyche to treat data as a strategic asset. Most other organizations do not yet have this mindset.
- A systematic process for knowledge extraction: A methodical process needs to be in place for extracting insights from data. This process should have clear and distinct stages with clear deliverables. The Cross Industry Standard Process for Data Mining (CRISP-DM) is one such process.
- Sleeping with the data: Organisations need to invest in people who are passionate about data. Transforming data into insight is not alchemy, and there are no alchemists. They need evangelists who understand the value of data, who are data literate and creative. They need folks who can connect data, technology, and business.
- Embracing uncertainty: Data Science is not a silver bullet. It is not a crystal ball. Like reports and KPIs, it is a decision enabler. Data Science is a tool, not an end in itself. It operates not in the realm of the absolute but in the realm of probabilities. Managers and decision makers need to embrace this fact and bring quantified uncertainty into their decision-making process. Such uncertainty can only take hold if the organizational culture adopts a fail-fast, learn-fast approach and chooses a culture of experimentation.
- The BAB principle: I perceive this as the most important principle. The focus of a lot of Data Science literature is on models and algorithms. The equation is devoid of business context. Business-Analytics-Business (BAB) is the principle that emphasizes the business part of the equation. Putting them in a business context is pivotal. Define the business problem. Use analytics to solve it. Integrate the output into the business process. BAB.
Taking a cue from principle #2, let me now emphasize on the process part of data science. Following are the stages of a typical data science project:
1. Define Business Problem
Albert Einstein said, "Everything should be made as simple as possible, but not simpler". This quote is the crux of defining the business problem. Problem statements need to be developed and framed. Clear success criteria need to be established. In my experience, business teams are too busy with their operational tasks at hand. That doesn't mean they don't have challenges that need to be addressed. Brainstorming sessions, workshops, and interviews can help to uncover these challenges and develop hypotheses. Let me illustrate this with an example. Let us assume that a telecom company has seen a decline in its year-on-year revenue due to a reduction in its customer base. In this scenario, the business problem may be defined as:
- The company needs to grow its customer base by targeting new segments and reducing customer churn.
2. Decompose To Machine Learning Tasks
The business problem, once defined, needs to be decomposed into machine learning tasks. Let's elaborate on the example set above. If the organization needs to grow the customer base by targeting new segments and reducing customer churn, how can we decompose that into machine learning problems? Following is an example of decomposition:
- Reduce the customer churn by x %.
- Identify new customer segments for targeted marketing.
3. Data Preparation
Once we have defined the business problem and decomposed it into machine learning problems, we need to dive deeper into the data. Data understanding should be explicit to the problem at hand; it should help us develop the right strategies for analysis. Key things to note are the source of the data, the quality of the data, data bias, etc.
4. Exploratory Data Analysis
A cosmonaut traverses through the unknowns of the cosmos. Similarly, a data scientist traverses through the unknowns of the patterns in the data, peeks into the intrigues of its characteristics and formulates the unexplored. Exploratory data analysis (EDA) is an exciting task. We get to understand the data better, investigate the nuances, discover hidden patterns, develop new features and formulate modeling strategies.
5. Modeling
After EDA, we move on to the modeling phase. Here, based on our specific machine learning problems, we apply useful algorithms like regressions, decision trees, random forests, etc.
6. Deployment and Evaluation
Finally, the developed models are deployed. They are continuously monitored to observe how they behaved in the real world and calibrated accordingly.
Typically, the modeling and deployment part is only 20% of the work. 80% of the work is getting your hands dirty with data, exploring the data and understanding it.
Machine Learning Problem Types
In general, machine learning has two kinds of tasks:
Supervised learning is a type of machine learning task where there is a defined target. Conceptually, a modeler will supervise the machine learning model to achieve a particular goal. Supervised Learning can be further classified into two types:
Regression is the workhorse of machine learning tasks. Regression models are used to estimate or predict a numerical variable. A few examples of regression questions are:
- What is the estimate of the potential revenue next quarter?
- How many deals can I close next year?
As the name suggests, classification models classify: they estimate which bucket something best fits into. Classification models are frequently used in all types of applications. A few examples of classification models are:
- Spam filtering is a popular implementation of a classification model. Here every incoming e-mail is classified as spam or not spam based on certain characteristics.
- Churn prediction is another important application of classification models. Churn models are widely used in telcos to classify whether a given customer will churn (i.e. cease to use the service) or not.
Unsupervised learning is a class of machine learning tasks where there is no target. Since unsupervised learning doesn't have a specified target, the results it produces may sometimes be difficult to interpret. There are many types of unsupervised learning tasks. The key ones are:
- Clustering: Clustering is the process of grouping similar things together. Customer segmentation uses clustering methods.
- Association: Association is a method of finding products that are frequently bought together. Market basket analysis in retail uses association methods to bundle products.
- Link Prediction: Link prediction is used to find connections between data items. Recommendation engines employed by Facebook, Amazon, and Netflix make heavy use of link prediction algorithms to recommend friends, items to purchase, and movies, respectively.
- Data Reduction: Data reduction methods are used to simplify data set from a lot of features to a few features. It takes a large data set with many attributes and finds ways to express them in terms of fewer attributes.
Machine Learning Task to Models to Algorithm
Once we have broken down business problems into machine learning tasks, one or many algorithms can solve a given task. Typically, a model is trained with multiple algorithms, and the algorithm or set of algorithms that provides the best result is chosen for deployment.
Azure Machine Learning has more than 30 pre-built algorithms that can be used for training machine learning models.
The Azure Machine Learning cheat sheet helps navigate through them.
Data Science is a broad field. It is an exciting field. It is an art. It is a science. In this article, we have just scratched the surface. The "hows" will be futile if the "whys" are not known. In subsequent articles, we will explore the "hows" of machine learning.
Demystifying Data Lake Architecture
According to Gartner, 80% of successful CDOs will have value creation or revenue generation as their number-one priority through 2021.
To extract maximum value from an organization's data landscape, traditional decision-support-system architectures are no longer adequate. New architectural patterns need to be developed to harness the power of data. To fully capture the value of big data, organizations need flexible data architectures that can extract maximum value from their data ecosystems.
The Data Lake concept has been around for some time now. However, I have seen organizations struggle to understand it, as many of them are still boxed into the older paradigm of Enterprise Data Warehouses.
In this article, I will dive deep into the conceptual constructs of the Data Lake pattern and lay out an architecture for it.
Let us start with the known first.
Traditional Data Warehouse (DWH) Architecture:
The traditional Enterprise DWH architecture pattern has been in use for many years. There are data sources; data is extracted, transformed, and loaded (ETL), and along the way we do some kind of structuring, cleansing, etc. We predefine the data model in the EDW (a dimensional model or a 3NF model) and then create departmental data marts for reporting, OLAP cubes for slicing and dicing, and self-service BI.
This pattern is quite ubiquitous and has served us well for a long time now.
However, there are some inherent challenges in this pattern that can't scale in the era of Big Data. Let us look at a few of them:
- Firstly, the philosophy is that we need to understand the data first: the source system's structure, what kind of data it holds, the cardinality, how the model should be designed based on business requirements, whether there are anomalies in the data, and so on. This is tedious and complex work. I used to spend at least 2–3 months in the requirement-analysis and data-analysis phase. EDW projects span anywhere from a few months to a few years. And all of this rests on the assumption that the business knows its requirements.
- We also have to make choices and compromises about which data to store and which to discard. A lot of time is spent upfront deciding what to bring in, how to bring it in, how to store it, and how to transform it. Less time is spent actually performing data discovery, uncovering patterns, or creating new hypotheses that add business value.
Definition of Data:
Let us now discuss briefly how the definition of data has changed. The four Vs of big data are now very well known: volume, velocity, variety, and veracity. Let me put them in context:
- Data volumes have exploded since the iPhone revolution. There are 6 billion smartphones, and the amount of data they create grows every year.
- Data is no longer just at rest. There is streaming data from IoT-enabled connected devices: a plethora of data emanating from multiple fronts.
- It is also about the variety of data. Video feeds and photographs are all data points now, and they demand to be analyzed and exploited.
- With the explosion of data comes the challenge of data quality. Deciding which data should be trusted and which should not is a bigger challenge in the big data world.
In short, the definition of analyzable data has changed. It is not just structured corporate data now, but all kinds of data. The challenge is to mash them together and make sense of them.
Since 2000 there have been tremendous changes in processing capabilities, storage, and the corresponding cost structure, following what we call Moore's Law. Key points:
- Processing capabilities have increased by around 10,000 times since 2000, which means the ability to analyze more data efficiently has increased.
- The cost of storage has also come down considerably: since 2000, it has fallen more than 1,000-fold.
The Data Lake Analogy:
Let me explain the concept of Data Lake using an analogy.
Visiting a large lake is always a pleasant experience. The water in the lake is in its purest form, and different people perform different activities on it. Some people are fishing, some are enjoying a boat ride, and the lake also supplies drinking water to people living in Ontario. In short, the same lake is used for multiple purposes.
With the changes in the data paradigm, a new architectural pattern has emerged. It is called the Data Lake architecture. Like the water in the lake, data in a data lake is in its purest possible form. And just as the lake caters to different people (those who want to fish, take a boat ride, or draw drinking water), a data lake architecture caters to multiple personas. It gives data scientists an avenue to explore data and create hypotheses. It gives business users an avenue to explore data. It gives data analysts an avenue to analyze data and find patterns. And it gives reporting analysts an avenue to create reports and present them to stakeholders.
The way I compare a data lake to a data warehouse or a data mart is this:
A data lake stores data in its purest form, caters to multiple stakeholders, and can also be used to package data in a form that end users can consume. A data warehouse, on the other hand, is already distilled and packaged for defined purposes.
Conceptual Data Lake Architecture:
Having explained the concept, let me now walk you through a conceptual architecture of a data lake. Here are its key components. We have data sources, which can be structured or unstructured. They all feed a raw data store that takes in data in the purest possible form, i.e. with no transformations. It is cheap, persistent storage that can hold data at scale. Then we have the analytical sandbox, which is used for understanding the data, creating prototypes, performing data science, and exploring the data to build new hypotheses and use cases.
Next is a batch processing engine that turns the raw data into something users can consume, i.e. a structure suitable for reporting to end users. We call this the processed data store. There is also a real-time processing engine that handles streaming data. All the data in this architecture is cataloged and curated.
Let me walk you through each component group in this Architecture.
The first component group caters to processing data. It follows an architectural pattern called the Lambda architecture. Basically, the Lambda architecture takes two processing paths: a batch layer and a speed layer. The batch layer stores data in the rawest possible form (the raw data store), while the speed layer processes the data in near real time. The speed layer also writes to the raw data store and may keep transient data before loading it into the processed data stores.
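The two paths can be sketched in a few lines. The store names and event shape below are hypothetical, not a specific product's API; they just show how every event lands in the raw store while the speed layer keeps a near-real-time view until the next batch run supersedes it.

```python
# Illustrative sketch of the two Lambda-architecture paths.
raw_data_store = []        # batch layer input: immutable, append-only raw storage
processed_data_store = {}  # serving structure rebuilt by the batch job
speed_view = {}            # near-real-time view maintained by the speed layer

def ingest(event):
    """Every event lands in the raw store AND updates the speed layer."""
    raw_data_store.append(event)
    key = event["key"]
    speed_view[key] = speed_view.get(key, 0) + event["value"]

def batch_recompute():
    """Periodic batch job: rebuild the processed store from all raw data."""
    processed_data_store.clear()
    for e in raw_data_store:
        processed_data_store[e["key"]] = processed_data_store.get(e["key"], 0) + e["value"]
    speed_view.clear()  # transient speed-layer state is superseded by the batch result

# Streaming events are queryable immediately via speed_view...
ingest({"key": "sensor_a", "value": 3})
ingest({"key": "sensor_a", "value": 2})
# ...and the batch job later produces the authoritative processed view.
batch_recompute()
```

The key property is that the raw store is never mutated, so the processed view can always be recomputed from scratch if the transformation logic changes.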
Analytical sandboxes are one of the key components of the data lake architecture. These are exploratory areas where data scientists can develop and test new hypotheses, mash up and explore data to form new use cases, create rapid prototypes to validate those use cases, and figure out how to extract value for the business.
It is the place where data scientists can discover data, extract value, and help transform the business.
Cataloging and Governance:
Data cataloging is an important principle that has been consistently overlooked in traditional business intelligence. In the big data landscape, cataloging is the most important aspect to focus on. Let me first give an analogy to explain cataloging; I do this exercise with my customers to get the point across.
I show them a painting and ask them to guess its potential value without any catalog information. The answers range from $100 to $100,000. The answers come much closer to the actual value once I provide the catalog information. The painting, by the way, is 'The Old Guitarist' by Pablo Picasso, created in 1903; its estimated value is more than $100 million.
The idea of a data catalog is very similar. Different data nuggets have different value, and that value varies with the lineage of the data, its quality, its source of creation, and so on. The data needs to be cataloged so that a data analyst or data scientist can decide for themselves which data point to use for a specific analysis.
The catalog map lays out the metadata that can potentially be cataloged. Cataloging is the process of capturing valuable metadata so that it can be used to determine the characteristics of the data and to decide whether or not to use it. There are basically two types of metadata: business and technical. Business metadata deals with definitions, logical data models, logical entities, and so on, whereas technical metadata captures the metadata related to the physical implementation of the data structure. It includes things like the database, the quality score, the columns, the schema, etc.
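A catalog entry might look like the sketch below. All field names and values are hypothetical, chosen only to show the business/technical split; real catalogs (e.g. Azure Data Catalog) have far richer schemas.

```python
# Hypothetical data catalog entry illustrating business vs. technical metadata.
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    # business metadata: definitions and logical context
    name: str
    definition: str
    source_system: str
    # technical metadata: physical implementation details
    database: str
    column: str
    quality_score: float  # e.g. 0.0 (untrusted) to 1.0 (fully trusted)

catalog = [
    CatalogEntry("inventory_turnover", "COGS / average inventory (finance view)",
                 "ERP", "erp_db", "inv_turnover", 0.95),
    CatalogEntry("inventory_turnover", "units shipped / units on hand",
                 "Inventory system", "wms_db", "turnover_ratio", 0.70),
]

# An analyst can now pick the entry that fits the context, e.g. the
# highest-quality definition coming from the ERP:
best = max((e for e in catalog if e.source_system == "ERP"),
           key=lambda e: e.quality_score)
```

The same logical term carries two physical definitions here, and it is the catalog, not tribal knowledge, that lets the analyst choose between them.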
Based on the catalog information, an analyst can choose to use a specific data point in the right context. Let me give you an example. Imagine that a data scientist wants to do an exploratory analysis of the Inventory Turnover Ratio, and the ratio is defined differently in the ERP and in the inventory system. If the term is cataloged, the data scientist can decide, based on the context, whether to use the column from the ERP or from the inventory system.
Key Difference between Data Lake and EDW:
Let me make the differences explicit:
- First, the philosophy is different. In a data lake architecture, we load data first in raw form and decide afterward what to do with it. In a traditional DWH architecture, we must first understand the data, model it, and only then load it.
- Data in a data lake is stored in raw form, whereas data in a DWH is stored in structured form. Remember the lake versus the distilled water.
- A data lake supports all kinds of users.
- Analytics projects are really agile projects. Their nature is that once you see the output, you think more and want more. Data lakes are agile by nature: since they store all data along with its catalog, new requirements that emerge can be accommodated quite easily.
Data Lake Architecture on Azure:
Cloud platforms are best suited to implementing the Data Lake architecture. They offer a host of composable services that can be woven together to achieve the required scalability. Microsoft's Cortana Intelligence Suite provides components that map onto the Data Lake architecture and bring it to fruition.
- Data Lakes are a paradigm shift for Big Data architecture.
- Data Lakes cater to all kinds of data, store data in raw form, serve a spectrum of users, and enable faster insights.
- Meticulous data cataloging and governance are key to a successful data lake implementation.
- Cloud platforms offer an end-to-end solution for implementing a data lake architecture in an economical and scalable way.