
## Stochastic Processes and New Tests of Randomness – Application to Cool Number Theory Problem


This article is intended for practitioners who are not necessarily statisticians or statistically savvy. The mathematical level is kept as simple as possible, yet I present an original, simple approach to testing for randomness, with an interesting application to illustrate the methodology. This material is not usually discussed in textbooks or classrooms (even for statistics students). It offers a fresh perspective and out-of-the-box tools that are useful in many contexts, as an addition or alternative to the traditional tests that are widely used. This article is written as a tutorial, but it also features an interesting research result in the last section.

1. Context

Let us assume that you are dealing with a time series with discrete time increments (for instance, daily observations) as opposed to a time-continuous process. The approach here is to apply and adapt techniques used for time-continuous processes to time-discrete processes. More specifically (for those familiar with stochastic processes) we are dealing here with discrete Poisson processes. The main question that we want to answer is: Are some events occurring randomly, or is there a mechanism that makes them occur non-randomly? What is the distribution of the gap between two successive events of the same type?

In a time-continuous setting (Poisson process) the distribution in question is modeled by the exponential distribution. In the discrete case investigated here, the discrete Poisson process turns out to be a Markov chain, and we are dealing with geometric, rather than exponential distributions. Let us illustrate this with an example.

Example

The digits of the square root of two, SQRT(2), are believed to be distributed as if they were occurring randomly. Each of the 10 digits 0, 1, …, 9 appears with a frequency of 10% based on observations, and at any position in the decimal expansion of SQRT(2), the next digit does not seem to depend on the value of the previous digit (in short, its value is unpredictable). An event in this context is defined, for example, as a digit being equal to (say) 3. The next event is the first time we find a subsequent digit also equal to 3. The gap (or time elapsed) between two occurrences of the same digit is the main metric that we are interested in, and it is denoted as G. If the digits were distributed just like random numbers, the distribution of the gap G between two occurrences of the same digit would be geometric, that is,

$$P(G = g) = p\,(1 - p)^{g-1}, \quad g = 1, 2, 3, \ldots$$

with p = 1 / 10 in this case, as each of the 10 digits (0, 1, …, 9) seems — based on observations — to have a frequency of 10%. We will show that this is indeed the case: In other words, in our example, the gap G is very well approximated by a geometric distribution of parameter p = 1 / 10, based on an analysis of the first 10 million digits of SQRT(2).
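As a quick reference, here are two standard properties of a geometric gap with parameter p, assuming the convention that G takes the values 1, 2, 3, … (a gap of 1 means the same digit appears twice in a row):

$$E[G] = \frac{1}{p}, \qquad P(G > g) = (1 - p)^{g}, \quad g = 0, 1, 2, \ldots$$

With p = 1 / 10, the average gap is 10 digits, and a gap longer than 50 digits should occur for roughly 0.5% of the gaps, since 0.9^50 ≈ 0.005.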

What else should I look for, and how to proceed?

Studying the distribution of gaps can reveal patterns that standard tests might fail to catch. Another statistic worth studying is the maximum gap, see this article. This is sometimes referred to as extreme events / outlier analysis. Also, in our above example, you can study gaps between groups of digits rather than single digits: for instance, how frequently the “word” 234567 repeats itself in the sequence of digits, and what the distribution of the gap is for that word. For any word consisting of 6 digits, p = 1 / 1,000,000. Our data set only has 10 million digits, so you would expect to find 234567 only about 10 times, and looking at the gap between successive occurrences of 234567 is pointless. Shorter words make more sense. This and other issues are discussed in the next section.

2. Methodology

The first step is to estimate the probabilities p associated with the model, that is, the probability for a specific event to occur at any given time. It can easily be estimated from your data set, and generally, you get a different p for each type of event. Then you need an algorithm to compute the empirical (observed) distribution of gaps between two successive occurrences of the same event. In our example, we have 10 types of events, each associated with the occurrence of one of the 10 digits 0, 1, …, 9 in the decimal representation of SQRT(2). The gap computation can be efficiently performed as follows:

Algorithm to compute the observed gap distribution

Do a loop over all your observations (in our case, the first 10 million digits of SQRT(2), stored in a file; each of these 10 million digits is one observation). Within the loop, at each iteration t, do:

• Let E be the event showing up in the data set, at iteration t. For instance, the occurrence of (say) digit 3 in our case. Retrieve its last occurrence stored in an array, say LastOccurrences[E]
• Compute the gap G as G = t – LastOccurrences[E]
• Update the LastOccurrences table as follows: LastOccurrences[E] = t
• Update the gap distribution table, denoted as GapTable (a two-dimensional array or better, a hash table) as follows: GapTable[E, G]++

Once you have completed the loop, all the information that you need is stored in the GapTable summary table.
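The loop above can be sketched in Python as follows. This is a minimal sketch, not the code used for the article: the function name `gap_table` is made up for illustration, and it assumes that the very first occurrence of an event produces no gap (the steps above do not say how to handle it).

```python
from collections import Counter, defaultdict

def gap_table(observations):
    """Count gaps between successive occurrences of each event.

    Returns a dict mapping event -> Counter({gap: count}).
    """
    last_occurrence = {}              # event -> position of its previous occurrence
    table = defaultdict(Counter)      # event -> histogram of observed gaps
    for t, event in enumerate(observations):
        if event in last_occurrence:  # assumption: the first occurrence has no gap yet
            gap = t - last_occurrence[event]
            table[event][gap] += 1
        last_occurrence[event] = t
    return table

# First 20 decimals of SQRT(2), used as a tiny test input
digits = "41421356237309504880"
gaps = gap_table(digits)
print(dict(gaps["4"]))
```

A dictionary of counters plays the role of the GapTable hash table: only the gaps that actually occur take up memory, which matters when events are rare and gaps can be very long.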

Statistical testing

If some events occur randomly, the theoretical distribution of the gap for these events is known to be geometric (see the formula in the first section). So you must test whether the empirical gap distribution (computed with the above algorithm) is statistically different from the theoretical geometric distribution of parameter p (remember that each type of event may have a different p). If it is statistically different, then the assumption of randomness should be discarded: you’ve found some patterns. This work is typically done using a Kolmogorov-Smirnov test. If you are not a statistician but instead a BI analyst or engineer, other techniques can be used instead, and are illustrated in the last section:

• You can simulate events that are perfectly randomly distributed, and compare the gap distribution obtained in your simulations with that computed on your observations. See here how to do it, especially the last comment featuring an efficient way to do it. This Monte-Carlo simulation approach will appeal to operations research analysts.
• In Excel, plot the gap distribution computed on your observations (one for each type of event), add a trendline, and optionally, display the trendline equation and its R-Squared. When choosing a trendline (model fitting) in Excel, you must select the Exponential one. This is what we did (see next section) and the good news is that, despite the very limited selection of models that Excel offers, Exponential is one of them. You can actually test other trendlines in Excel (polynomial, linear, power, or logarithmic) and you will see that Exponential offers by far the best fit, provided your events are really randomly distributed.
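Both ideas can be combined in a short Python sketch. It is only an illustration under stated assumptions: the names `simulated_gaps` and `exponential_trendline` are hypothetical, the trendline is fitted as a least-squares line through log(frequency), which is the usual way an exponential trendline is obtained, and gap values observed fewer than `min_count` times are dropped because their logarithms are too noisy (the cutoff of 30 is arbitrary).

```python
import math
import random
from collections import Counter

def simulated_gaps(n, p, seed=0):
    """Gaps between successive events when each time step is an event with probability p."""
    rng = random.Random(seed)
    gaps, last = Counter(), None
    for t in range(n):
        if rng.random() < p:
            if last is not None:      # the first event has no previous occurrence
                gaps[t - last] += 1
            last = t
    return gaps

def exponential_trendline(gaps, min_count=30):
    """Least-squares fit of log(frequency) = a + b * gap; returns (b, R-squared)."""
    total = sum(gaps.values())
    pts = [(g, math.log(c / total)) for g, c in gaps.items() if c >= min_count]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    syy = sum((y - my) ** 2 for _, y in pts)
    return sxy / sxx, (sxy * sxy) / (sxx * syy)

p = 0.1
gaps = simulated_gaps(1_000_000, p)
slope, r2 = exponential_trendline(gaps)
print(f"fitted slope {slope:.4f}, theoretical log(1 - p) {math.log(1 - p):.4f}, R^2 {r2:.4f}")
```

For a geometric distribution the log-frequencies fall on a straight line with slope log(1 − p), so the fitted slope gives a quick sanity check. Running `exponential_trendline` on the gap table computed from real observations, and comparing it with the simulated result, is one way to do the comparison described in the first bullet.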

If you have collected a large number of observations (say 10 million) you can do the testing on samples of increasing sizes (1,000, 10,000, 100,000 consecutive observations and so on) to see how fast the empirical distribution converges (or not) to the theoretical geometric distribution. You can also compare the behavior across samples (cross-validation), or across types of events (variance analysis). If your data set is too small (100 data points) or your events too rare (p less than 1%), consider increasing the size of your data set if possible.

Even with big data, if you are testing a large number of rare events (in our case, tons of long “words” such as 234567 rather than single digits in the decimal representation of SQRT(2)), expect many tests to fail to recognize true randomness, that is, to flag a departure from randomness purely by chance. You can even compute the probability for this to happen, assuming all your events are perfectly randomly distributed. This is known as the curse of big data.
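The probability mentioned above can be sketched under two simplifying assumptions that are not in the text: the tests are independent, and each one rejects randomness with probability alpha (the significance level) when the events are in fact random. The function name is made up for illustration.

```python
def prob_at_least_one_false_alarm(num_tests, alpha=0.05):
    """Chance that at least one of num_tests independent tests rejects
    randomness at level alpha when all events are truly random."""
    return 1 - (1 - alpha) ** num_tests

for m in (1, 10, 100, 1000):
    print(m, round(prob_at_least_one_false_alarm(m), 4))
```

With 100 independent tests at the 5% level, at least one spurious rejection is almost certain, which is why a handful of "significant" results among many tests says little by itself.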

3. Application to Number Theory Problem

Here, we further discuss the example used throughout this article to illustrate the concepts. Mathematical constants (and indeed the immense majority of all numbers) are thought to have their digits distributed as if they were randomly generated, see here for details.

Many tests have been performed on many well known constants (see here), and none of them was able to identify any departure from randomness. The gap test illustrated here is less well known, and when applied to SQRT(2), it was also unable to find departure from randomness. In fact, the fit with a random distribution, as shown in the figure below, is almost perfect.

There is a simple formula to compute any digit of SQRT(2) separately, see here, however it is not practical. Instead, we used a table of 10 million digits published here by NASA. The source claims that digits beyond the first five million have not been double-checked, so we only used the first 5 million digits. The summary gap table, methodological details, and the above picture can be found in my spreadsheet. You can download it here.

The above chart shows a perfect fit between the observed distribution of gap lengths (averaged across the 10 digits 0, 1, …, 9) between successive occurrences of the same digit in the first 5 million decimals of SQRT(2), and the geometric distribution model, using the Exponential trendline in Excel.

I also explored the last 2 million decimals available in the NASA table, and despite the fact that they have not been double-checked, they also display the exact same random behavior. Maybe these decimals are all wrong but the mechanism that generates them preserves randomness, or maybe all or most of them are correct.

A potential application is to use digits that appear to be randomly generated (like white noise, and the digits of SQRT(2) seem to fit the bill) in documents, at random positions that only the recipient could reconstruct, perhaps three or four random digits on average for each real character in the original document, before encrypting it, to increase security — a bit like steganography. Encoding the same document a second time would result in a different kind of white noise added to the original document, and peppered randomly, each time differently — with a different intensity, and at different locations each time. This would make the task of hackers more complicated.

Conclusion

Finally, this is an example where intuition can be wrong, and why you need data science. In the digits of SQRT(2), while looking at the first few hundred digits (see picture below), it looked to me like it was anything but random. There were too many 99, too few 37 (among other things), according to my intuition and visual inspection (you may call it gut feelings). It turns out that I was wrong. Look at the first few hundred digits below: chances are that your intuition will also mislead you into thinking that there are some patterns. This can be explained by the fact that patterns such as 99 are easily detected by the human brain and do stand out visually, yet in this case, they do occur with the right frequency if you use analytic tools to analyze the digits.

First few hundred digits of SQRT(2). Do you see any pattern?


## DECIPHERING BIG DECISIONS: A ROADMAP FOR DATA-DRIVEN ORGANIZATION.


Building a data-driven organization requires identifying and prioritising the big decisions that need to be supported with data-driven, actionable insights. The next big question is: how exactly does one go about identifying the 10% of decisions that influence 90% of business outcomes?

We have reasoned that the best way to create a roadmap for a data-driven organization is to identify those critical decisions and support them with analytics based on relevant data.

THE LOST ART OF DECISION MAKING: DECISION-DRIVEN BEFORE DATA-DRIVEN

The logical first step, obviously, was to list all the decisions taken in the company. We very quickly discovered that there had never been any documented list of decisions in the company. We then held informal discussions with some of the senior managers in different departments. The conversations were all similar, if not the same. Summary below:

1. Everyone believes they take a lot of decisions, though most cannot recall any one of them.
2. Everyone agrees the quality of decisions does make a difference to the quality of business-outcomes
3. Instinctively, they know some decisions are more important than others, but never paused to think or list down those important decisions.
4. When forced to list at least one important decision that they took in the preceding 12 months, those who could, invariably mentioned investment decisions involving large capital outlay.

Some asked me the equally disconcerting question back: “How about you? Do you have a list of important decisions that you take?” The honest answer was that I was equally guilty. I had never made my own list.

While there was no published research on developing a roadmap for a data-driven organization, there were a few well-written articles on the importance of identifying critical organizational decisions.

Taking inspiration from the Decision X-Ray process of Bain & Company, I created a questionnaire that I could distribute amongst the managers to collect and collate the list of organizational decisions. Bain’s case study refers to 33 critical decisions that managers at Nike identified…

However, my initial experience administering the questionnaire was a disaster. Most managers simply did not know what to list as a decision, and identifying critical decisions was even more difficult. Nearly all of them listed ‘investment decisions’ (which involved the purchase of capital goods or capitalisable services) as the most critical decisions that they take.

We also included a few questions on how exactly each of the decisions listed impacts the organization’s top line and bottom line, or the meeting of strategic objectives. Most managers had difficulty explaining the \$\$ impact of their decisions on business outcomes.

TYPES OF DECISIONS

The first recorded attempt at identifying and categorising decisions was made by the war hero (and 34th President) General Eisenhower.

The famous Eisenhower principle – Prioritising tasks by urgency and importance results in 4 quadrants with different work strategies: What is important is seldom urgent, and what is urgent is seldom important.

We saw a few typical textbook-style categorisations of decisions:

1. Programmed and non-programmed decisions
2. Major and minor decisions
3. Routine and strategic decisions
4. Organizational and personal decisions
5. Individual and group decisions
6. Policy and operating decisions

While we expected that there would be a mountain of published research on the categorisation of decisions, and a proven methodology for identifying the most critical organizational decisions, we were completely disappointed…

There was little to no published research on the importance of categorisation of decisions, let alone a process for identifying the top 10%.

The Notable Exceptions:

1. What Makes Strategic Decisions Different? – HBR November 2013 Issue, Phil Rosenzweig

While the article does not give any process for categorization of decisions, it does argue that categorization of decisions, and recognizing strategic decisions as different from routine decisions is an important first step before advising people on how to make better strategic decisions.

2. Decisions are Strategic. California Management Review, Vol 56, No.3, Spring 2014 (by Ram Shivakumar)

Under a chapter titled “A Framework for Evaluating Decisions”– it provides yet another 2×2 that has Degree of Commitment and the Scope of Firm on its two axes. The degree of commitment is reflected by the extent to which a particular decision is reversible; and the scope of a firm is defined by its choice of products, services, activities, and markets.

3. Bain & Company. Focus on Key Decisions, Bain Research, November 19, 2010 (Bain Brief By Marcia W. Blenko, Michael C. Mankins and Paul Rogers)

The most comprehensive and relevant article for our purpose. While we still do not get a definitive process for decision prioritisation, the article does provide the basic steps involved in the process. Excerpts below –

“An organization’s decision abilities determine its performance. Companies that make better decisions, make them faster and translate them into action more effectively nearly always outrun their competitors. But managers and employees in any large company make countless decisions every day. How can an individual manager or a leadership team know which decisions to focus on? How can it analyze those individual decisions to see what’s working and what isn’t?”

“(The article) shows how to identify your organization’s critical decisions, the ones that most affect results. And it shows how to use a tool we call a decision X-ray to expose the trouble spots and begin to identify improvements. Taken together, these actions can tune up your organization to deliver peak performance.”

The article further describes critical decisions as being of high value, or of high value over time, and applies the 80-20 rule to identify them.

The process for identifying critical decisions (as per the article) is in two steps –

1. Creating a Decision Architecture: Mapping the business processes of the different parts of the organization involved in value creation, so as to list all organizational decisions.
2. Winnowing the list: Prioritizing decisions by value, followed by the degree of management attention required.

The article further mentions the “Decision X-ray process” (which Bain seems to have used extensively at Nike), which is essentially a diagnosis of a day in the life of a decision: what works and what doesn’t, enablers, etc.

Our Journey

While one gets the drift of the idea Bain conveys through the article, we thought it was too simplistic to “winnow the list” by just two factors: 1) the cumulative value of the decision, and 2) the management attention required. How much management attention is required is an abstract number at best.

We worked on a series of alternative models that could work across a diverse set of businesses. We finally concluded that the key criterion for decision prioritisation is its impact on business outcomes; hence one must identify those 10-20% of decisions which impact 90% of business outcomes.

Over time, we have evolved a methodology to score decisions based on their impact on business outcomes, and to run a Pareto analysis to identify the key 10% of decisions.
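The scoring methodology itself is not detailed here, but the Pareto step can be illustrated with a generic sketch. Everything in it is hypothetical: the function name `pareto_cutoff`, the decision names and the impact scores are made up for illustration, and the 90% target is simply the figure quoted above.

```python
def pareto_cutoff(impact_by_decision, target_share=0.9):
    """Return the smallest set of top-ranked decisions whose combined
    impact reaches target_share of the total impact."""
    total = sum(impact_by_decision.values())
    ranked = sorted(impact_by_decision.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0.0
    for name, impact in ranked:
        if running >= target_share * total:
            break                     # target reached, the remaining decisions are the long tail
        selected.append(name)
        running += impact
    return selected

# Hypothetical annualised impact scores, for illustration only
scores = {
    "pricing": 50.0, "capital investment": 30.0, "vendor selection": 12.0,
    "hiring plan": 3.0, "travel policy": 2.0, "office layout": 2.0, "stationery": 1.0,
}
key_decisions = pareto_cutoff(scores)
print(key_decisions)
```

In this toy data, three of the seven decisions carry more than 90% of the total score. Real lists are much longer, and the share of key decisions will differ from one company to another.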

RECAPTURING THE PURPOSE

• Building a data-driven organization requires supporting organizational decisions with data & analytics. Given that not all decisions are equally important, decisions need to be prioritised based on relative value.

• If the first step in building a data-driven organization is to choose the right decisions to support with data, then the next logical questions to answer are:
  • Which decisions?
  • How to categorise decisions?
  • How to prioritise decisions?
• As in economics, the Pareto principle applies in business too: we find that 80-90% of business outcomes are influenced by 10-20% of the decisions. The question is: which 10-20%?
  • Those taken by CxOs?
  • Capital Investment decisions?
  • Strategic Direction decisions?

DECISION PRIORITISATION: FACTORS TO CONSIDER

When we started listing the criteria for decision prioritisation, we very quickly agreed that the value of the decision and the scope of business the decision affects are two primary factors that need to be taken into account, pretty much the same set of factors mentioned in Bain’s article quoted above. We also quickly recognized that there are repetitive decisions which individually are of small value, but add up to a much larger value over a period of time.

But as we started listing decisions and discussing them with the decision makers, we were forced to list several other important factors that might significantly affect the prioritisation.

We found some decisions are far more complex than others, while some decisions can be rule-based, and routine. While some decisions lead to measurable outcomes, several others do not.

Some decisions lead to immediate results and outcomes in the near term, while several others lead to outcomes that can only be felt over a long period of time. Further, some decisions are reversible, while many others are not. Then there are decisions that need to be taken with very little or no notice, while others can be taken at leisure, after carefully considering all the implications. Unplanned decisions are usually reactionary and are taken in response to an unplanned, disruptive event.

We also felt that planned decisions which emanate from the organization’s strategy map and balanced scorecard need a higher priority.

So, the list of factors (not in any particular order) which influence decisions is given below:

• Originating from Balanced Scorecard
• Strategic-Tactical-Operational
• Decision Span (LoBs / Revenue Streams / Markets Covered)
• Impact & Value to Business
• Customer Experience
• Frequency
• Time-Frame Available to Decide
• Time to Outcome
• Retractability
• Long-Term Impact
• Opportunity Cost
• Availability of Quality Data
• Complexity
• Uncertainty
• Cost-Benefit

In our experience, the weightage to be accorded to each of these factors will have to be different in different companies, in different lines of business, and in different countries. So the exercise of decision prioritisation gets complicated once again if you are dealing with a multinational operating in diverse businesses.

Added to the above would be what consulting companies like Decision Management Solutions call ‘dimensions of decisions’. (I wish I had seen the meaningful work this company does sooner, but I only got to see most of it last year. Decision Management Solutions categorizes decisions based on their relative score on 9 different dimensions: Repeatability, Measurability, Time to Outcome, Approach, Upside vs. Downside, Regulation, Available Data, Historical Data and Uncertainty.)

We however used the following, and they are not very different from the criteria one uses for categorization of decisions, except perhaps that each of the variables here can be clearly measured.

DECISION PRIORITIZATION: PUTTING TOGETHER A PROCESS

Finally, when we started putting together a process for decision prioritisation, what became very clear is that we need to list the decisions in the ‘decreasing order of annualised-\$\$ value-impact-on the business’.

One needs extra care to understand the words ‘cumulative’ and ‘annualised’ here. ‘Cumulative’ refers to the collective value of small-value decisions which together have a substantially large impact on the organization’s business outcomes, while ‘annualised’ refers to the fact that the \$\$ impact of a decision may spread across more than one financial year, and hence the data needs to be normalised to the value in one financial year.

Following Pareto’s principle, the actual number of decisions which impact 80-90% of business value is expected to be small… while the actual number and the percentage may differ in different companies.

However, we could clearly see that once we assign a score against each of the decision dimensions, the whole perspective of which decision is important could change when we do a cross-dimensional comparison.

Reference Fig shown here – The size of the bubble indicates the \$\$ value of the investment into each of the decisions.

The decision which has the highest ‘cumulative annualised value’ could turn out to be the most complex decision, perhaps with a strategic long-term impact, while another decision with a high cumulative annualised value that is relatively less complex can be a more obvious priority for data-driven decision making.

A few decisions which represent the highest cumulative annualised value, besides being the most complex, may also have little to no data available. Hence, such decisions tend to be largely intuitive, based purely on the experience of the managers.

CONCLUSION:

Cross-dimensional comparison helps in identifying if a specific decision qualifies as one among the top-10% opportunities for data-driven managerial interventions which potentially influence 90% of business-outcomes.

## Boost your data science skills. Learn linear algebra.


I’d like to introduce a series of blog posts and their corresponding Python Notebooks gathering notes on the Deep Learning Book from Ian Goodfellow, Yoshua Bengio, and Aaron Courville (2016). The aim of these notebooks is to help beginners and advanced beginners grasp the linear algebra concepts underlying deep learning and machine learning. Acquiring these skills can boost your ability to understand and apply various data science algorithms. In my opinion, it is one of the bedrocks of machine learning, deep learning and data science.

These notes cover chapter 2, on linear algebra. I liked this chapter because it gives a sense of what is most used in the domain of machine learning and deep learning. It is thus a great syllabus for anyone who wants to dive into deep learning and acquire the linear algebra concepts needed to better understand deep learning algorithms.

### Getting started with linear algebra

The goal of this series is to provide content for beginners who want to understand enough linear algebra to be comfortable with machine learning and deep learning. However, I think that the chapter on linear algebra from the book is a bit tough for beginners. So I decided to produce code, examples and drawings for each part of this chapter in order to add steps that may not be obvious for beginners. I also think that you can convey as much information and knowledge through examples as through general definitions. The illustrations are a way to see the big picture of an idea. Finally, I think that coding is a great tool to experiment concretely with these abstract mathematical notions. Along with pen and paper, it adds a layer of things you can try in order to push your understanding toward new horizons.


Graphical representation is also very helpful for understanding linear algebra. I tried to tie the concepts to plots (and the code to produce them). The representation I liked most while doing this series is seeing any matrix as a linear transformation of space. In several chapters we will extend this idea and see how it can be useful to understand eigendecomposition, Singular Value Decomposition (SVD) or the Principal Components Analysis (PCA).

### The use of Python/Numpy

In addition, I noticed that creating and reading examples is really helpful for understanding the theory. That is why I built Python notebooks. The goal is twofold:

1. To provide a starting point for using Python/Numpy to apply linear algebra concepts. Since the final goal is to use linear algebra concepts for data science, it seems natural to go back and forth continuously between theory and code. All you will need is a working Python installation with the major mathematical libraries like Numpy/Scipy/Matplotlib.

2. To give a more concrete vision of the underlying concepts. I found it hugely useful to play and experiment with these notebooks in order to build my understanding of somewhat complicated theoretical concepts or notations. I hope that reading them will be as useful.

### Syllabus

The syllabus follows the book exactly, so you can find more details there if you can’t understand a specific point while reading. Here is a short description of the content:

1. Scalars, Vectors, Matrices and Tensors

A light introduction to vectors, matrices, the transpose and basic operations (addition of vectors or matrices). It also introduces Numpy functions and ends with a word on broadcasting.

2. Multiplying Matrices and Vectors

This chapter is mainly on the dot product (vector and/or matrix multiplication). We will also see some of its properties. Then, we will see how to synthesize a system of linear equations using matrix notation. This is a major process for the following chapters.

3. Identity and Inverse Matrices

We will see two important matrices: the identity matrix and the inverse matrix. We will see why they are important in linear algebra and how to use them with Numpy. Finally, we will see an example of how to solve a system of linear equations with the inverse matrix.

4. Linear Dependence and Span

In this chapter we will continue to study systems of linear equations. We will see that such systems have either no solution, exactly one solution, or an infinite number of solutions (never more than one but fewer than infinitely many). We will see the intuition, the graphical representation and the proof behind this statement. Then we will go back to the matrix form of the system and consider what Gilbert Strang calls the *row figure* (we are looking at the rows, that is to say multiple equations) and the *column figure* (looking at the columns, that is to say the linear combination of the coefficients). We will also see what a linear combination is. Finally, we will see examples of overdetermined and underdetermined systems of equations.

5. Norms

The norm of a vector is a function that takes a vector as input and outputs a non-negative value. It can be thought of as the *length* of the vector. It is for example used to evaluate the distance between the prediction of a model and the actual value. We will see different kinds of norms (L⁰, L¹, L²…) with examples.

6. Special Kinds of Matrices and Vectors

We have seen some special matrices that are very interesting. We will see other types of vectors and matrices in this chapter. It is not a big chapter but it is important for understanding the next ones.

7. Eigendecomposition

We will see some major concepts of linear algebra in this chapter. We will start by getting some ideas on eigenvectors and eigenvalues. We will see that a matrix can be seen as a linear transformation and that applying a matrix to its eigenvectors gives new vectors with the same direction. Then we will see how to express quadratic equations in matrix form. We will see that the eigendecomposition of the matrix corresponding to the quadratic equation can be used to find its minimum and maximum. As a bonus, we will also see how to visualize linear transformations in Python!

8. Singular Value Decomposition

We will see another way to decompose matrices: the Singular Value Decomposition or SVD. Since the beginning of this series, I have emphasized the fact that you can see matrices as linear transformations in space. With the SVD, you decompose a matrix into three other matrices. We will see that these new matrices can be seen as *sub-transformations* of the space. Instead of doing the transformation in one movement, we decompose it into three movements. As a bonus, we will apply the SVD to image processing. We will see the effect of SVD on an example image of Lucy the goose, so keep on reading!

9. The Moore-Penrose Pseudoinverse

We saw that not all matrices have an inverse. It is unfortunate because the inverse is used to solve systems of equations. In some cases, a system of equations has no solution, and thus the inverse doesn’t exist. However, it can be useful to find a value that is almost a solution (in terms of minimizing the error). This can be done with the pseudoinverse! We will see for instance how we can find the best-fit line of a set of data points with the pseudoinverse.

10. The Trace Operator

We will see what the trace of a matrix is. It will be needed for the last chapter on Principal Components Analysis (PCA).
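A short sketch of the trace and two of its standard properties, assuming NumPy and two small matrices invented for the example:

```python
import numpy as np

# Hypothetical matrices (illustrative values only).
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [5.0, 2.0]])

# The trace is the sum of the diagonal entries.
trace_A = np.trace(A)

# The squared Frobenius norm can be written as a trace: ||A||_F^2 = Tr(A A^T).
frobenius_squared = np.trace(A @ A.T)

# The trace is unchanged when the order of a product is swapped: Tr(AB) = Tr(BA).
trace_AB = np.trace(A @ B)
trace_BA = np.trace(B @ A)

print(trace_A, frobenius_squared, trace_AB, trace_BA)
```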

11. The Determinant

This chapter is about the determinant of a matrix. This special number can tell us a lot of things about our matrix!
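One thing the determinant tells us is how a transformation scales areas, and whether the matrix can be inverted at all. The sketch below assumes NumPy and uses two matrices chosen for illustration.

```python
import numpy as np

# Hypothetical scaling matrix: stretches x by 2 and y by 3.
scaling = np.array([[2.0, 0.0],
                    [0.0, 3.0]])
# Areas are multiplied by the absolute value of the determinant.
det_scaling = np.linalg.det(scaling)

# Hypothetical matrix whose second row is twice the first.
collapsing = np.array([[1.0, 2.0],
                       [2.0, 4.0]])
# A zero determinant means the plane is squashed onto a line: no inverse exists.
det_collapsing = np.linalg.det(collapsing)

print(det_scaling, det_collapsing)
```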

12. Example: Principal Components Analysis

This is the last chapter of this series on linear algebra! It is about Principal Components Analysis (PCA). We will use some knowledge that we acquired in the preceding chapters to understand this important data analysis tool!
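To give a flavor of how the earlier chapters come together, here is a minimal PCA sketch based on the eigendecomposition of the covariance matrix. It assumes NumPy, and the data is synthetic: random points generated to lie close to a diagonal line.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data lying close to the diagonal (illustrative only).
t = rng.normal(size=200)
X = np.column_stack([t, t + 0.1 * rng.normal(size=200)])

# Center the data, then compute its covariance matrix.
X_centered = X - X.mean(axis=0)
covariance = X_centered.T @ X_centered / (len(X_centered) - 1)

# Eigenvectors of the covariance matrix are the principal directions.
eigenvalues, eigenvectors = np.linalg.eigh(covariance)
order = np.argsort(eigenvalues)[::-1]  # sort by decreasing variance
eigenvalues, components = eigenvalues[order], eigenvectors[:, order]

# Share of the total variance captured by each component.
explained_ratio = eigenvalues / eigenvalues.sum()

# Project the data onto the first principal component (2-D to 1-D).
projected = X_centered @ components[:, :1]

print(explained_ratio)
```

Because the points were generated along a line, almost all of the variance ends up in the first component.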

### Requirements

This content is aimed at beginners but it should be easier for people with at least some experience with mathematics.

### Enjoy

I hope that you will find something interesting in this series. I tried to be as accurate as I could. If you find errors, misunderstandings or typos, please report them! You can send me emails or open issues and pull requests in the notebooks’ GitHub repository.

### References

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

## Soccer and Machine Learning: 2 hot topics for 2018

## How good is a certain soccer player? Let’s find out by applying Machine Learning to FIFA 18!

I’m sure you’ve heard about the 2018 FIFA Football World Cup in Russia everywhere during the last few months. And, if you are a techy too, I guess you have also realized that Machine Learning and Artificial Intelligence are buzzwords too. So, what better way to get ready for the World Cup than by practicing on a project that combines these two hot topics? In order to do that, we’re going to leverage a dataset of the FIFA 18 video game. My goal is to show you how to create a predictive model that is able to forecast how good a soccer player is based on their game statistics (using Python in a Jupyter Notebook).

FIFA is one of the most well-known video games around the world. You’ve probably played it at least once, right? Although I’m not a fan of video games, when I saw this dataset collected by Aman Srivastava, I immediately thought that it was great for practicing some of the basics of any Machine Learning Project.

The FIFA 18 dataset was scraped from the website sofifa.com and contains statistics and more than 70 attributes for each player in the full version of FIFA 18. In this GitHub project, you can access the CSV files that compose the dataset and some Jupyter notebooks with the Python code used to collect the data.

Having said this, now let’s start!

### Getting started with the machine learning tutorial

We explained most of the basic concepts related to smart systems and how machine learning techniques could add smart capabilities to many kinds of systems in almost any domain that you can imagine. Among other things, we learned that a typical workflow for a Machine Learning Project usually looks like the one shown in the image below:

In this post we’ll go through a simplified view of this whole process, with a practical implementation of each phase. The main objective is to show most of the common steps performed during any machine learning project. Therefore, you could use it as a starting point in case you need to address a machine learning project from scratch.

#### In what follows, we will:

• Apply some preprocessing steps to prepare the data.
• Then, we will perform a descriptive analysis of the data to better understand its main characteristics.
• We will continue by practicing how to train different machine learning models using scikit-learn, one of the most popular Python libraries for machine learning. We will also use a subset of the dataset for training purposes.
• Then, we will iterate and evaluate the learned models by using unseen data. Later, we will compare them until we find a good model that meets our expectations.
• Once we have chosen the candidate model, we will use it to perform predictions and to create a simple web application that consumes this predictive model.
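The train-then-evaluate-on-unseen-data steps listed above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic rather than the FIFA 18 dataset, the three attributes and their weights are invented, and a plain NumPy least-squares fit stands in for the scikit-learn models used in the tutorial.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for player statistics: 3 numeric attributes per player
# and a rating that depends on them linearly plus noise (all hypothetical).
n_players = 500
X = rng.uniform(30, 99, size=(n_players, 3))
true_weights = np.array([0.5, 0.3, 0.2])
y = X @ true_weights + rng.normal(scale=2.0, size=n_players)

# 1. Hold out 20% of the players as unseen data.
indices = rng.permutation(n_players)
test_idx, train_idx = indices[:100], indices[100:]

def add_bias(features):
    """Append a column of ones so the model can learn an intercept."""
    return np.column_stack([features, np.ones(len(features))])

# 2. Train: fit a linear model on the training subset only.
weights, *_ = np.linalg.lstsq(add_bias(X[train_idx]), y[train_idx], rcond=None)

# 3. Evaluate: measure the mean absolute error on the held-out players.
predictions = add_bias(X[test_idx]) @ weights
mean_absolute_error = np.mean(np.abs(predictions - y[test_idx]))

print(mean_absolute_error)
```

Evaluating on players the model never saw during training is what makes the error estimate trustworthy; comparing several models on the same held-out set is how a candidate is chosen.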

At the end we will arrive at a fun smart app like the one below. It will be able to predict how good a soccer player is based on their game statistics. Sounds cool, yeah? Well, let’s dive in!

### 1. Preparing the Data

Generally, any machine learning project has an initial stage known as data preparation, data cleaning or the preprocessing phase.

Its main objective is to collect and prepare the data that the learning algorithms will use during the training phase. In our practical and concrete example, an important part of this was already addressed by Aman Srivastava when he scraped different pages from the website sofifa.com. In his GitHub project you can access some of the Jupyter notebooks with the Python code that acts as the data preprocessing modules that were applied to get and generate the original dataset for our project. Below, as an example, we can see the module that does the web scraping of the raw data (HTML format) and how it transforms the data into a Pandas dataframe (Pandas is a famous Python library for data processing). Finally, it generates a CSV file with the results. In a way, this data preparation step is similar to the classic ETL (extract, transform, load) database processes.

## Data Fallacies to Avoid | An Illustrated Collection of Mistakes People Often Make When Analyzing Data

Recently my team embarked on a mission to make data more accessible and useful to everyone, creating free resources to utilise whilst studying, analyzing and interpreting data.

The first resource we created was ‘Data Fallacies to Avoid’, an illustrated collection of mistakes people often make when analyzing data.

1. Cherry Picking
The practice of selecting results that fit your claim and excluding those that don’t. The worst and most harmful example of being dishonest with data.

2. Data Dredging
Data dredging is the failure to acknowledge that a correlation found by repeatedly testing hypotheses against the same data was, in fact, the result of chance.

3. Survivorship Bias
Drawing conclusions from an incomplete set of data, because that data has ‘survived’ some selection criteria.

4. Cobra Effect
When an incentive produces the opposite of the intended result. Also known as a Perverse Incentive.

5. False Causality
To falsely assume when two events occur together that one must have caused the other.

6. Gerrymandering
The practice of deliberately manipulating boundaries of political districts in order to sway the result of an election.

7. Sampling Bias
Drawing conclusions from a set of data that isn’t representative of the population you’re trying to understand.

8. Gambler’s Fallacy
The mistaken belief that because something has happened more frequently than usual, it’s now less likely to happen in future and vice versa.

9. Hawthorne Effect
When the act of monitoring someone can affect that person’s behaviour. Also known as the Observer Effect.

10. Regression Fallacy
Failing to account for the fact that when something happens that’s unusually good or bad, over time it will revert back towards the average.

11. Simpson’s Paradox
A phenomenon in which a trend appears in different groups of data but disappears or reverses when the groups are combined.

12. McNamara Fallacy
Relying solely on metrics in complex situations can cause you to lose sight of the bigger picture.

13. Overfitting
A more complex explanation will often describe your data better than a simple one. However, a simpler explanation is usually more representative of the underlying relationship.

14. Publication Bias
How interesting a research finding is affects how likely it is to be published, distorting our impression of reality.

15. Danger of Summary Metrics
It can be misleading to only look at the summary metrics of data sets.
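Some of these fallacies are easier to grasp with numbers. As one small illustration, the sketch below reproduces the reversal described in the list above (a trend that holds in every group but flips when the groups are combined, known as Simpson's paradox). The success counts are hypothetical and were chosen only to make the effect visible.

```python
# Hypothetical (successes, trials) counts for two options, A and B, in two groups.
results = {
    "group 1": {"A": (8, 10), "B": (70, 100)},
    "group 2": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes, trials):
    """Success rate as a fraction between 0 and 1."""
    return successes / trials

# Which option has the higher success rate inside each group?
per_group_winner = {
    group: max(options, key=lambda name: rate(*options[name]))
    for group, options in results.items()
}

# Which option has the higher success rate once the groups are pooled?
combined = {}
for name in ("A", "B"):
    successes = sum(results[group][name][0] for group in results)
    trials = sum(results[group][name][1] for group in results)
    combined[name] = rate(successes, trials)
overall_winner = max(combined, key=combined.get)

print(per_group_winner, combined, overall_winner)
```

Option A wins inside each group, yet option B wins overall, because most of A's trials fall in the group where both options do poorly.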

We then transformed the collection into a wall poster designed to hang in educational facilities, workplaces and other areas where these mistakes are often made.