Manish Kumar Barnwal, Jun 6, 2017
The essence of machine learning is function estimation
Machine learning is cool. There is no denying that. In this post we will try to make it a little uncool; well, it will still be cool, but you may start looking at it differently. Machine learning is not a black box. It is intuitive, and this post is here to convey exactly that.
If I give you this function f(x) = x^2 + log(x)
and ask you to tell me what will be f(2),
you will first laugh at me and then run away to do something important. This is trivial for you, right? If there is a function that maps inputs to outputs, then it is very easy to get the output for any new input.
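As a quick sanity check, that lookup is a one-liner in code (taking the log to be the natural logarithm, since the base was not specified):

```python
import math

def f(x):
    # f(x) = x^2 + log(x), with log taken as the natural logarithm
    return x ** 2 + math.log(x)

print(f(2))  # 4 + ln(2) ≈ 4.693
```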
Machine learning helps you get a function that can map the input to the output. How does it do it? What is this function? We will try to answer such questions in the paragraphs below.
Let us try to answer the above questions using a problem that can be solved with machine learning. Assume you are a technical recruiter. You have been running a recruitment firm for the last 3 years. Being tech savvy, you follow the latest trends in technology, and you came to know about machine learning. You understand that machine learning can be used to predict the future, given that you have data from the past.
You wondered: how can I use it to predict the expected salary of a candidate given other factors? The first question that comes to your mind is: do you have the data? And you hear a pleasant yes!
You have the following data collected at the individual level:
- Age of the candidate
- Gender of the candidate
- Number of years of experience
- Highest level of education degree
- College – top notch, average, normal
- Current salary
- Sector – IT, Finance, Electronics
- Salary
And a few others. For now, let us assume we have just these features and we want to predict the expected salary using them. We have 3 years of data, approximately 10,000 rows. So your dataset looks something like the table below:
So essentially we have seven independent features, X – age, gender, years of experience, highest level of education, college, current salary, and sector – and the corresponding salary, Y. Next time we have a candidate, we will obviously have their age, gender, years of experience, and other features. What we won't have is their salary. And that is the value we want to estimate.
There would be some function, say f that would map these X to the Y values. How do we find this function? We will use the 3 years of data we have – the training data.
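To make the training data concrete, here is a minimal sketch of what a couple of rows might look like; the values are entirely made up for illustration:

```python
# Each row holds the X features plus the observed Y (salary). Values invented.
training_data = [
    {"age": 27, "gender": "F", "experience": 4, "education": "Masters",
     "college": "top notch", "current_salary": 900000, "sector": "IT",
     "salary": 1200000},
    {"age": 31, "gender": "M", "experience": 8, "education": "Bachelors",
     "college": "average", "current_salary": 1100000, "sector": "Finance",
     "salary": 1400000},
]

# X is everything except the target; Y is the salary column we want to predict.
X = [{k: v for k, v in row.items() if k != "salary"} for row in training_data]
Y = [row["salary"] for row in training_data]
```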
We won’t be able to find the actual function, the true function f, because we don’t have all of the data in the world. You can’t collect the entire population of data; it is impossible. What we use is a sample of data from the population, and we use this sample as our training data.
Many a time, there are factors that can’t be captured. The set of independent features we have captured is not exhaustive; there will obviously be other features that have an impact on the salary.
Say, in our salary prediction example, a factor like exclusive and exceptional knowledge of some rare topic may land a candidate exorbitant offers from a few companies. It is difficult to capture factors like these.
Now we understand why we can’t have the true f. So we will try to get an estimate of f, say f^. We want this f^ to be as close as possible to the true f, i.e. a proxy for the true function. There will obviously be an error in estimating the true function, and we want to make this error as small as possible. How do we go about getting this f^, the estimate of the true function f?
We have the data, remember – the 3 years of historical data containing the X features and the corresponding Y values. This is called the training data, and there is a reason for the name: we use this data to train the underlying algorithms to get the estimated function f^.
You get that we use the training data to obtain the estimated function f^. But how do we do it? We try to minimize the error between the true salary Y and the predicted salary Y^ from the model. For now, understand that there is a way to minimize this error and get the estimated function.
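One common way to measure that error is the mean squared error between the true values Y and the predictions Y^; a minimal sketch on made-up numbers:

```python
def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between actual and predicted values
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [100, 150, 200]   # actual salaries (in thousands, invented)
y_pred = [110, 140, 195]   # model predictions
print(mean_squared_error(y_true, y_pred))  # (100 + 100 + 25) / 3 = 75.0
```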
Now, this function could be a simple one, like a linear relationship between the salary and the features, or, as is often the case, a complex relationship that is not linear. There are techniques – say linear regression or decision trees – that help you get a simple estimate or a complex one respectively.
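For the simple linear case, the estimate f^ can even be found in closed form. A sketch with a single feature (years of experience) against salary, on toy numbers chosen to be exactly linear:

```python
# Least-squares fit of salary = a * experience + b on toy data
experience = [1, 2, 3, 4, 5]
salary = [300, 500, 700, 900, 1100]  # in thousands, invented

n = len(experience)
mean_x = sum(experience) / n
mean_y = sum(salary) / n
# Closed-form least-squares slope and intercept
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(experience, salary)) / \
    sum((x - mean_x) ** 2 for x in experience)
b = mean_y - a * mean_x

print(a, b)       # 200.0 100.0 on this data
print(a * 6 + b)  # predicted salary for 6 years of experience: 1300.0
```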
Once you have this estimate of the function,
f(age, gender, years of experience, highest level of education, college, current salary, sector) —> salary
you just pass in the X and you should get your Y. There, you have a machine learning model. And you know what you have done – you have just come up with a nice estimate of the true function.
Once you have this estimate, there are other questions you might want to think over. How good an estimate of the true function is it? What assumptions did you make to estimate it? When would this estimate not be a good choice? I will try to answer these questions in future posts. For now, I hope you get the gist that the essence of machine learning is function estimation.
I write more about machine learning, big data, and life experiences here.
Did you find the article useful? If you did, share your thoughts in the comments. Share this post with people who you think would enjoy reading it. Let’s talk more data science.
Manish Kumar Barnwal, Jun 1, 2017
Random Forests explained intuitively
The Random Forests algorithm has always fascinated me. I like how it can be easily explained to anyone without much hassle. One quick example I use very frequently to explain the working of random forests is the way a company conducts multiple rounds of interviews to hire a candidate. Let me elaborate.
Say you appeared for the position of statistical analyst at WalmartLabs. Like most companies, they don’t have just one round of interviews; you go through multiple rounds. Each of these interviews is chaired by an independent panel. Each panel assesses the candidate separately and independently. Generally, even the questions asked in these interviews differ from round to round. Randomness is important here.
The other thing of utmost importance is diversity. The reason we have a panel of interviews is that we assume a committee of people generally takes a better decision than a single individual. Now, this committee is not just any collection of people. We make sure the interview panels are diversified in terms of the topics covered in each interview, the type of questions asked, and many other details. You don’t go about asking the same question in each round of interviews.
After all the rounds of interviews, the final call on whether to select or reject the candidate is based on the majority decision of the panels. If, out of 5 panels of interviewers, 3 recommend a hire and 2 are against it, we tend to go ahead and select the candidate. I hope you get the gist.
If you have heard about decision trees, then you are not very far from understanding what random forests are. There are two keywords here – random and forests. Let us first understand what forest means. A random forest is a collection of many decision trees. Instead of relying on a single decision tree, you build many decision trees, say 100 of them. And you know what a collection of trees is called – a forest. So now you understand why it is called a forest.
Why is it called random then?
Say our dataset has 1,000 rows and 30 columns. There are two levels of randomness in this algorithm:
- At the row level: each decision tree gets a random sample of the training data (say 10%), i.e. each tree is trained independently on 100 randomly chosen rows out of the 1,000 rows of data. Because each tree sees a different random subset of rows, the trees differ from each other in their predictions.
- At the column level: the second level of randomness is introduced at the column level. Not all the columns are passed into the training of each decision tree. Say we want only 10% of the columns to be sent to each tree. This means 3 randomly selected columns will be sent to each tree. So for the first decision tree, maybe columns C1, C2 and C4 were chosen; the next tree will have C4, C5 and C10 as its chosen columns, and so on.
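The two levels of randomness above can be sketched directly; the 10% figures follow the example in the text:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible
n_rows, n_cols = 1000, 30
row_fraction, col_fraction = 0.10, 0.10  # 10% of rows and columns per tree

trees = []
for t in range(5):  # 5 trees just for illustration; real forests use many more
    # Row-level randomness: each tree sees its own random subset of rows
    rows = random.sample(range(n_rows), int(n_rows * row_fraction))
    # Column-level randomness: each tree also sees its own subset of columns
    cols = random.sample(range(n_cols), int(n_cols * col_fraction))
    trees.append({"rows": rows, "cols": cols})

print(trees[0]["cols"])  # the 3 column indices chosen for the first tree
```

Note that standard implementations actually bootstrap the rows with replacement and re-draw the candidate columns at every split of a tree rather than once per tree, but the idea is the same.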
Let me draw an analogy. Let us now understand how the interview selection process resembles the random forest algorithm. Each panel in the interview process is actually a decision tree. Each panel gives a verdict on whether the candidate is a pass or a fail, and then the majority of these verdicts is declared final. Say there were 5 panels: 3 said yes and 2 said no. The final verdict will be yes.
Something similar happens in a random forest as well. The results from each of the trees are taken, and the final result is declared accordingly: voting is used to predict in the case of classification, and averaging in the case of regression.
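The aggregation step can be sketched as follows – a majority vote for classification, a plain average for regression:

```python
from collections import Counter

def aggregate_classification(votes):
    # Majority vote across trees: the most common predicted class wins
    return Counter(votes).most_common(1)[0][0]

def aggregate_regression(predictions):
    # Average the numeric predictions coming from each tree
    return sum(predictions) / len(predictions)

# 5 'panels' (trees): 3 say hire, 2 say no hire -> final verdict is hire
print(aggregate_classification(["hire", "hire", "no hire", "hire", "no hire"]))
print(aggregate_regression([1200, 1150, 1300, 1250]))  # 1225.0
```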
With the huge computational power at our disposal, we hardly think for even a second before applying random forests, and very conveniently our predictions are made. Let us now try to understand other aspects of this algorithm.
When is a random forest a poor choice relative to other algorithms?
- Random forests don’t train well on smaller datasets, as they fail to pick up on the pattern. To simplify, say we know that 1 pen costs INR 1, 2 pens cost INR 2, and 3 pens cost INR 3. In this case, linear regression will easily estimate the cost of 4 pens, but a random forest will fail to come up with a good estimate.
- There is a problem of interpretability with random forests. You can’t see or understand the relationship between the response and the independent variables. Understand that a random forest is a predictive tool, not a descriptive tool. You get variable importance, but this may not suffice in many analyses where the objective is to see the relationship between the response and the independent features.
- The time taken to train random forests may sometimes be too long, as you train multiple decision trees. Also, in the case of a categorical variable, the time complexity increases exponentially: for a categorical column with n levels, a random forest tries splits at 2^(n-1) - 1 points to find the best splitting point. However, with the power of H2O we can now train random forests pretty fast. You may want to read about H2O at H2O in R explained.
- In the case of a regression problem, the range of values the response variable can take is determined by the values already available in the training dataset. Unlike linear regression, decision trees – and hence random forests – can’t predict values outside the training data.
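The extrapolation point can be seen with a toy comparison. Here the tree is stubbed as "predict the target of the nearest training point", which, like a real decision tree leaf, can never return a value outside the training range, while a linear fit extrapolates freely. A minimal sketch on perfectly linear pen-cost data:

```python
# Toy data: cost is exactly linear in the number of pens (invented)
pens = [1, 2, 3]
cost = [1, 2, 3]

def linear_predict(x):
    # Slope/intercept obtained by least squares; on this data, cost = 1 * pens
    return 1.0 * x + 0.0

def tree_style_predict(x):
    # Stub for a tree leaf: return the target of the nearest training point.
    # Like a real decision tree, it cannot predict outside the training range.
    nearest = min(range(len(pens)), key=lambda i: abs(pens[i] - x))
    return cost[nearest]

print(linear_predict(10))      # 10.0 — extrapolates past the training data
print(tree_style_predict(10))  # 3 — capped at the largest training value
```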
What are the advantages of using a random forest?
- Since we are using multiple decision trees, the bias remains the same as that of a single decision tree, but the variance decreases, and thus we decrease the chances of overfitting. I have explained bias and variance intuitively at The curse of bias and variance.
- When all you care about is the predictions and you want a quick and dirty way out, random forests come to the rescue. You don’t have to worry much about the assumptions of the model or linearity in the dataset.
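The variance-reduction claim is easy to check numerically: averaging many noisy but unbiased predictors keeps the prediction centred on the same value while shrinking its spread. A small simulation sketch:

```python
import random
import statistics

random.seed(0)
true_value = 100.0

def noisy_predictor():
    # One 'tree': an unbiased but noisy estimate of the true value
    return true_value + random.gauss(0, 10)

# Spread of single predictions vs. spread of averages over 50 predictions
singles = [noisy_predictor() for _ in range(1000)]
averages = [statistics.mean(noisy_predictor() for _ in range(50))
            for _ in range(1000)]

print(statistics.pstdev(singles))   # roughly 10
print(statistics.pstdev(averages))  # roughly 10 / sqrt(50), much smaller
```

In a real forest the trees are correlated (they share training data), so the reduction is smaller than this independent-predictor idealisation, which is exactly why the row and column randomness matters: it decorrelates the trees.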
I will soon add R code snippets as well, to give an idea of how this is executed in practice.
I write more on data science, machine learning and life experiences at my blog. Please stop by.