When to Categorize a Continuous Predictor in a Regression Model?
Many research fields routinely categorize continuous predictor variables, and they tend to be the same fields that rely mostly on ANOVA. The usual method is a median split: values above the median are labeled high and values below it are labeled low. This is rarely a good idea, for several reasons:
- The median varies from sample to sample, so the categories mean different things in different samples.
- All values on one side of the median are treated as equivalent. Any variation within a category is ignored, while two values sitting next to each other on either side of the median are treated as different.
- The categorization is arbitrary, because a "high" score is not necessarily high. If the scale is skewed, as many are, even a value near the low end can land in the high category.
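The problems listed above can be seen in a minimal sketch. The two samples below are hypothetical scores on a 1-10 scale, invented purely for illustration, and `median_split` is an assumed helper name rather than a library function.

```python
from statistics import median

def median_split(values):
    """Return the sample median and a 'high'/'low' label for each value."""
    cut = median(values)
    labels = ["high" if v > cut else "low" for v in values]
    return cut, labels

# Hypothetical scores on a 1-10 scale; sample A is skewed toward the low end.
sample_a = [1, 1, 1, 2, 2, 2, 2, 3, 3, 9]
sample_b = [2, 3, 3, 4, 4, 5, 5, 6, 7, 9]

cut_a, labels_a = median_split(sample_a)
cut_b, labels_b = median_split(sample_b)

# The same score of 3 is "high" in sample A but "low" in sample B.
label_of_3_in_a = labels_a[sample_a.index(3)]
label_of_3_in_b = labels_b[sample_b.index(3)]
print(cut_a, label_of_3_in_a)
print(cut_b, label_of_3_in_b)
```

In sample A, a score of 3 and a score of 9 end up in the same "high" group, even though 3 sits near the bottom of the scale.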
The way out of this dilemma is to decide deliberately whether to treat an independent variable as categorical or continuous. Knowing when categorization is appropriate, and understanding how it affects the interpretation of the parameters, lets data analysts find real results they might otherwise miss.
Situations in which categorizing a continuous predictor in a regression model helps:
1. General linear model & continuous or categorical predictor
The general linear model does not care whether a predictor is continuous or categorical. What matters is how the predictor is coded, and as the analyst you should choose the coding based on the information you need from the analysis. Numerical predictors are usually coded with their actual numerical values, while categorical variables are often coded with dummy variables (0 or 1).
Leaving the details of coding schemes aside, if all the values of the predictor are 0 and 1, there is no real information about the distance between them. There is only one fixed distance: the difference between 0 and 1. In this case the model returns a parameter estimate that gives the difference in the means of the response for the two groups.
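A small sketch can make the dummy-coding point concrete. It fits an ordinary least-squares line by hand to a 0/1 predictor and compares the slope with the difference in group means. The data and the helper name `ols_fit` are assumptions made for illustration only.

```python
from statistics import mean

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: a dummy-coded group indicator and a response.
group = [0, 0, 0, 1, 1, 1]
response = [4.0, 5.0, 6.0, 8.0, 9.0, 10.0]

intercept, slope = ols_fit(group, response)
mean_0 = mean(y for g, y in zip(group, response) if g == 0)
mean_1 = mean(y for g, y in zip(group, response) if g == 1)

# With 0/1 coding, the intercept is the mean of group 0 and the
# slope is the difference between the two group means.
print(intercept, slope, mean_1 - mean_0)
```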
2. Predictor with several or many numerical values
However, if the predictor has several or many numerical values, the model returns a parameter estimate that uses the information about the distance between the predictor values: the slope of a line. This slope uses both the horizontal distances (the variation in the predictor) and the vertical distances (the variation in the response).
So when deciding whether to treat a predictor as continuous or categorical, you are really deciding how important it is to include the detailed information about the variation in the predictor's values. As mentioned earlier, median splits create arbitrary groupings that eliminate all of that detail. The situations below also throw away some detail, but they are less arbitrary and can still provide the information needed:
3. Quantitative scale reflects meaningful qualitative differences
A well-tested depression scale might have meaningful cut points that reflect “none”, “mild” and “clinical” depression levels. Similarly, some quantitative variables have natural, meaningful qualitative cut points. Age in a study of retirement planning is one example. People who have reached retirement age and become eligible for pensions, Social Security and Medicare are in a qualitatively different situation from those who have yet to reach that age.
4. Few values of the quantitative variable
Fitting a line may not be appropriate when there are few values of the quantitative variable. Mathematically, a line can always be fit to two values, but it is forced through both points and simply reproduces the two means exactly. So in the example above, if the retirement study compares workers at age 35 with workers at age 55, the line is unlikely to be a true representation of the real relationship. It is better to compare the means of these qualitatively different groups. Three points are better than two, and if the study included ages between 35 and 55, fitting a line would certainly reveal more information.
5. Non-linear relationship
Treating a predictor with five or six values as continuous is fine if the means follow a line. If they do not, treating it as categorical can be much more straightforward. You can learn a lot about the relationship between the variables by seeing which means are higher than others, without spending time and effort fitting a complicated non-linear model.
Returning to the retirement example, suppose there is data for workers aged 30, 35, 40, 45, 50, and 55, and the first four ages have a similar mean amount of retirement planning. At age 50 it starts to go up, and at 55 it jumps much higher. One way to handle this would be to fit some sort of exponential model or step function. Another is to treat age as categorical and see at which age planning increased. Though this approach does not use all the information in the data, it answers some research questions more directly.
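The retirement example can be sketched in a few lines. The planning scores below are hypothetical numbers chosen to mimic the pattern described (flat through age 45, rising at 50, jumping at 55). They are not data from any study. The sketch compares one mean per age with a single straight line fit through all observations.

```python
from statistics import mean

# Hypothetical retirement-planning scores, three workers per age.
planning = {
    30: [9, 10, 11],
    35: [10, 10, 10],
    40: [10, 11, 12],
    45: [9, 10, 11],
    50: [13, 14, 15],
    55: [21, 22, 23],
}

# Age as categorical: one mean per age, no shape imposed.
age_means = {age: mean(vals) for age, vals in planning.items()}

# Age as continuous: a single least-squares line through all observations.
xs = [age for age, vals in planning.items() for _ in vals]
ys = [v for vals in planning.values() for v in vals]
x_bar, y_bar = mean(xs), mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

for age, m in age_means.items():
    fitted = intercept + slope * age
    print(f"age {age}: group mean {m:5.1f}, line predicts {fitted:5.1f}")
```

With these made-up numbers the straight line underestimates planning at 55 and suggests a steady rise across the earlier ages that the group means do not show, which is the kind of pattern the categorical treatment makes visible.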
Analysts should not categorize a variable just to fit the analysis they are comfortable doing; they should do it only when it helps them understand the real relationship between their variables.
Data and Analytics: Don’t Trust Numbers Blindly
Data and analytics have become mainstream. Executives and their boards are increasingly questioning whether their organizations are truly realizing the full value of their insights. A study suggests that 58% of organizations have difficulty evaluating the quality and reliability of their data, which leaves stakeholders with a big question: “Can you trust your data?” On one hand, there are people who are worried about the authenticity of their organizational data, or of the data they intend to use.
On the other hand, you may encounter people who claim that they are happy with their data sets and find their data trustworthy, and that they need no data cleansing, data processing, or help from data management experts. They are not completely wrong in how they feel. But consider the recent reports by Gizmodo, The Independent, the New York Post and others about Facebook’s AI chatbots Bob and Alice, which created their own language and produced lines such as “Balls have zero to me to me”. Such incidents are enough to send chills down your spine.
Investigations into these incidents are ongoing, and the root cause will most likely turn out to be bad data or the absence of a data cleansing process. Don’t get us wrong: we advocate data-driven decisions. However, all this and much more is possible only if your data is in order. When we talk about the trustworthiness of your data, it is appropriateness and accuracy that we are referring to.
Evolving from common sense to data sense
As a society, we have moved away from decisions based on limited information or gut feel toward decisions driven by data and information, where common sense plays a minimal role, if any. The challenge is that although society has evolved, people have not. Businesses and enterprises are still being led by baby boomers who are better suited to hunting mammoths than to making financial decisions based on accurate data and the insights derived from it.
Now that everyone has realized that human judgement in a business context is poor, organizations are increasingly basing decisions on data-driven facts. But is their data trustworthy? Let’s see why they should not trust numbers blindly.
1. Question the data tracking set-up
Believe it or not, a lot of things can go wrong. Even Google Analytics is prone to mistakes, as this discussion on GA data shows. Everything from data collection and data integration to data interpretation and data reporting should be questioned rigorously. For example, events that are not named descriptively, the inclusion of the start date, and many similar details can lead analysts to make errors when calculating results.
2. Question the interpretation of numbers
Yes, numbers can be misinterpreted if the context is not fully understood. A sales manager could spend ages wondering why the conversion rate was not going up even after improvements to the purchase funnel. Only by questioning the numbers would they discover that the sales team had started an acquisition campaign, which brought in a higher volume of visitors who were ‘less qualified’ than before, and hence fewer conversions per visitor.
In the opposite situation, if the conversion rate had skyrocketed, no one would have questioned the positive numbers, and the sales manager would have taken pride in the jump.
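The purchase-funnel case in this section comes down to simple arithmetic on a changing traffic mix. The sketch below uses invented visitor and conversion counts (none of them come from the article) to show how the overall rate can fall even though the funnel itself converts better.

```python
def conversion_rate(segments):
    """Overall conversion rate for a list of (visitors, conversions) segments."""
    visitors = sum(v for v, _ in segments)
    conversions = sum(c for _, c in segments)
    return conversions / visitors

# Hypothetical numbers, for illustration only.
# Before: 1,000 regular visitors, 50 of whom convert (5%).
before = [(1000, 50)]

# After: the improved funnel converts 60 of the same 1,000 visitors (6%),
# but a new campaign adds 2,000 less-qualified visitors with 20 conversions (1%).
after = [(1000, 60), (2000, 20)]

rate_before = conversion_rate(before)
rate_after = conversion_rate(after)
print(f"before: {rate_before:.2%}, after: {rate_after:.2%}")
```

Reporting the rate per traffic segment, rather than only the blended figure, is one way to keep the funnel improvement visible.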
3. Question the success metric
It’s time to move away from the one-size-fits-all belief. A single key success metric for everyone will not work. In the publishing industry, for example, user behavior varies with the device used, and so do the metrics. Publishers do not have the easy task of measuring against a purchase funnel, as in the case above, and it can be challenging to find the right KPIs to track. The most important aspect here is content consumption, which tends to show low performance on mobile and wrongly suggests that there is a problem.
At other times, the metric is neglected or not fully adapted. Setting the same scroll depth target of 75% for both desktop and mobile users is as good as neglecting the metric. In one case, all the left-hand and right-hand elements of the desktop page were stacked one below another under the main content on mobile, so a user who read the full article reached a scroll depth of only 50%.
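One possible way to adapt the scroll-depth metric per device is to measure progress relative to where the article ends on each layout, instead of relative to the whole page. This is a sketch under assumed layout figures: the 80% desktop value and the `article_completion` helper are hypothetical, and only the 75% target and the 50% mobile figure come from the text.

```python
def article_completion(scroll_depth, article_end):
    """Fraction of the article read.

    Both arguments are fractions of total page height: how far the user
    scrolled, and where the article body ends on that layout.
    """
    return min(scroll_depth / article_end, 1.0)

RAW_TARGET = 0.75  # the shared scroll-depth target from the text

# Hypothetical layouts: the article ends at 80% of the desktop page,
# but at 50% of the mobile page because sidebars are stacked below it.
desktop = article_completion(scroll_depth=0.80, article_end=0.80)
mobile = article_completion(scroll_depth=0.50, article_end=0.50)

# A mobile reader who finished the article still misses the raw target.
mobile_meets_raw_target = 0.50 >= RAW_TARGET
print(desktop, mobile, mobile_meets_raw_target)
```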
4. Question the good numbers first
If you see a huge drop in conversion from page 4 to page 5 of the purchase flow, you will obviously check the user experience. But what if conversion goes up between those two pages because users missed a crucial piece of information and all they saw was the “next” button? Numbers tend to make you feel all is well when it is not. Whatever kind of analytics you use, whether descriptive, predictive or prescriptive, it cannot replace the value of regularly watching customers use your services.
5. Question the KPIs
One of the big culprits, thankfully slowly becoming irrelevant, is page views. If page views are set as a target, someone will surely find ways to grow this KPI without any improvement in customer behavior. This is probably how those endless image galleries were born, where each picture counts as a page view, along with articles broken into multiple pages, with no benefit to users.
Don’t take numbers for granted
Check and verify the numbers, both good and bad, as if you were a quality assurance manager testing code. The most glorified numbers are the ones that can damage your business the most. Be critical and challenge the numbers. Make sure to adapt your metrics fully as you enhance your own product. And lastly, do not forget to combine them with qualitative insights.