When to Categorize a Continuous Predictor in a Regression Model?
In many research fields, particularly those that rely mostly on ANOVA, it is common practice to categorize continuous predictor variables. This is often done with a median split: values above the median are labeled “high” and values below it are labeled “low”. However, this is rarely a good idea, for several reasons:
- The median varies from sample to sample, so the categories mean different things in different samples.
- All values on the same side of the median are treated as equivalent. Any variation within a category is ignored, while two values sitting next to each other on either side of the median are treated as different.
- The categorization is arbitrary: a “high” score is not necessarily high. If the scale is skewed, as many are, even a value near the low end can land in the high category.
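The first point can be illustrated with a minimal sketch. The data are hypothetical (simulated right-skewed scores), and the helper name `median_split` is made up for this illustration. Two samples drawn from the same population get different cut points, so a score lying between the two medians is "high" in one sample and "low" in the other.

```python
import random
import statistics

def median_split(values):
    """Return the sample median and a 'high'/'low' label for each value."""
    cut = statistics.median(values)
    return cut, ["high" if v > cut else "low" for v in values]

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical right-skewed scores: two samples from the same population
sample_a = [random.expovariate(1.0) for _ in range(50)]
sample_b = [random.expovariate(1.0) for _ in range(50)]

cut_a, labels_a = median_split(sample_a)
cut_b, labels_b = median_split(sample_b)

# A score halfway between the two sample medians is classified
# differently depending on which sample's median is used.
score = (cut_a + cut_b) / 2
label_in_a = "high" if score > cut_a else "low"
label_in_b = "high" if score > cut_b else "low"

print("median of sample A:", cut_a)
print("median of sample B:", cut_b)
print("same score is", label_in_a, "in A and", label_in_b, "in B")
```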
The way out of this dilemma is to decide deliberately whether to treat an independent variable as categorical or continuous. Knowing when categorizing is appropriate, and understanding how it affects the interpretation of the parameters, lets data analysts find real results they might otherwise miss.
The points below cover how the model uses each kind of predictor and the situations in which categorizing a continuous predictor in a regression model is beneficial:
1. General linear model & continuous or categorical predictor
The general linear model itself does not care whether a predictor is continuous or categorical. As the data analyst, though, you choose what information you get from the analysis through the way you code the predictor. Numerical predictors are usually coded with their actual numerical values, while categorical variables are often coded with dummy variables that take the value 0 or 1.
Without getting into the details of coding schemes: when all the values of the predictor are 0 and 1, they carry no real information about the distance between them. There is only one fixed distance, the difference between 0 and 1. In this case the model returns a parameter estimate that gives the difference in the mean of the response between the two groups.
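As a small sketch of this point, the following fits an ordinary least-squares line to a 0/1 predictor. The response values are made up, and `ols_fit` is a hypothetical helper written for this example. Under these assumptions the slope comes out equal to the difference between the two group means, and the intercept equals the mean of the group coded 0.

```python
def ols_fit(x, y):
    """Simple least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: dummy-coded group membership and a response
group = [0, 0, 0, 0, 1, 1, 1, 1]
response = [10.0, 12.0, 11.0, 13.0, 15.0, 17.0, 16.0, 18.0]

intercept, slope = ols_fit(group, response)

# Group means computed directly, for comparison with the fit
mean_0 = sum(r for g, r in zip(group, response) if g == 0) / group.count(0)
mean_1 = sum(r for g, r in zip(group, response) if g == 1) / group.count(1)

print("intercept:", intercept, "| mean of group 0:", mean_0)
print("slope:", slope, "| difference in means:", mean_1 - mean_0)
```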
2. Predictor with several or many numerical values
However, if the predictor has several or many numerical values, the model returns a parameter estimate that uses the information about the distance between the predictor values: the slope of a line. The slope uses both the horizontal distances (the variation in the predictor) and the vertical distances (the variation in the response).
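To see how the slope depends on the horizontal distances, here is a minimal sketch with made-up numbers. The function `slope` is a hypothetical helper using the usual least-squares formula (sum of cross-products divided by the sum of squared deviations of the predictor). The response is identical in both fits; only the spacing of the predictor values changes.

```python
def slope(x, y):
    """Least-squares slope: cross-products over squared deviations of x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical response, identical in both fits
y = [2.0, 4.0, 6.0, 8.0]

# Same ordering of predictor values, different distances between them
x_close = [1, 2, 3, 4]      # values one unit apart
x_spread = [10, 20, 30, 40]  # values ten units apart

slope_close = slope(x_close, y)
slope_spread = slope(x_spread, y)

print("slope with unit spacing:", slope_close)
print("slope with ten-unit spacing:", slope_spread)
```

A 0/1 dummy variable has no such spacing to vary, which is why its coefficient reduces to a difference in means.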
So when deciding whether to treat a predictor as continuous or categorical, you are really deciding how important it is to include the detailed information about the variation in the predictor's values. As mentioned earlier, median splits create arbitrary groupings that eliminate that detail. The situations listed below also throw away some detail, but they are less arbitrary and can still provide all the information needed:
3. Quantitative scale reflects meaningful qualitative differences
A well-tested depression scale might have meaningful cut points that reflect “non”, “mild”, and “clinical” depression levels. Similarly, some quantitative variables have natural, meaningful qualitative cut points. Age in a study of retirement planning is one example: people who have reached retirement age and are eligible for pensions, Social Security, and Medicare are in a qualitatively different situation from those who have not yet reached that age.
4. Few values of the quantitative variable
Fitting a line may not be appropriate when the quantitative variable takes only a few values. Mathematically, two values are enough to fit a line, but the line is forced through both points and simply reproduces the two means exactly. So, continuing the example above, if the retirement study compares workers at age 35 to workers at age 55, the line is unlikely to be a true representation of the real relationship. It is better to compare the means of these qualitatively different groups. Three points are better than two, and if the study included ages between 35 and 55, fitting a line would reveal more information.
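The two-value case can be checked with a short sketch. The planning scores below are invented for illustration, and the variable names are assumptions. With only ages 35 and 55 in the data, the fitted line passes exactly through the two group means, so it says nothing about what happens between those ages.

```python
# Hypothetical retirement-planning scores for workers aged 35 and 55
ages = [35, 35, 35, 35, 55, 55, 55, 55]
planning = [2.0, 3.0, 2.5, 3.5, 6.0, 7.0, 8.0, 7.0]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(planning) / n

# Least-squares slope and intercept
sxy = sum((a - mean_x) * (p - mean_y) for a, p in zip(ages, planning))
sxx = sum((a - mean_x) ** 2 for a in ages)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Group means computed directly
mean_35 = sum(p for a, p in zip(ages, planning) if a == 35) / ages.count(35)
mean_55 = sum(p for a, p in zip(ages, planning) if a == 55) / ages.count(55)

# Fitted values of the line at the two observed ages
fitted_35 = intercept + slope * 35
fitted_55 = intercept + slope * 55

print("age 35: group mean", mean_35, "| fitted", fitted_35)
print("age 55: group mean", mean_55, "| fitted", fitted_55)
```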
5. Non-linear relationship
Treating 5 or 6 values of the predictor as continuous is fine if the response follows a line across them. If it does not, treating the predictor as categorical can be much more straightforward. You can learn a lot about the relationship between the variables by looking at which group means are higher than others, without spending time and effort fitting a complicated non-linear model.
Take the same retirement example and suppose there are data for workers aged 30, 35, 40, 45, 50, and 55. The first four ages have a similar mean amount of retirement planning, but at age 50 it starts to go up, and at 55 it jumps much higher. One way to handle this is to fit some sort of exponential model or step function. Another is to treat age as categorical and see at which age planning increases. Although this approach does not use all the information in the data, it answers some research questions more directly.
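A rough sketch of the categorical approach, using invented scores that follow the pattern just described (flat through age 45, rising at 50, jumping at 55). All numbers and names are assumptions for illustration. It prints the mean for each age next to the value a single straight line would predict, so the age at which planning increases, and the misfit of the line, are both visible.

```python
from statistics import mean

# Hypothetical planning scores, three workers per age
planning = {
    30: [2.0, 2.2, 1.8],
    35: [2.1, 1.9, 2.0],
    40: [2.0, 2.3, 2.0],
    45: [1.9, 2.1, 2.0],
    50: [3.0, 3.2, 2.8],
    55: [6.0, 6.5, 5.5],
}

# Age as categorical: one mean per age group
group_means = {age: mean(scores) for age, scores in planning.items()}

# Age as continuous: one least-squares line through all observations
ages = [age for age, scores in planning.items() for _ in scores]
values = [s for scores in planning.values() for s in scores]
mean_x, mean_y = mean(ages), mean(values)
slope = (sum((a - mean_x) * (v - mean_y) for a, v in zip(ages, values))
         / sum((a - mean_x) ** 2 for a in ages))
intercept = mean_y - slope * mean_x

# Gap between each group mean and the straight-line prediction
gaps = {age: group_means[age] - (intercept + slope * age) for age in planning}

for age in planning:
    print(age, "group mean:", round(group_means[age], 2),
          "| line:", round(intercept + slope * age, 2),
          "| gap:", round(gaps[age], 2))
```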
Analysts should not categorize a variable just to fit the analysis they are comfortable doing. They should do it because it helps them understand the real relationship between their variables.