The usual explanation for why we need to omit one of the levels of a categorical variable when using dummy coding relies on the mathematics of linear regression: including all the levels of the categorical variable along with the intercept results in perfect multicollinearity, and thus the parameters of the regression model cannot be uniquely estimated.
However, there is a deeper reason why we need to omit a level when we use dummy coding. Let us consider a specific context: Suppose that we model the relationship between sales (\(S\)) and advertising spend (\(A\)) for a cereal manufacturer as a linear regression model:
$$ S = \beta_0 + \beta_1 A + \epsilon$$
We can treat advertising spend as a continuous variable and thus can estimate both \(\beta_0\) and \(\beta_1\) in the above model. Here \(\beta_0\) represents the expected sales when we do not spend anything on advertising, and \(\beta_1\) represents the incremental sales from every additional dollar spent on advertising.
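As a rough illustration of this interpretation, here is a minimal sketch that fits the advertising model by least squares. It assumes NumPy is available, and the data are made up (noise-free, generated from an intercept of 50 and a slope of 2) purely so the recovered values are easy to read.

```python
import numpy as np

# Hypothetical, noise-free data: sales = 50 + 2 * advertising.
advertising = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
sales = 50.0 + 2.0 * advertising

# Design matrix: a column of ones (intercept) plus the advertising column.
X = np.column_stack([np.ones_like(advertising), advertising])

# Ordinary least squares fit of S = beta_0 + beta_1 * A.
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
beta_0, beta_1 = beta

print(f"beta_0 (sales at zero advertising): {beta_0:.2f}")
print(f"beta_1 (extra sales per extra dollar): {beta_1:.2f}")
```

Because advertising spend can actually be zero, the intercept corresponds to a situation that exists in the data.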
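Now consider a model where we want to estimate the impact of package size on sales. To be concrete, let us assume that the cereal box comes in three sizes: small, medium and large. Using dummy variable coding, we would have three dummy variables for the three package sizes (denoted by \(S\), \(M\) and \(L\) for small, medium and large respectively), each equal to \(1\) when the box is of that size and \(0\) otherwise. Note that \(S\) on the right-hand side of the model below is the small-package dummy, while \(S\) on the left-hand side is still sales. The model is: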
$$ S = \beta_0 + \beta_s S + \beta_m M + \beta_l L + \epsilon$$
We have a problem if we want to understand the role played by \(\beta_0\) in the above model. The three dummy variables \(S\), \(M\) and \(L\) are never simultaneously \(0\), because every box has exactly one of the three sizes. Therefore, we cannot interpret \(\beta_0\) in any meaningful way. The key issue is that categorical variables do not have a natural ‘zero point’ and thus, unlike the advertising model above, the intercept cannot be interpreted as the sales we would obtain when package size is ‘zero’.
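The problem with the full set of dummies can be checked numerically. The sketch below assumes NumPy is available and uses a small made-up sample of package sizes; the two coefficient vectors are hypothetical values chosen only to show that different parameters produce identical predictions.

```python
import numpy as np

# Hypothetical sample of package sizes.
sizes = np.array(["small", "medium", "large", "small", "medium", "large"])

# One dummy per level: 1 if the box has that size, 0 otherwise.
small = (sizes == "small").astype(float)
medium = (sizes == "medium").astype(float)
large = (sizes == "large").astype(float)

# No observation has all three dummies equal to zero.
has_all_zero_row = bool(np.any(small + medium + large == 0))

# Intercept plus all three dummies: the intercept column equals the
# sum of the dummy columns, so the matrix is rank deficient.
X_full = np.column_stack([np.ones(len(sizes)), small, medium, large])
rank_full = int(np.linalg.matrix_rank(X_full))

# Two different (beta_0, beta_s, beta_m, beta_l) vectors, same predictions.
beta_a = np.array([0.0, 100.0, 120.0, 150.0])
beta_b = np.array([100.0, 0.0, 20.0, 50.0])
same_predictions = bool(np.allclose(X_full @ beta_a, X_full @ beta_b))

print("any row with all dummies zero:", has_all_zero_row)
print("columns:", X_full.shape[1], "rank:", rank_full)
print("different parameters, same predictions:", same_predictions)
```

Since both parameter vectors describe the data equally well, the data alone cannot tell us what \(\beta_0\) is, which is the same difficulty seen from the estimation side.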
Suppose that we omit the small package size from our model. Then the model changes to:
$$ S = \beta_0 + \beta_m M + \beta_l L + \epsilon$$
In the above model, \(M\) and \(L\) are simultaneously \(0\) when the package size is small. Thus, \(\beta_0\) has a meaningful interpretation: it is the sales we obtain when package size is small. The remaining parameters can also be interpreted. For example, \(\beta_m\) can be interpreted as the incremental sales we get when we change the package size from ‘small’ to ‘medium’. By omitting the small package size, we are effectively setting the zero point for package size as small and letting the intercept capture the impact on sales when the package size is small.
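The interpretation above can be illustrated with a minimal sketch, again assuming NumPy and using made-up sales figures. With the small size omitted, the fitted intercept matches the average sales of small boxes, and each remaining coefficient matches the difference between that level's average and the small average.

```python
import numpy as np

# Hypothetical data: two observations per package size.
sizes = np.array(["small", "small", "medium", "medium", "large", "large"])
sales = np.array([90.0, 110.0, 115.0, 125.0, 140.0, 160.0])

# Dummies for medium and large only; small is the omitted (reference) level.
medium = (sizes == "medium").astype(float)
large = (sizes == "large").astype(float)
X = np.column_stack([np.ones(len(sizes)), medium, large])

# Least squares fit of S = beta_0 + beta_m * M + beta_l * L.
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
beta_0, beta_m, beta_l = beta

# Group averages, for comparison with the fitted coefficients.
mean_small = sales[sizes == "small"].mean()
mean_medium = sales[sizes == "medium"].mean()
mean_large = sales[sizes == "large"].mean()

print(f"beta_0 = {beta_0:.2f}, mean(small) = {mean_small:.2f}")
print(f"beta_m = {beta_m:.2f}, mean(medium) - mean(small) = {mean_medium - mean_small:.2f}")
print(f"beta_l = {beta_l:.2f}, mean(large) - mean(small) = {mean_large - mean_small:.2f}")
```

Choosing a different reference level would change the individual coefficients but not the fitted sales for any package size.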
In summary, the above discussion highlights the fundamental role played by the zero point. When the variable has a natural zero point (e.g., when the variable is continuous), we have no issues. However, categorical variables do not have a natural zero point. Hence, when we use dummy variables, we have to arbitrarily choose one of the levels of the variable as the zero point, which in turn allows us to meaningfully interpret the intercept in the model.