Summary: In this study, we demonstrate that conventional MMM regression parameter estimates can be highly unstable in common situations, especially when the model is misspecified, and therefore need to be supplemented by dedicated stability checks.
Let us say the model reports that National TV contributes 3% of sales; Print, 1%; and Radio, 0.5%. These estimates are to be used for media planning. A week after receiving the model, however, an analyst on the client side discovered that the Price variable used in the model had been calculated incorrectly, and he asked the modeler to redo the model with the correct price variable. The modeler did so, and to their mutual surprise the new model showed TV contributing 2%; Print, 2%; and Radio entering with a negative coefficient, i.e., it should be rejected entirely. How is this possible? In other words, how sensitive are a regression model and its coefficients to the other variables in the model?
It is not a simple question. Ideally, a given coefficient should not depend on its "environment", i.e., on the other factors, unless there is a special synergistic effect of their interaction (we do not consider that here; it is usually not big anyway). In practice, however, coefficients always do depend on the environment, only to varying degrees. Logically, the more stable a coefficient is (i.e., the less it depends on the environment), the more we believe it reflects the real process. The problem is much more serious than modelers usually think. Here are some results of an experiment (see Mandel, 2007, for details):
An artificial data set contained 500 observations and 10 independent random variables X, used to construct the dependent variable Y = 1*X1 + 5*X5 + … + 1000*X1000 + E (the variable name X5 means that its generating coefficient is 5; the error E is normal noise, added in such a way that total determination is around 88%). The generated Y was then estimated through the ordinary regression procedure many times: first, all regressions with one variable were calculated (Y = a1*X1, Y = a5*X5, etc.); then regressions with all pairs (Y = a1*X1 + a5*X5, Y = a1*X1 + a1000*X1000, etc.); then all triples, quadruples, and so on, for about 1,000 equations in total (2^10 = 1,024 subsets).
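The experiment above can be sketched in a few lines (hypothetical code, not the study's original: the intermediate coefficients between 1, 5, and 1000 and the noise scale are illustrative assumptions chosen to give roughly 88% determination):

```python
# Sketch of the subset-regression experiment. ASSUMPTIONS: the exact
# coefficient set between 1, 5, and 1000 is not given in the text, so
# intermediate values here are illustrative; noise is scaled for ~88% R^2.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 500
true_coefs = np.array([1, 5, 10, 20, 40, 80, 150, 300, 600, 1000])
X = rng.standard_normal((n, 10))          # practically uncorrelated factors
signal = X @ true_coefs
noise = rng.standard_normal(n) * signal.std() * 0.37  # ~88% determination
Y = signal + noise

# Fit OLS (no intercept, as in Y = a1*X1 + ...) for every non-empty subset.
estimates = {}  # subset of column indices -> fitted coefficients
for k in range(1, 11):
    for subset in itertools.combinations(range(10), k):
        b, *_ = np.linalg.lstsq(X[:, subset], Y, rcond=None)
        estimates[subset] = b

print(len(estimates))  # 2^10 - 1 = 1023 regressions
```

With every coefficient's estimate stored per subset, one can then tabulate how an estimate drifts as the subset grows, which is exactly the comparison discussed next.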
Conventional logic tells us that the closer the number of variables is to 10, i.e., to the true number used for generation, the better the estimate of each coefficient should be (the "specification rule"). Surprisingly, this is not always true, even when the variables are practically uncorrelated. For example, the best estimate of the coefficient for X40 was 85; three variables behaved non-monotonically; and the precision of one coefficient decreased as the number of factors increased. And this happened when the exact specification of the model was known, i.e., only factors that really affect the outcome were used!
But if we add three factors Z that were not used for data generation but are correlated with Y, the distortion of the coefficients and the violation of the specification rule become much stronger. The charts below show two lines: pink for the average estimate of a coefficient under the exact specification ("true"), and blue for the estimate of the same coefficient when the factors Z are included. The horizontal axis shows the number of variables used in the regression (up to 13: the 10 "real" ones plus the 3 "unused" ones); the vertical axis shows the average deviation of the estimated value from the real one over all regressions run with that number of variables. For example, the graph for X1000 shows that estimation under the exact specification always gives good results: the maximal deviation (when the variable was estimated alone) is about 15%, and it goes to zero as the number of variables approaches 13. But after the Z variables are added, the discrepancy grows very fast, from 15% to 60%. One may say that for this variable the specification rule works for the "real" factors but not at all with the unused ones. All the other graphs support that idea. The variety of patterns is remarkable, but what they have in common is that in no case is there any sign of saturation as the number of factors approaches 10.
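The distortion described above can be reproduced in a small self-contained simulation (hypothetical code, not the study's original: the coefficient set, the noise scale, and the particular way Z is tied to Y are all illustrative assumptions; any factor correlated with Y yet absent from the generating equation plays the same role):

```python
# Sketch of the "correlated but unused" distortion. ASSUMPTIONS: true
# coefficients, noise scale, and the construction of Z are illustrative.
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 500
true_coefs = np.array([1, 5, 10, 20, 40, 80, 150, 300, 600, 1000])
X = rng.standard_normal((n, 10))
Y = X @ true_coefs + rng.standard_normal(n) * 450  # roughly 88% R^2

# Three Z factors: correlated with Y, but not used to generate it.
Z = (Y / Y.std())[:, None] * 300 + rng.standard_normal((n, 3)) * 400

def avg_deviation(pool, target=9, true_value=1000.0):
    """Average relative deviation |b - true| / true of the coefficient on
    pool[:, target] (the X1000 of the text), keyed by the number of
    regressors, over all subsets of that size containing the target."""
    p = pool.shape[1]
    result = {}
    for k in range(1, p + 1):
        devs = []
        for sub in itertools.combinations(range(p), k):
            if target not in sub:
                continue
            b, *_ = np.linalg.lstsq(pool[:, sub], Y, rcond=None)
            devs.append(abs(b[sub.index(target)] - true_value) / true_value)
        if devs:
            result[k] = float(np.mean(devs))
    return result

exact = avg_deviation(X)                   # pool = only the 10 "real" factors
with_z = avg_deviation(np.hstack([X, Z]))  # the same plus the 3 Z factors
```

Plotting `exact` and `with_z` against the number of regressors gives the two-line picture described above: the exact-specification curve settles near zero, while the curve with Z in the pool stays inflated even at the full 13 variables, because the Z factors absorb part of the variance that belongs to the real regressors.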
In other words, when several factors do not actually build the outcome but simply correlate with it for whatever reason, estimation becomes a real mess. But who can guarantee that all factors in a model are "real" in the strong sense used here?
This and other similar results demonstrate a serious flaw of traditional regression modeling when it is implemented without very detailed analysis. We checked the stability of real data for a large CPG company and found that many factors in the model to be built were very unstable, i.e., the traditional MMM methodology, if applied, would be unreliable. There are several ways to minimize that risk, which we use selectively; this is what actually constitutes an advanced form of modeling.
