While helping someone with an OLS model in Python's statsmodels package today, I ran into an interesting statistic: the uncentered R2.
The following example reproduces it:
import numpy as np
import statsmodels.api as sm
x = np.random.random((100, 2))
y = np.random.random((100,))
print(sm.OLS(y, x).fit().summary())
This produces the following output:
OLS Regression Results
=======================================================================================
Dep. Variable: y R-squared (uncentered): 0.586
Model: OLS Adj. R-squared (uncentered): 0.577
Method: Least Squares F-statistic: 69.24
Date: Wed, 30 Jun 2021 Prob (F-statistic): 1.80e-19
Time: 14:45:24 Log-Likelihood: -45.966
No. Observations: 100 AIC: 95.93
Df Residuals: 98 BIC: 101.1
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 0.4958 0.099 4.999 0.000 0.299 0.693
x2 0.3593 0.099 3.630 0.000 0.163 0.556
==============================================================================
Omnibus: 22.867 Durbin-Watson: 1.654
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5.222
Skew: -0.055 Prob(JB): 0.0735
Kurtosis: 1.886 Cond. No. 2.53
==============================================================================
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The R2 is surprisingly high for data that is purely random, but it is labeled uncentered. How does it differ from the ordinary R2?
For a regression model:
$$ y = xb + e,\ \ \hat{y} = xb $$
The ordinary (centered) R2 is defined as follows (the two forms are equal only when the model includes a constant term):
$$ R^2 = \frac{\|\hat{y} - \bar{y}\|^2}{\|y - \bar{y}\|^2 } = 1 - \frac{\|e\|^2}{\|y-\bar{y}\|^2} $$
where $\bar{y}$ is the mean of $y$.
The uncentered R2, by contrast, is defined as:
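To see why the second equality needs a constant term, it helps to expand the centered total sum of squares. This is only a short sketch; $\mathbf{1}$ denotes the all-ones vector, a symbol introduced here for illustration:
$$ \|y - \bar{y}\mathbf{1}\|^2 = \|\hat{y} - \bar{y}\mathbf{1}\|^2 + \|e\|^2 + 2\,e^\top(\hat{y} - \bar{y}\mathbf{1}) $$
OLS always gives $x^\top e = 0$ and hence $e^\top\hat{y} = 0$, so the cross term reduces to $-2\,\bar{y}\,\mathbf{1}^\top e$. When the columns of $x$ include (or span) a constant, $\mathbf{1}^\top e = 0$ as well and the two expressions for $R^2$ coincide. Without one, the residuals generally do not sum to zero and the equality breaks.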
$$ R^2 = 1 - \frac{\|e\|^2}{\|y\|^2} $$
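The two denominators can be compared by hand. The sketch below is not how statsmodels computes things internally; it uses plain NumPy, a seed chosen arbitrarily, and variable names made up for illustration. It fits the no-constant model and plugs the same residuals into both formulas. Since $\|y\|^2 = \|y-\bar{y}\|^2 + n\bar{y}^2$, the uncentered denominator is much larger whenever the mean of $y$ is far from zero, which is what inflates the uncentered value.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
x = rng.random((100, 2))
y = rng.random(100)

# Least-squares fit WITHOUT a constant column
b, *_ = np.linalg.lstsq(x, y, rcond=None)
e = y - x @ b  # residuals

ssr = e @ e                                 # ||e||^2
uncentered_tss = y @ y                      # ||y||^2
centered_tss = np.sum((y - y.mean()) ** 2)  # ||y - mean(y)||^2

r2_uncentered = 1 - ssr / uncentered_tss
r2_centered_formula = 1 - ssr / centered_tss

print(f"uncentered R2:            {r2_uncentered:.3f}")
print(f"centered formula, no const: {r2_centered_formula:.3f}")
```

The uncentered value always lies between 0 and 1 and is never smaller than the value given by the centered formula, because the same residual sum of squares is divided by a larger total.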
Why does the statsmodels.api.OLS example above report the uncentered R2? Because no constant term was supplied. Adding a constant column to the regressors gives the usual R2:
print(sm.OLS(y, sm.add_constant(x)).fit().summary())
This prints the usual statistics:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.027
Model: OLS Adj. R-squared: 0.007
Method: Least Squares F-statistic: 1.339
Date: Wed, 30 Jun 2021 Prob (F-statistic): 0.267
Time: 15:17:56 Log-Likelihood: -24.625
No. Observations: 100 AIC: 55.25
Df Residuals: 97 BIC: 63.07
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.6213 0.086 7.186 0.000 0.450 0.793
x1 -0.0603 0.112 -0.540 0.590 -0.282 0.161
x2 -0.1731 0.109 -1.584 0.116 -0.390 0.044
==============================================================================
Omnibus: 30.571 Durbin-Watson: 1.862
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5.801
Skew: -0.016 Prob(JB): 0.0550
Kurtosis: 1.821 Cond. No. 5.55
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
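The effect of the constant column can also be checked by hand. The following is a minimal NumPy sketch (arbitrary seed, illustrative variable names) that prepends a column of ones, which plays the same role as sm.add_constant(x), and confirms that the two forms of the centered R2 agree once the residuals sum to zero.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
x = rng.random((100, 2))
y = rng.random(100)

# Prepend a column of ones, analogous to sm.add_constant(x)
X = np.column_stack([np.ones(len(y)), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
e = y - y_hat

centered_tss = np.sum((y - y.mean()) ** 2)
r2_explained = np.sum((y_hat - y.mean()) ** 2) / centered_tss  # explained / total
r2_residual = 1 - (e @ e) / centered_tss                       # 1 - residual / total

print(f"sum of residuals: {e.sum():.2e}")
print(f"R2 (explained form): {r2_explained:.6f}")
print(f"R2 (residual form):  {r2_residual:.6f}")
```

With the constant included, both forms give the same number, and it stays within [0, 1].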
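OLS also has a rather confusing hasconst parameter. It does not add a constant term; it tells the model whether the user-supplied $x$ already contains one (by default this is detected automatically). Setting hasconst=False makes the model report the uncentered R2 shown above. But if no constant is supplied and hasconst=True is set anyway: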
print(sm.OLS(y, x, hasconst=True).fit().summary())
the model prints the following:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: -0.491
Model: OLS Adj. R-squared: -0.506
Method: Least Squares F-statistic: -32.28
Date: Wed, 30 Jun 2021 Prob (F-statistic): 1.00
Time: 16:03:51 Log-Likelihood: -45.966
No. Observations: 100 AIC: 95.93
Df Residuals: 98 BIC: 101.1
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 0.4958 0.099 4.999 0.000 0.299 0.693
x2 0.3593 0.099 3.630 0.000 0.163 0.556
==============================================================================
Omnibus: 22.867 Durbin-Watson: 1.654
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5.222
Skew: -0.055 Prob(JB): 0.0735
Kurtosis: 1.886 Cond. No. 2.53
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
That is, the model reports a negative R2! This is because, when it is told a constant is present, statsmodels computes R2 with the following formula:
$$ R^2 = 1 - \frac{\|e\|^2}{\|y-\bar{y}\|^2}$$
If the regression has no constant term, the residual sum of squares can exceed the centered total sum of squares, which makes R2 negative.
Q. E. D.