Uncentered R2 in linear regression


While helping someone with an OLS model from Python's statsmodels package today, I ran into an interesting statistic: the uncentered R2.

The following example reproduces it:

import numpy as np
import statsmodels.api as sm

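# 100 observations, two regressors, all drawn independently from U(0, 1)
x = np.random.random((100, 2))
y = np.random.random((100,))

# No constant column is passed, so the model is fitted without an intercept
print(sm.OLS(y, x).fit().summary())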

The output is:

                                 OLS Regression Results
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):                   0.586
Model:                            OLS   Adj. R-squared (uncentered):              0.577
Method:                 Least Squares   F-statistic:                              69.24
Date:                Wed, 30 Jun 2021   Prob (F-statistic):                    1.80e-19
Time:                        14:45:24   Log-Likelihood:                         -45.966
No. Observations:                 100   AIC:                                      95.93
Df Residuals:                      98   BIC:                                      101.1
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.4958      0.099      4.999      0.000       0.299       0.693
x2             0.3593      0.099      3.630      0.000       0.163       0.556
==============================================================================
Omnibus:                       22.867   Durbin-Watson:                   1.654
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                5.222
Skew:                          -0.055   Prob(JB):                       0.0735
Kurtosis:                       1.886   Cond. No.                         2.53
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

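The R2 is very high (0.586), even though x and y are independent random numbers, and it is labelled uncentered. So how does it differ from the ordinary R2?

For a regression model: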

$$ y = xb + e,\ \ \hat{y} = xb $$

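The ordinary R2 is defined as follows (the equality between the two forms holds only when the model contains a constant term):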

$$ R^2 = \frac{\|\hat{y} - \bar{y}\|^2}{\|y - \bar{y}\|^2 } = 1 - \frac{\|e\|^2}{\|y-\bar{y}\|^2} $$

where $\bar{y}$ is the mean of $y$.

The uncentered R2 is instead defined as:
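A short derivation, added here as a sketch, shows where the constant term is needed. Write $\mathbf{1}$ for the all-ones vector (notation introduced for this note, so that $\bar{y}$ inside a norm means $\bar{y}\mathbf{1}$). The least-squares normal equations give $x^\top e = 0$, hence $\hat{y}^\top e = 0$; if $x$ contains a constant column, then also $\mathbf{1}^\top e = 0$. It follows that:

$$ \|y - \bar{y}\|^2 = \|(\hat{y} - \bar{y}) + e\|^2 = \|\hat{y} - \bar{y}\|^2 + 2\,(\hat{y} - \bar{y}\mathbf{1})^\top e + \|e\|^2 = \|\hat{y} - \bar{y}\|^2 + \|e\|^2 $$

Dividing by $\|y-\bar{y}\|^2$ gives the equality of the two forms. Without a constant column, $\mathbf{1}^\top e$ is generally nonzero, the cross term $-2\bar{y}\,\mathbf{1}^\top e$ survives, and the two expressions differ.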

$$ R^2 = 1 - \frac{\|e\|^2}{\|y\|^2} $$
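To get a feel for why the uncentered version comes out so much larger, here is a minimal numpy-only sketch. It uses `np.linalg.lstsq` as a stand-in for the statsmodels fit; the fixed seed and the variable names are my own assumptions, so the numbers will not match the summary above.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed: an assumption, for reproducibility only
n = 100
x = rng.random((n, 2))           # two regressors, no constant column
y = rng.random(n)

# Least-squares fit without an intercept
b, *_ = np.linalg.lstsq(x, y, rcond=None)
e = y - x @ b                    # residuals

ssr = e @ e                                  # ||e||^2
tss_uncentered = y @ y                       # ||y||^2
tss_centered = np.sum((y - y.mean()) ** 2)   # ||y - ybar||^2

r2_uncentered = 1 - ssr / tss_uncentered

# The two denominators differ by exactly n * ybar^2
print(tss_uncentered, tss_centered + n * y.mean() ** 2)
print(r2_uncentered)
```

Because $\|y\|^2 = \|y-\bar{y}\|^2 + n\bar{y}^2$, the uncentered denominator is inflated whenever the mean of $y$ is far from zero. For U(0, 1) data the mean is about 0.5, so the uncentered denominator is roughly four times the centered one, and fitting the level of $y$ already counts as explained variation.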

Why does the statsmodels.api.OLS example above report the uncentered R2? Because no constant term was supplied. Adding a constant column to the regressors gives the ordinary R2:

print(sm.OLS(y, sm.add_constant(x)).fit().summary())

This prints the ordinary statistics:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.027
Model:                            OLS   Adj. R-squared:                  0.007
Method:                 Least Squares   F-statistic:                     1.339
Date:                Wed, 30 Jun 2021   Prob (F-statistic):              0.267
Time:                        15:17:56   Log-Likelihood:                -24.625
No. Observations:                 100   AIC:                             55.25
Df Residuals:                      97   BIC:                             63.07
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.6213      0.086      7.186      0.000       0.450       0.793
x1            -0.0603      0.112     -0.540      0.590      -0.282       0.161
x2            -0.1731      0.109     -1.584      0.116      -0.390       0.044
==============================================================================
Omnibus:                       30.571   Durbin-Watson:                   1.862
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                5.801
Skew:                          -0.016   Prob(JB):                       0.0550
Kurtosis:                       1.821   Cond. No.                         5.55
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
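As a quick numerical check of the ordinary definition, the sketch below fits the same kind of model with a constant column using plain numpy and confirms that the two forms of R2 agree. The seed and the manual `np.column_stack` (used here in place of `sm.add_constant`) are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed: an assumption, for reproducibility only
x = rng.random((100, 2))
y = rng.random(100)

# Prepend a column of ones so the model has an intercept
X = np.column_stack([np.ones(len(y)), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
e = y - y_hat

tss = np.sum((y - y.mean()) ** 2)
r2_explained = np.sum((y_hat - y.mean()) ** 2) / tss   # explained / total
r2_residual = 1 - (e @ e) / tss                        # 1 - residual / total

print(r2_explained, r2_residual)   # the two forms coincide
print(e.sum())                     # residuals sum to (numerically) zero
```

With the intercept in place the residuals are orthogonal to the constant column, which is what makes the two expressions coincide and keeps R2 between 0 and 1.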

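In addition, OLS has a rather confusing hasconst parameter. It does not add a constant term; it only declares whether the user-supplied $x$ already contains one (by default this is detected automatically). With hasconst=False the model reports the uncentered R2 shown above. But if the user supplies no constant term and still sets hasconst=True: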

print(sm.OLS(y, x, hasconst=True).fit().summary())

the model produces the following output:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                      -0.491
Model:                            OLS   Adj. R-squared:                 -0.506
Method:                 Least Squares   F-statistic:                    -32.28
Date:                Wed, 30 Jun 2021   Prob (F-statistic):               1.00
Time:                        16:03:51   Log-Likelihood:                -45.966
No. Observations:                 100   AIC:                             95.93
Df Residuals:                      98   BIC:                             101.1
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.4958      0.099      4.999      0.000       0.299       0.693
x2             0.3593      0.099      3.630      0.000       0.163       0.556
==============================================================================
Omnibus:                       22.867   Durbin-Watson:                   1.654
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                5.222
Skew:                          -0.055   Prob(JB):                       0.0735
Kurtosis:                       1.886   Cond. No.                         2.53
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The model reports a negative R2! That happens because, once it believes a constant is present, statsmodels computes R2 with the following formula:

$$ R^2 = 1 - \frac{\|e\|^2}{\|y-\bar{y}\|^2}$$

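If the regression has no constant term, the residual sum of squares $\|e\|^2$ can exceed the total sum of squares about the mean $\|y-\bar{y}\|^2$, which makes R2 negative.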
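A tiny hand-checkable case makes this concrete. The three data points below are hypothetical, chosen so that a line through the origin fits badly; the fit uses plain numpy rather than statsmodels.

```python
import numpy as np

# Hypothetical data: y rises while x falls, and there is no intercept to compensate
x = np.array([[3.0], [2.0], [1.0]])
y = np.array([1.0, 2.0, 3.0])

b, *_ = np.linalg.lstsq(x, y, rcond=None)    # slope through the origin: 10/14 = 5/7
e = y - x @ b                                # residuals: [-8/7, 4/7, 16/7]

ssr = e @ e                                  # 48/7
tss_centered = np.sum((y - y.mean()) ** 2)   # 2
tss_uncentered = y @ y                       # 14

r2_centered_formula = 1 - ssr / tss_centered   # 1 - 24/7  = -17/7
r2_uncentered = 1 - ssr / tss_uncentered       # 1 - 24/49 = 25/49

print(r2_centered_formula, r2_uncentered)
```

Simply predicting the mean ($\hat{y} = 2$) would leave a residual sum of squares of 2, while the fit through the origin leaves 48/7. The centered formula therefore goes negative, even though the uncentered R2 for the same fit is still about 0.51.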

Q. E. D.
