Uncentered R2 in linear regression

Today, while helping someone look at an OLS model from Python's statsmodels package, I ran into an interesting statistic: the uncentered R2.

The following example reproduces it:

import numpy as np
import statsmodels.api as sm

x = np.random.random((100, 2))
y = np.random.random((100,))

print(sm.OLS(y, x).fit().summary())

The output is:

                                 OLS Regression Results
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):                   0.586
Model:                            OLS   Adj. R-squared (uncentered):              0.577
Method:                 Least Squares   F-statistic:                              69.24
Date:                Wed, 30 Jun 2021   Prob (F-statistic):                    1.80e-19
Time:                        14:45:24   Log-Likelihood:                         -45.966
No. Observations:                 100   AIC:                                      95.93
Df Residuals:                      98   BIC:                                      101.1
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.4958      0.099      4.999      0.000       0.299       0.693
x2             0.3593      0.099      3.630      0.000       0.163       0.556
==============================================================================
Omnibus:                       22.867   Durbin-Watson:                   1.654
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                5.222
Skew:                          -0.055   Prob(JB):                       0.0735
Kurtosis:                       1.886   Cond. No.                         2.53
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We get a very high R2, but it is labeled uncentered. So how does it differ from the ordinary R2?

For a regression model:

$y = xb + e,\ \ \hat{y} = xb$

The ordinary (centered) R2 is defined as follows (the two expressions are equal only when the model has a constant term):

$R^2 = \frac{\|\hat{y} - \bar{y}\|^2}{\|y - \bar{y}\|^2 } = 1 - \frac{\|e\|^2}{\|y-\bar{y}\|^2}$

where $\bar{y}$ is the mean of $y$.
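The reason the equality needs a constant term can be sketched in a few lines. The notation $\mathbf{1}$ for the all-ones vector is introduced here and is not in the original; $\bar{y}$ is treated as a scalar multiplying it. Since $y - \bar{y}\mathbf{1} = (\hat{y} - \bar{y}\mathbf{1}) + e$, expanding the squared norm gives:

$\|y - \bar{y}\mathbf{1}\|^2 = \|\hat{y} - \bar{y}\mathbf{1}\|^2 + \|e\|^2 + 2 e^\top \hat{y} - 2 \bar{y}\, e^\top \mathbf{1}$

Least squares always gives $x^\top e = 0$, hence $e^\top \hat{y} = 0$. The last term vanishes only if the residuals sum to zero, $e^\top \mathbf{1} = 0$, which is guaranteed when the constant vector lies in the column space of $x$, for example when $x$ contains a constant column. Without that, the decomposition and therefore the equality of the two expressions can fail.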

The uncentered R2, by contrast, is defined as:

$R^2 = 1 - \frac{\|e\|^2}{\|y\|^2}$
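To make the difference concrete, here is a minimal sketch that computes both quantities by hand for a fit without an intercept, using only NumPy. The seed, the sample size of 1000, and all variable names are choices made for this illustration and are not taken from the example above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
n = 1000
x = rng.random((n, 2))  # two regressors, no constant column
y = rng.random(n)       # drawn independently of x

# Least-squares fit without an intercept.
b, *_ = np.linalg.lstsq(x, y, rcond=None)
e = y - x @ b

ssr = e @ e                                  # ||e||^2
uncentered_tss = y @ y                       # ||y||^2
centered_tss = np.sum((y - y.mean()) ** 2)   # ||y - mean(y)||^2

r2_uncentered = 1 - ssr / uncentered_tss
r2_centered = 1 - ssr / centered_tss

print(f"uncentered R2: {r2_uncentered:.3f}")
print(f"centered R2:   {r2_centered:.3f}")
```

The two denominators differ by exactly n times the squared mean of y, so the uncentered R2 is never smaller than the centered one for the same residuals. When y has a mean far from zero, as it does for uniform draws on [0, 1), most of the uncentered R2 simply reflects that mean.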

Why does the statsmodels.api.OLS example above report the uncentered R2? Because no constant term was supplied. Add a constant column to the regressors and you get the usual R2:

print(sm.OLS(y, sm.add_constant(x)).fit().summary())

This prints the usual statistics:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.027
Model:                            OLS   Adj. R-squared:                  0.007
Method:                 Least Squares   F-statistic:                     1.339
Date:                Wed, 30 Jun 2021   Prob (F-statistic):              0.267
Time:                        15:17:56   Log-Likelihood:                -24.625
No. Observations:                 100   AIC:                             55.25
Df Residuals:                      97   BIC:                             63.07
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.6213      0.086      7.186      0.000       0.450       0.793
x1            -0.0603      0.112     -0.540      0.590      -0.282       0.161
x2            -0.1731      0.109     -1.584      0.116      -0.390       0.044
==============================================================================
Omnibus:                       30.571   Durbin-Watson:                   1.862
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                5.801
Skew:                          -0.016   Prob(JB):                       0.0550
Kurtosis:                       1.821   Cond. No.                         5.55
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

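Beyond that, OLS also has a rather confusing hasconst parameter. It does not add a constant term; it tells the model whether the user-supplied $x$ already contains one (by default this is detected automatically). With hasconst=False, the model reports the uncentered R2 shown above. But if the user supplies no constant term and still sets hasconst=True: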

print(sm.OLS(y, x, hasconst=True).fit().summary())

the model prints the following:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                      -0.491
Model:                            OLS   Adj. R-squared:                 -0.506
Method:                 Least Squares   F-statistic:                    -32.28
Date:                Wed, 30 Jun 2021   Prob (F-statistic):               1.00
Time:                        16:03:51   Log-Likelihood:                -45.966
No. Observations:                 100   AIC:                             95.93
Df Residuals:                      98   BIC:                             101.1
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.4958      0.099      4.999      0.000       0.299       0.693
x2             0.3593      0.099      3.630      0.000       0.163       0.556
==============================================================================
Omnibus:                       22.867   Durbin-Watson:                   1.654
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                5.222
Skew:                          -0.055   Prob(JB):                       0.0735
Kurtosis:                       1.886   Cond. No.                         2.53
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

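In other words, the model reports a negative R2! This is because, when told that a constant is present, statsmodels computes R2 with the following formula: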

$R^2 = 1 - \frac{\|e\|^2}{\|y-\bar{y}\|^2}$

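If the regression has no constant term, the residual sum of squares $\|e\|^2$ can exceed the total sum of squares about the mean $\|y-\bar{y}\|^2$, which gives an R2 below 0.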

Q. E. D.

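Went out with a 绿野 (Lvye) group. We set off at 6:30 pm, entered through the park's east gate, took the trail straight up to 鬼笑石 (Guixiaoshi), headed south to 陈家大院 (Chenjia Dayuan), on to 翠微绝顶 (the Cuiwei summit), then followed the wall at 八大处 (Badachu) to 香界寺 (Xiangjie Temple) and 天书 (Tianshu), and walked all the way back to 鬼笑石. After taking in the view we descended to the east gate and went home. Total distance was a little over 12 km, with 670 m of ascent. Some sections are fairly difficult at night.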