diff --git a/Notebooks/Lecture9.ipynb b/Notebooks/Lecture9.ipynb deleted file mode 100644 index 07c014ffcf58e0c89c5918f8a4c568db7c0a4a16..0000000000000000000000000000000000000000 --- a/Notebooks/Lecture9.ipynb +++ /dev/null @@ -1 +0,0 @@ -{"cells":[{"cell_type":"markdown","metadata":{"collapsed":false,"id":"D89B0EAB542642F98900CCC4D4E49360","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["# Lecture 9: linear model and model diagnostics \n","\n","## Instructor: 胡传鹏(博士)[Dr. Hu Chuan-Peng]\n","\n","### 南京师范大学心理学院[School of Psychology, Nanjing Normal University]\n"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"D6CED1AC4FE848C38E69C72D1B2684FE","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["# Part 1: Linear Regression Model"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"38B25D21F42C45998BD273A3AD1F6A8E","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["### 实例:\n","社交网站的使用可能在知识共享和个体创新行为中起到了关键作用,而个体在使用社交网站时往往能获得在线社会支持,可以帮助个体提高对自身创新行为产生的能力和信心的评估——即创新自我效能感。\n","\n","假设我们将探究大学生的社交网站使用和创新的自我效能感对个体创新行为的关系。\n","\n","`SNS_t`: 社交网站使用强度;\n","`CSES_t`: 创新自我效能感;\n","`EIB_t`: 创新行为\n","\n","数据来源:郑元瑞, 谢嘉敏, 李鹏. (2022) 社交网站使用强度对大学生创新行为的影响:一个有调节的中介模型. 
心理技术与应用, 10(8), 483-491."]},{"cell_type":"code","execution_count":2,"metadata":{"collapsed":false,"id":"AFF5BF9CCB4342369F9B5E6E36A594D0","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stderr","output_type":"stream","text":["WARNING (theano.link.c.cmodule): install mkl with `conda install mkl-service`: No module named 'mkl'\n"]}],"source":["%matplotlib inline\n","import numpy as np \n","from scipy import stats\n","import matplotlib.pyplot as plt\n","import pandas as pd\n","import arviz as az\n","import pymc3 as pm\n","# mpl_toolkits.mplot3d是用于三维画图的,Axes3D用于画三维图\n","from mpl_toolkits.mplot3d import Axes3D"]},{"cell_type":"code","execution_count":2,"metadata":{"collapsed":false,"id":"9605159BD4814F2E8833134CA381C22E","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["np.random.seed(123) # 随机数种子,确保随后生成的随机数相同\n","data = pd.read_csv(\"/home/mw/input/data8017/clean.csv\") # 读取数据,该数据需要提前挂载\n","data['SNS_t'] = (data['SNS_t'] - data['SNS_t'].mean()) / data['SNS_t'].std() # 将变量进行标准化\n","data['CSES_t'] = (data['CSES_t'] - data['CSES_t'].mean()) / data['CSES_t'].std() # 将变量进行标准化\n","data['EIB_t'] = (data['EIB_t'] - data['EIB_t'].mean()) / data['EIB_t'].std() # 将变量进行标准化"]},{"cell_type":"code","execution_count":3,"metadata":{"collapsed":false,"id":"B83ADDBCBE4146269D21B23083461386","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stderr","output_type":"stream","text":["/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:16: MatplotlibDeprecationWarning: Axes3D(fig) adding itself to the figure is deprecated since 3.4. Pass the keyword argument auto_add_to_figure=False and use fig.add_axes(ax) to suppress this warning. 
The default value of auto_add_to_figure will change to False in mpl3.5 and True values will no longer work in 3.6. This is consistent with other Axes classes.\n"," app.launch_new_instance()\n"]},{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"},{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["#创建一个画布fig1,该画布有两幅图ax1,ax2,画布尺寸为(10,4)\n","fig1,ax1 = plt.subplots(1,2,figsize=(10,4))\n","\n","#在第一张图的第一个panel上画散点图\n","ax1[0].scatter(data['SNS_t'],data['EIB_t'] )\n","ax1[0].set_xlabel('SNS_t')\n","ax1[0].set_ylabel('EIB_t')\n","\n","#在第一张图的第二个panel上画散点图\n","ax1[1].scatter(data['CSES_t'],data['EIB_t'] )\n","ax1[1].set_xlabel('CSES_t')\n","ax1[1].set_ylabel('EIB_t')\n","\n","#在第二张图(三维空间)画散点图\n","fig2 = plt.figure(2)\n","ax2 = Axes3D(fig2)\n","ax2.scatter(data['SNS_t'],data['CSES_t'],data['EIB_t'] )\n","ax2.set_xlabel('SNS_t')\n","ax2.set_ylabel('CSES_t')\n","ax2.set_zlabel('EIB_t')\n","plt.show()"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"2E0576B77E3243BF86B5FF4DE3914ED4","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["可使用线性模型来探索`创新自我效能感`和`社交网站使用强度`对`个体创新行为`的关系。我们可以使用`PyMC3`来建立该方程。\n","\n","之所以选用线性模型,是因为线性模型是最简单的模型。\n","\n","我们需要注意的是,线性模型对该数据的解释力可能并没有那么高。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"27D121C21C28458ABF0C226A81124513","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["在PyMC3中,一个简单的线性模型:\n","\n","1. 通过建立线性模型的概率表达形式来构建模型。\n","\n","2. 通过PyMC3对后验进行采样\n","\n","3. 
通过Arviz对结果进行展示,辅助统计推断"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"3BCD734597754FA295CB25508E37AB08","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["线性模型的一般形式:\n","\n","$y_i = a + b_1 x_1 + b_2 x_2 + \\epsilon$\n","\n","$y_i = \\hat{y} + \\epsilon$\n","\n","线性模型可以用概率的形式进行表达:\n","\n","$y_i \\sim Normal(\\mu_i, \\sigma)$ -> y_obs $\\sim$ Normal(mu, sigma)\n","\n","$\\mu_i = \\alpha+\\beta_1 * x_1 +\\beta_2 * x_2$ -> mu = alpha + beta[0] * x1 + beta[1] * x2\n","\n","$\\alpha \\sim Normal(0,1)$ -> alpha $\\sim$ Normal(0, 1)\n","\n","$\\beta_{i} \\sim Normal(0,1)$ -> beta $\\sim$ Normal(0, 1)\n","\n","$\\sigma \\sim HalfNormal(1)$ -> sigma $\\sim$ HalfNormal(1)\n","\n","\n","\n"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"594A90953A08482E9017EC6EDAB47BB4","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 模型构建"]},{"cell_type":"code","execution_count":49,"metadata":{"collapsed":false,"id":"5AA60A18E48C496E86417AE5077DE23B","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["# 在pymc3中,pm.Model()定义了一个新的模型对象,这个对象是模型中随机变量的容器\n","# 在python中,容器是一种数据结构,是用来管理特殊数据的对象\n","# with语句定义了一个上下文管理器,以 linear_model 作为上下文,在这个上下文中定义的变量都被添加到这个模型\n","with pm.Model() as linear_model:\n"," # 先验分布: alpha, beta, sigma这三个参数是随机变量\n"," alpha = pm.Normal('alpha',mu=0,sd=1)\n"," beta = pm.Normal('beta',mu=0,sd=1, shape=2) # shape 用来表示同样的分布的参数的数量\n"," sigma = pm.HalfNormal('sigma',sd=1)\n","\n"," # x1,x2为自变量,是之前已经载入的数据\n"," x1 = pm.Data(\"x1\", data['SNS_t'])\n"," x2 = pm.Data(\"x2\", data['CSES_t'])\n","\n"," # 线性模型:mu是确定性随机变量,这个变量的值完全由右端值确定\n"," mu = pm.Deterministic(\"mu\", alpha + beta[0]*x1 + beta[1]*x2) \n","\n"," # Y的观测值,这是一个特殊的观测随机变量,表示模型数据的可能性。也可以表示模型的似然,通过 observed 
参数来告诉这个变量其值是已经被观测到了的,不会被拟合算法改变\n"," y_obs = pm.Normal('y_obs',mu=mu,sd=sigma,observed=data['EIB_t'] )\n"]},{"cell_type":"code","execution_count":50,"metadata":{"collapsed":false,"id":"BDBAE07F4A664DFB88CB24EA654DCEB2","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":[""],"text/plain":[""]},"execution_count":50,"metadata":{},"output_type":"execute_result"}],"source":["pm.model_to_graphviz(linear_model)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"B4C51AB380284A11A4DC84B0360719E1","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["构建模型之后,我们可以进行模型拟合\n","\n","获得模型中未知变量的后验估计。考虑两个方法:\n","(1)使用优化方法找到参数的极大后验估计点(maximum a posteriori(MAP))\n","\n","我们也可以使用极大后验估计 MAP(pm.find_MAP)来进行优化。\n","\n","(2)使用MCMC采样方法获得后验分布来计算。\n","\n","在本例中,我们使用MCMC来进行采样,函数为pm.sample(),"]},{"cell_type":"code","execution_count":51,"metadata":{"collapsed":false,"id":"8B541AFB45C04C4E8A3DDCB827C91A5B","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stderr","output_type":"stream","text":["Auto-assigning NUTS sampler...\n","Initializing NUTS using jitter+adapt_diag...\n","Multiprocess sampling (2 chains in 2 jobs)\n","NUTS: [sigma, beta, alpha]\n"]},{"data":{"text/html":["\n","\n"],"text/plain":[""]},"metadata":{},"output_type":"display_data"},{"name":"stderr","output_type":"stream","text":["Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 5 seconds.\n"]}],"source":["#采样过程仍在该容器中进行\n","with linear_model :\n"," # 使用mcmc方法进行采样,draws为采样次数,tune为调整采样策略的次数,这些次数将在采样结束后被丢弃,\n"," # target_accept为接受率, return_inferencedata=True为该函数返回的对象是arviz.InnferenceData对象\n"," # chains为我们采样的链数,cores为我们的调用的cpu数,多个链可以在多个cpu中并行计算,我们在和鲸中调用的cpu数为2\n"," trace = 
pm.sample(draws = 2000, tune=1000, target_accept=0.9,chains=2, cores= 2,return_inferencedata=True)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"593F41C7DFED4E3484379DF571410E2A","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["该模型结构可以用可视化的形式进行表达。\n","\n","可以通过PyMC3自带的可视化工具将模型关系可视化。\n","\n","x1,x2为自变量。\n","\n","参数 $\\alpha$ 是线性模型的截距,而 $\\beta_1$,$\\beta_2$ 是斜率。\n","\n","参数$\\sigma$是残差的标准差,因变量为$y$。\n","\n","模型图展示了各参数通过怎样的关系影响到因变量。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"086FCE65863642818F65F37CC1C2ED9A","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 参数的后验分布\n","这里的模型分析结果展示了各参数的分布(后验)情况"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"C80ABDA204E5475D8DD43175BCCC39ED","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["ArviZ的plot_trace会展示左右两幅图:\n","\n","左图描述了采样的各个马尔可夫链参数的后验分布,左图的横坐标为后验分布的取值范围,纵轴为概率密度。如果有两个参数在同一幅图中,就会用不同颜色的曲线表示\n","\n","右图描述了采样的各个马尔可夫链的参数的采样情况,右图的横坐标为采样次数,纵坐标为每一次采样的值"]},{"cell_type":"code","execution_count":52,"metadata":{"collapsed":false,"id":"574470394C65497C8E033E23521352D4","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/plain":["array([[,\n"," ],\n"," [,\n"," ],\n"," [,\n"," ]], dtype=object)"]},"execution_count":52,"metadata":{},"output_type":"execute_result"},{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["# 绘制特定参数的采样情况,选取对象为trace,选取其中'alpha','beta','sigma'三个参数\n","az.plot_trace(trace,var_names=['alpha','beta','sigma'])"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"1767EC8EDF53443C837769B586B6818D","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["ArviZ的plot_forest会展示森林图\n","\n","该图会展示不同后验分布的94%的可信区间,横轴为后验分布的取值范围,纵轴为不同参数。\n","\n","对于每一个参数,粗线展示了采样的各马尔可夫链的94%的可信区间,细线展示了参数的取值范围。\n","\n","若r_hat=True,右图会展示r_hat的值,该值通过比较模型参数的链间和链内估计值反映了采样的收敛性,r-hat值越接近1则收敛情况越好。"]},{"cell_type":"code","execution_count":53,"metadata":{"collapsed":false,"id":"CE023B55ABD54BF986FE8110902952C6","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/plain":["array([,\n"," ], dtype=object)"]},"execution_count":53,"metadata":{},"output_type":"execute_result"},{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["# 绘制特定参数的森林图,选取对象为trace,选取其中'alpha','beta','sigma'三个参数\n","az.plot_forest(trace,var_names=['alpha','beta','sigma'],r_hat=True)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"2F843DEC768149AF8C026DEA4DB30728","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["ArviZ的plot_posterior绘制了参数的后验分布图\n","\n","该图的横轴为后验分布值,横线为94%可信区间,曲线高度代表概率密度,曲线上方展示了后验分布的平均值"]},{"cell_type":"code","execution_count":54,"metadata":{"collapsed":false,"id":"4048236000BE46C09AE141427572D4F2","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stdout","output_type":"stream","text":["beta0大于0的概率为0.99625\n","beta1大于0的概率为0.998\n"]},{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["# 参数的后验分布图,选取对象为trace,选取其中'alpha','beta','sigma'三个参数\n","az.plot_posterior(trace,var_names=['alpha','beta','sigma'])\n","# beta[0]和beta[1]的后验分布值大于0的比例,若大于95%则可认为对应的自变量对因变量有正向影响\n","print(f\"beta0大于0的概率为{(trace.posterior.beta[:, :, 0] > 0).mean().values}\")\n","print(f\"beta1大于0的概率为{(trace.posterior.beta[:, :, 1] > 0).mean().values}\")"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"249337D6BC3044D78B37CB71478A4D64","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["下表为模型参数的基本信息:\n","\n","`mean`和`sd` 为各参数后验分布的均值和标准差;\n","`hdi_3%`和`hdi_97%` 为参数后验分布94%可信区间的下限和上限;\n","`mcse_mean`和`mcse_sd` 为均值和标准差对应的MCMC采样标准误;\n","由于采样过程中存在自相关,不是所有样本都是有效的,`ess_bulk`和`ess_tail` 反映了MCMC采样的有效样本数量,其中ess_bulk表明分布主体部分的有效样本数的大小。\n","r_hat, $\\hat{R}$, 为参数收敛性的指标。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"25B6829ACB214D86B43E59156D54E007","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 模型诊断\n","\n","通过模型思维进行数据分析需要注意模型检验,即检验模型是否能有效地反映数据的特征。"]},{"cell_type":"code","execution_count":46,"metadata":{"collapsed":false,"id":"4D880D7A54474800B6644B5D3BA88FFB","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":["
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
meansdhdi_3%hdi_97%mcse_meanmcse_sdess_bulkess_tailr_hat
alpha-0.3130.022-0.354-0.2710.00.03305.03167.01.0
beta[0]-0.0030.023-0.0430.0420.00.03337.02697.01.0
beta[1]0.7800.0230.7390.8250.00.02875.02759.01.0
sigma0.6850.0140.6590.7100.00.03048.02821.01.0
\n","
"],"text/plain":[" mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk \\\n","alpha -0.313 0.022 -0.354 -0.271 0.0 0.0 3305.0 \n","beta[0] -0.003 0.023 -0.043 0.042 0.0 0.0 3337.0 \n","beta[1] 0.780 0.023 0.739 0.825 0.0 0.0 2875.0 \n","sigma 0.685 0.014 0.659 0.710 0.0 0.0 3048.0 \n","\n"," ess_tail r_hat \n","alpha 3167.0 1.0 \n","beta[0] 2697.0 1.0 \n","beta[1] 2759.0 1.0 \n","sigma 2821.0 1.0 "]},"execution_count":46,"metadata":{},"output_type":"execute_result"}],"source":["# 参数的后验分布图,选取对象为trace,选取其中'alpha','beta','sigma'三个参数\n","az.summary(trace,var_names=['alpha','beta','sigma'])"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"91C989AC007646468B5B291113E8878B","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["### 后验预测检验 ppc (posterior predictive check)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"23F7FAFFAD9441E6A165587F1BDC88D6","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["后验预测分布就是利用现有数据对未来数据进行预测,即\n","\n","$p(\\tilde{Y}|Y)=\\int_{\\Theta }^{}p(\\tilde{Y}|\\theta)p(\\theta|Y)d\\theta$"]},{"cell_type":"code","execution_count":83,"metadata":{"collapsed":false,"id":"01FB7D4CCD1F46A69B2A681269BEE308","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":["\n","\n"],"text/plain":[""]},"metadata":{},"output_type":"display_data"},{"data":{"text/html":["\n","
\n"," \n"," 100.00% [4000/4000 00:06<00:00]\n","
\n"," "],"text/plain":[""]},"metadata":{},"output_type":"update_display_data"},{"name":"stderr","output_type":"stream","text":["/opt/conda/lib/python3.7/site-packages/arviz/data/io_pymc3.py:100: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context.\n"," FutureWarning,\n"]}],"source":["# 后验预测分布的计算仍在容器中进行\n","with linear_model:\n"," # pm.sample_posterior_predictive()利用trace.posterior的后验分布计算后验预测分布\n"," ppc_y = pm.sample_posterior_predictive(trace.posterior) \n","#将ppc_y转化为InferenceData对象合并到trace中\n","az.concat(trace, az.from_pymc3(posterior_predictive=ppc_y), inplace=True)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"38256AF71DC6480F918B4030CD281929","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["后验预测分布展示如下,横轴为观察到的数据,纵轴为概率密度,\n","\n","黑色实线为真实的观测值的分布,蓝色实线为根据不同后验分布的参数值模拟得到的预测分布,\n","\n","蓝色虚线为不同后验分布的参数值模拟得到的预测分布的平均值。"]},{"cell_type":"code","execution_count":84,"metadata":{"collapsed":false,"id":"5D25E7EC4D824B2F97EA73538D1CD1EA","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/plain":[""]},"execution_count":84,"metadata":{},"output_type":"execute_result"},{"name":"stderr","output_type":"stream","text":["/opt/conda/lib/python3.7/site-packages/IPython/core/events.py:89: UserWarning: Creating legend with loc=\"best\" can be slow with large amounts of data.\n"," func(*args, **kwargs)\n"]},{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["# 绘制后验预测分布\n","az.plot_ppc(trace)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"1958F26F212A4B50907AC7C4159FB897","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["# Part 2: Workflow"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"FA6A7C7B83104BC48F42F8CE727E7454","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["## $p(\\theta|data) = \\frac{p(data|\\theta)p(\\theta)}{{p(data)}}$\n","\n","贝叶斯公式只涉及到计算后验分布,那么我们上述的一系列流程有什么作用呢?\n","\n","这就是贝叶斯的workflow"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"B608751E5EB2494587F7A7AD5B4C6040","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 什么是workflow\n","\n","贝叶斯推理是对条件概率或者概率密度的计算。\n","\n","贝叶斯的workflow除了计算模型的后验分布外还包括研究问题,选择模型,选择先验,模型拟合,采样过程评估,模型诊断,模型比较,统计推断,结果报告。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"E67B1D69223C4EC8ACC5F2A4563ADAB2","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 为什么我们需要workflow?\n","\n","一个完整的workflow是为了更好地解决问题。\n","\n","贝叶斯推断本身就是很复杂的推断过程,这一推断过程涉及到一些列的工作流程。\n","\n","如果不遵循这个过程,或者忽视这个过程中的一些环节,在推断的过程中就会出现遗漏,会对结果产生影响。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"86191C6973D847F18123084960BA8937","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 
workflow能带给我们什么\n","\n","(1)在贝叶斯推理中,计算是很困难的。对于简单的模型,我们可以直接计算其后验分布。但是对于复杂的模型,我们需要使用一系列近似算法,例如MCMC(马尔可夫链蒙特卡洛)方法,并对拟合过程进行检验。\n","\n","(2)一般情况下,我们不会提前知道我们想要的合适的模型,即使提前选择了可接受的模型,我们也会希望在收集更多数据后找到更合适的模型。\n","\n","(3)即使我们得到了合适的模型并且拟合效果很好,但我们仍想要知道拟合的模型和数据之间的关系。\n","\n","(4)有时候,不同的模型会产生不同的结果,呈现多个模型有助于说明模型的不确定性。\n","\n"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"A8A55F55BB9048D78F598358F70E899F","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### workflow的内容有哪些"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"A468CCDE3A22462A9C6851929B064811","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["![Image Name](https://cdn.kesci.com/upload/image/rki1v9nkk5.png?imageView2/0/w/960/h/960)\n","\n","(Gelman et al. 2020)\n","\n","这是Gelman等人在2020年提出的workflow,该workflow比较完整,展示了模型分析过程中可能出现的步骤,出现的问题和相应的调整方法。\n","\n","Gelman的workflow会涉及到对计算的检查,会通过一系列的方法来验证计算过程是否是有效的,比如模拟假数据。\n","\n","但这个workflow涉及到的过程过于全面,超出课程的安排,故我们采用了相对简化一点的workflow"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"B6D377C2484649518912D85F50ED6378","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["基于Martin等人2021年Bayesian Modeling and Computation in Python中提出的workflow,我们提出了一个简化版本的workflow,该workflow主要涉及这些部分:\n","\n","研究问题,选择模型,选择先验,模型拟合,采样过程评估,模型诊断,模型比较,统计推断,结果报告。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"5CE0698F4DA0446DABC37AAA63C7325B","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["![Image 
Name](https://cdn.kesci.com/upload/image/rkkf9il8hy.png?imageView2/0/w/960/h/960)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"397C38F802D946B18FF028819B9553E6","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["接下来我们将对这个workflow的每一个部分做一个详细的介绍。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"9E0AFEC0AD4A44858B7E5BB9C68D2560","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 研究问题\n","\n","在进行分析之前,最重要的就是先明确我们的问题,并决定这个问题是否要使用贝叶斯的方法来进行回答。\n","\n","如果我们面临这样一个情景,我们想要知道抑郁症患者的认知控制能力与健康人有无差异,我们在收集到数据之后,是否需要用到贝叶斯的workflow?\n","\n","我们可以直接对这批数据进行统计推断,用贝叶斯的workflow反而把简单问题复杂化了。\n","\n","我们首先需要明确我们要解决的问题是适用于贝叶斯方法的,或者是有使用贝叶斯方法的必要的,进行workflow才有意义。\n","\n","我们需要根据我们的研究问题进行实验设计或者调查以获得我们想要的数据。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"126E4D050D63401AB064A5C7592B4489","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["回到我们的问题:\n","\n","社交网站的使用可能在知识共享和个体创新行为中起到了关键作用,而个体在使用社交网站时往往能获得在线社会支持,可以帮助个体提高对自身创新行为产生的能力和信心的评估——即创新自我效能感。\n","\n","假设我们将探究大学生的社交网站使用和创新的自我效能感对个体创新行为的关系。\n","\n","我们首先需要收集能够测量大学生这三种心理特质的工具,并收集到相应的数据,我们才可以进行接下来的数据分析。\n","\n","大学生的社交网站使用和创新的自我效能感对个体创新行为的关系是未知的,需要通过尝试不同的模型,来试图找到他们之间的最佳的关系,这就需要用到贝叶斯workflow的内容。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"A13FECCB2B3E4865959B968D142672D8","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 
选择模型\n","\n","我们需要选择合适的模型来对数据进行解释,这个过程是可以迭代的,我们不可能一开始就找到对数据解释得最好的模型,但我们可以通过workflow后面的步骤来对模型的效果进行检查,这也是workflow的好处之一。我们可以根据后续步骤的反馈回过来对模型成分进行调整。\n","\n","在贝叶斯workflow里,我们需要用概率的形式对模型进行表达。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"51B92657AB92490C801BBD6F83E04231","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["由于我们想要探究的问题涉及到一个因变量和两个自变量,我们可以构建最简单的线性模型。\n","\n","线性模型的一般形式:\n","\n","$\\hat{y} = a + b_1 x_1 + b_2 x_2$\n","\n","$y_i = \\hat{y} + \\epsilon$\n","\n","线性模型可以用概率的形式进行表达:\n","\n","$\\mu_i = \\alpha+\\beta_1 * x_1 +\\beta_2 * x_2$\n","\n","\n","$y \\sim Normal(\\mu_i, \\sigma)$ \n","\n","这就是模型的似然\n","\n","\n","\n","\n"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"5E98B54E6D0B42E481E0C244A5CB6CBD","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 选择先验 \n","\n","$p(\\theta|data) \\propto p(data|\\theta)p(\\theta)$\n","\n","在贝叶斯模型中,除了模型的似然,我们还要选择模型的先验。\n","\n","在lecture5我们提到不同的先验和数据的结合会影响后验分布,所以在选择先验时,我们会根据以往研究来确定先验,或者通过先验预测检查来确定先验。我们可以根据检查的结果或者workflow后续步骤的结果回过来对先验进行调整。\n"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"83A37C0E974A4F169484E7875019B096","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["该模型共涉及到四个参数:截距$\\alpha$、斜率$\\beta_1$和$\\beta_2$,以及残差标准差$\\sigma$,它们既是我们的模型需要推断的对象,也需要为其设置先验。\n","\n","我们选择的先验为:\n","\n","$\\alpha \\sim Normal(0,1)$ \n","\n","$\\beta_{i} \\sim Normal(0,1)$ \n","\n","$\\sigma \\sim HalfNormal(1)$ \n"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"53A6B7B674E1490B8BC6D1B6702AC418","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["##### 先验预测检查(Prior Predictive 
Check)\n","\n","在实际工作中,当做出先验假设(可能是不合适的)后,无需以观测数据为条件,就可以通过从模型中采样来获得先验预测分布( Prior Predictive Distribution )。\n","\n","这种利用先验和模型生成样本,并用样本来评估先验的过程,被称为先验预测检查。"]},{"cell_type":"code","execution_count":17,"metadata":{"collapsed":false,"id":"29A45A9370D245DB9C39F20B569C34DC","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["with pm.Model() as linear_model:\n"," # 先验分布: alpha, beta, sigma这三个参数是随机变量\n"," alpha = pm.Normal('alpha',mu=0,sd=1)\n"," beta = pm.Normal('beta',mu=0,sd=1, shape=2) # shape 用来表示同样的分布的参数的数量\n"," sigma = pm.HalfNormal('sigma',sd=1)\n"," \n"," # 先验预测检查\n"," prior_checks = pm.sample_prior_predictive(samples=50)"]},{"cell_type":"code","execution_count":18,"metadata":{"collapsed":false,"id":"30296BFEEE5645A28874073EA0A0C4C1","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stderr","output_type":"stream","text":["/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2: MatplotlibDeprecationWarning: Axes3D(fig) adding itself to the figure is deprecated since 3.4. Pass the keyword argument auto_add_to_figure=False and use fig.add_axes(ax) to suppress this warning. The default value of auto_add_to_figure will change to False in mpl3.5 and True values will no longer work in 3.6. This is consistent with other Axes classes.\n"," \n"]},{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["fig = plt.figure()\n","ax = Axes3D(fig) # 将画布对象改为3D对象\n","x1 = np.linspace(-3, 3, 50) # 生成-3到3之间的50个等距假数据\n","x2 = np.linspace(-3, 3, 50)\n","\n","for a, b in zip(prior_checks[\"alpha\"], prior_checks[\"beta\"]):\n"," y = a + b[0] * x1 + b[1]*x2 # 基于假数据生成预测值\n"," ax.plot(x1,x2,y)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"3EDECF0064FB45EC80391254620FED04","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["我们可以看到,基于参数的先验生成的因变量的取值范围在-4到6之间,与真实情况相差不大,可以接受。\n","\n","若我们改变一下先验(把$\\alpha$和$\\beta$的先验标准差从1改为10),则会看到因变量取值范围在-100到100之间,这与真实情况相差较大,故需要调整。"]},{"cell_type":"code","execution_count":19,"metadata":{"collapsed":false,"id":"B525EE2480444110805F0EE804148DDD","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["with pm.Model() as linear_model:\n"," # 先验分布: alpha, beta, sigma这三个参数是随机变量\n"," alpha = pm.Normal('alpha',mu=0,sd=10)\n"," beta = pm.Normal('beta',mu=0,sd=10, shape=2) # shape 用来表示同样的分布的参数的数量\n"," sigma = pm.HalfNormal('sigma',sd=1)\n"," \n"," # 先验预测检查\n"," prior_checks = pm.sample_prior_predictive(samples=50)"]},{"cell_type":"code","execution_count":20,"metadata":{"collapsed":false,"id":"D351E620DF2C472D8F4A1C77757BDEA3","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stderr","output_type":"stream","text":["/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2: MatplotlibDeprecationWarning: Axes3D(fig) adding itself to the figure is deprecated since 3.4. Pass the keyword argument auto_add_to_figure=False and use fig.add_axes(ax) to suppress this warning. The default value of auto_add_to_figure will change to False in mpl3.5 and True values will no longer work in 3.6. 
This is consistent with other Axes classes.\n"," \n"]},{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["fig = plt.figure()\n","ax = Axes3D(fig) # 将画布对象改为3D对象\n","x1 = np.linspace(-3, 3, 50) # 生成-3到3之间的50个等距假数据\n","x2 = np.linspace(-3, 3, 50)\n","\n","for a, b in zip(prior_checks[\"alpha\"], prior_checks[\"beta\"]):\n"," y = a + b[0] * x1 + b[1]*x2 # 基于假数据生成预测值\n"," ax.plot(x1,x2,y)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"F1D9765D062A49C991A5161E94AF7852","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 模型拟合\n","\n","在我们设置好模型的内容之后,我们就可以将概率的形式转化成代码的形式,并用合适的模型拟合方法来对参数进行推断。\n","\n","\n","$y_i \\sim Normal(\\mu_i, \\sigma)$ -> y_obs $\\sim$ Normal(mu, sigma)\n","\n","$\\mu_i = \\alpha+\\beta_1 * x_1 +\\beta_2 * x_2$ -> mu = alpha + beta[0] * x1 + beta[1] * x2\n","\n","$\\alpha \\sim Normal(0,1)$ -> alpha $\\sim$ Normal(0, 1)\n","\n","$\\beta_{i} \\sim Normal(0,1)$ -> beta $\\sim$ Normal(0, 1)\n","\n","$\\sigma \\sim HalfNormal(1)$ -> sigma $\\sim$ HalfNormal(1)\n"]},{"cell_type":"code","execution_count":null,"metadata":{"collapsed":false,"id":"5044BDED17C449FD86EF96403AE20133","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["#首先,载入pymc3这个包,将其命名为pm\n","import pymc3 as pm\n","# 在pymc3中,pm.Model()定义了一个新的模型对象,这个对象是模型中随机变量的容器\n","# 在python中,容器是一种数据结构,是用来管理特殊数据的对象\n","# with语句定义了一个上下文管理器,以 linear_model 作为上下文,在这个上下文中定义的变量都被添加到这个模型\n","with pm.Model() as linear_model:\n"," # 先验分布: alpha, beta, sigma这三个参数是随机变量\n"," alpha = pm.Normal('alpha',mu=0,sd=1)\n"," beta = pm.Normal('beta',mu=0,sd=1, shape=2) # shape 用来表示同样的分布的参数的数量\n"," sigma = pm.HalfNormal('sigma',sd=1)\n","\n"," # x1,x2为自变量,是之前已经载入的数据\n"," x1 = pm.Data(\"x1\", data['SNS_t'])\n"," x2 = pm.Data(\"x2\", data['CSES_t'])\n","\n"," # 线性模型:mu是确定性随机变量,这个变量的值完全由右端值确定\n"," mu = pm.Deterministic(\"mu\", alpha + beta[0]*x1 + 
beta[1]*x2) \n","\n"," # Y的观测值,这是一个特殊的观测随机变量,表示模型数据的可能性。也可以表示模型的似然,通过 observed 参数来告诉这个变量其值是已经被观测到了的,不会被拟合算法改变\n"," y_obs = pm.Normal('y_obs',mu=mu,sd=sigma,observed=data['EIB_t'] )"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"6E47BD0579234CCC934B06DFAD1A0E06","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 采样过程评估\n","\n","在pymc中,我们一般会采用mcmc来对模型进行采样。\n","\n","如果采样过程不理想,模型的参数没有收敛,我们就会调整采样过程的参数,如果一直没有收敛,我们就会回到之前的步骤,选择更合适的模型和先验。\n","\n","对采样过程的评估我们会采用目视检查或rhat这个指标。"]},{"cell_type":"code","execution_count":null,"metadata":{"collapsed":false,"id":"C76B6F098EAC4D62BF5D505636AA6F1B","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["#采样过程仍在该容器中进行\n","with linear_model :\n"," # 使用mcmc方法进行采样,draws为采样次数,tune为调整采样策略的次数,这些次数将在采样结束后被丢弃,\n"," # target_accept为接受率, return_inferencedata=True为该函数返回的对象是arviz.InnferenceData对象\n"," # chains为我们采样的链数,cores为我们的调用的cpu数,多个链可以在多个cpu中并行计算,我们在和鲸中调用的cpu数为2\n"," trace = pm.sample(draws = 2000, tune=1000, target_accept=0.9,chains=2, cores= 2,return_inferencedata=True)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"C1FD8EC84AB54403818131D461CFA822","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 模型诊断\n","\n","在采样过程有效的情况下,我们可以通过有效样本计算出模型的后验分布,但该后验分布只是该模型下贝叶斯推断的结果,我们还要根据后验分布来检查模型是否真的可以对数据进行解释。\n","\n","我们会使用后验预测分布通过我们得到的参数生成一批模拟数据,并将其与真实数据进行对比。\n"]},{"cell_type":"code","execution_count":null,"metadata":{"collapsed":false,"id":"804F825F72A247E08A4BCBF28EB3F486","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["# 后验预测分布的计算仍在容器中进行\n","with linear_model:\n"," # 
pm.sample_posterior_predictive()利用trace.posterior的后验分布计算后验预测分布\n"," ppc_y = pm.sample_posterior_predictive(trace.posterior) \n","#将ppc_y转化为InferenceData对象合并到trace中\n","az.concat(trace, az.from_pymc3(posterior_predictive=ppc_y), inplace=True)"]},{"cell_type":"code","execution_count":null,"metadata":{"collapsed":false,"id":"D64E39385D5F4FA287F0F18BEBDCFF51","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["# 绘制后验预测分布\n","az.plot_ppc(trace)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"DF6ABD4C903843E58BCAA47C9F462669","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 模型比较\n","\n","到目前为止,我们已经使用后验预测检查来独立评估每个模型。\n","\n","这种类型的评估对于单独理解每个模型很有用。然而,当我们有多个模型时,这就引出了模型相对于彼此的性能如何的问题。\n","\n","模型比较可以进一步帮助我们了解一个模型在哪些区域可能表现良好。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"D6227F196049426D8D1CD781A1E5F577","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["首先,我们构建一个新的模型,假定该模型中其他条件均不变,观测值服从gumbel分布"]},{"cell_type":"code","execution_count":28,"metadata":{"collapsed":false,"id":"59E74B06B3FA45B6ADE05C127090DF41","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["with pm.Model() as linear_model_2:\n"," # 先验分布: alpha, beta, sigma这三个参数是随机变量\n"," alpha = pm.Normal('alpha',mu=0,sd=1)\n"," beta = pm.Normal('beta',mu=0,sd=1, shape=2) # shape 用来表示同样的分布的参数的数量\n"," sigma = pm.HalfNormal('sigma',sd=1)\n"," #先验预测检查\n"," prior_checks = 
pm.sample_prior_predictive(samples=50)"]},{"cell_type":"code","execution_count":33,"metadata":{"collapsed":false,"id":"92201AFA5EC745EF85B6C9586430E4EF","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["# 在pymc3中,pm.Model()定义了一个新的模型对象,这个对象是模型中随机变量的容器\n","# 在python中,容器是一种数据结构,是用来管理特殊数据的对象\n","# with语句定义了一个上下文管理器,以 linear_model_2 作为上下文,在这个上下文中定义的变量都被添加到这个模型\n","with pm.Model() as linear_model_2:\n"," # 先验分布: alpha, beta, sigma这三个参数是随机变量\n"," alpha = pm.Normal('alpha',mu=0,sd=1)\n"," beta = pm.Normal('beta',mu=0,sd=1, shape=2) # shape=2 表示生成两个服从同一先验分布的回归系数(beta[0]和beta[1])\n"," sigma = pm.HalfNormal('sigma',sd=1)\n","\n"," # x1,x2为自变量,是之前已经载入的数据\n"," x1 = pm.Data(\"x1\", data['SNS_t'])\n"," x2 = pm.Data(\"x2\", data['CSES_t'])\n","\n"," # 线性模型:mu是确定性随机变量,这个变量的值完全由右端值确定\n"," mu = pm.Deterministic(\"mu\", alpha + beta[0]*x1 + beta[1]*x2) \n","\n"," # Y的观测值,这是一个特殊的观测随机变量,表示模型数据的可能性。也可以表示模型的似然,通过 observed 参数来告诉这个变量其值是已经被观测到了的,不会被拟合算法改变\n"," # 与第一个模型的唯一区别:观测值的似然改为 Gumbel 分布\n"," y_obs = pm.Gumbel('y_obs',mu=mu,beta=sigma,observed=data['EIB_t'] )"]},{"cell_type":"code","execution_count":34,"metadata":{"collapsed":false,"id":"A589CB367D3B4836A3241805C1146F01","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":true,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stderr","output_type":"stream","text":["Auto-assigning NUTS sampler...\n","Initializing NUTS using jitter+adapt_diag...\n","Multiprocess sampling (2 chains in 2 jobs)\n","NUTS: [sigma, beta, alpha]\n"]},{"data":{"text/html":["\n","\n"],"text/plain":[""]},"metadata":{},"output_type":"display_data"},{"name":"stderr","output_type":"stream","text":["Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 6 seconds.\n"]}],"source":["#采样过程仍在该容器中进行\n","with linear_model_2 :\n"," # 使用mcmc方法进行采样,draws为每条链的采样次数,tune为调整采样策略的次数,这些次数将在采样结束后被丢弃,\n"," # target_accept为目标接受率, 
return_inferencedata=True表示该函数返回的对象是arviz.InferenceData对象\n"," # chains为我们采样的链数,cores为我们调用的cpu数,多个链可以在多个cpu中并行计算,我们在和鲸中调用的cpu数为2\n"," trace2 = pm.sample(draws = 2000, tune=1000, target_accept=0.9,chains=2, cores= 2,return_inferencedata=True)"]},{"cell_type":"code","execution_count":35,"metadata":{"collapsed":false,"id":"20865B22396941C9BE7D8F34CF5FED52","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":["\n","\n"],"text/plain":[""]},"metadata":{},"output_type":"display_data"},{"data":{"text/html":["\n","
\n"," \n"," 100.00% [4000/4000 00:03<00:00]\n","
\n"," "],"text/plain":[""]},"metadata":{},"output_type":"update_display_data"},{"name":"stderr","output_type":"stream","text":["/opt/conda/lib/python3.7/site-packages/arviz/data/io_pymc3.py:100: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context.\n"," FutureWarning,\n"]}],"source":["# 后验预测分布的计算仍在容器中进行\n","with linear_model_2:\n"," # pm.sample_posterior_predictive()利用trace.posterior的后验分布计算后验预测分布\n"," ppc_y = pm.sample_posterior_predictive(trace2.posterior) \n","#将ppc_y转化为InferenceData对象合并到trace中\n","az.concat(trace2, az.from_pymc3(posterior_predictive=ppc_y), inplace=True)"]},{"cell_type":"code","execution_count":36,"metadata":{"collapsed":false,"id":"587EA94AF48A4E0A9B40105776BF61BD","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/plain":[""]},"execution_count":36,"metadata":{},"output_type":"execute_result"},{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["# 绘制后验预测分布\n","az.plot_ppc(trace2)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"A4C90CCF7AC444F68B77371040279A81","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["对于两个模型哪个更好。从我们的视觉检查来看,似乎不能直接看出区别。我们可以使用 ArviZ 中的比较方法来比较这一观察结果。"]},{"cell_type":"code","execution_count":38,"metadata":{"collapsed":false,"id":"ED3A3BAB4A4946B68344345612128C90","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stderr","output_type":"stream","text":["/opt/conda/lib/python3.7/site-packages/arviz/stats/stats.py:146: UserWarning: The default method used to estimate the weights for each model,has changed from BB-pseudo-BMA to stacking\n"," \"The default method used to estimate the weights for each model,\"\n"]},{"data":{"text/html":["
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
rankloop_lood_looweightsedsewarningloo_scale
normal0-1099.30510710.1290800.0000001.000000e+0037.2619920.000000Falselog
gumbel1-1099.46748110.2252030.1623741.110223e-1637.2819630.102939Falselog
\n","
"],"text/plain":[" rank loo p_loo d_loo weight se \\\n","normal 0 -1099.305107 10.129080 0.000000 1.000000e+00 37.261992 \n","gumbel 1 -1099.467481 10.225203 0.162374 1.110223e-16 37.281963 \n","\n"," dse warning loo_scale \n","normal 0.000000 False log \n","gumbel 0.102939 False log "]},"execution_count":38,"metadata":{},"output_type":"execute_result"}],"source":["compare_dict = {\"normal\": trace,\"gumbel\": trace2}\n","comp = az.compare(compare_dict, ic=\"loo\")\n","comp"]},{"cell_type":"code","execution_count":39,"metadata":{"collapsed":false,"id":"A73CC6CFA6AE45D19D2C5C6CBE6E5B1E","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/plain":[""]},"execution_count":39,"metadata":{},"output_type":"execute_result"},{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["az.plot_compare(comp)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"94C1B94B3EDB4FD49520D031B24FA8D7","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["在 log 尺度下,loo 值越高说明模型的预测表现越好,因此我们一般会选择 rank 靠前(即 loo 值更高)的模型作为最佳模型。除了loo外还有AIC、BIC等指标,我们会在以后的课程中给大家讲解。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"16845DCCB046485C894289D6EEB1D028","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 统计推断"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"7FDDA4F5AF2F4DB88DE828B84558DC09","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["在选择最佳模型之后,我们就会根据最佳模型进行统计推断。\n","\n"]},{"cell_type":"code","execution_count":55,"metadata":{"collapsed":false,"id":"D81CEF1C4A87484A870F070B02775D05","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stdout","output_type":"stream","text":["beta0大于0的概率为0.99625\n","beta1大于0的概率为0.998\n"]},{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["# 参数的后验分布图,选取对象为trace,选取其中'alpha','beta','sigma'三个参数\n","az.plot_posterior(trace,var_names=['alpha','beta','sigma'])\n","# beta[0]和beta[1]的后验分布值大于0的比例,若大于95%则可认为对应的自变量对因变量有影响\n","# 注意:trace.posterior.beta 的维度为 (chain, draw, 系数),需在最后一个维度上选取系数\n","print(f\"beta0大于0的概率为{(trace.posterior.beta[:, :, 0] > 0).mean().values}\")\n","print(f\"beta1大于0的概率为{(trace.posterior.beta[:, :, 1] > 0).mean().values}\")"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"C854CF377ADF43DA8C79BCF8030A6FAF","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["由于beta[0]和beta[1]的采样值有超过99%都大于0,于是我们可以认为这两个变量对因变量产生了影响。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"18EE46DC040E4DB68495C62825F21E8B","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 结果报告"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"F887E783CEC9450B8A2C327D9526271D","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["大学生的社交网站使用和创新自我效能感会对个体创新行为产生显著的正向影响。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"96A6C7070FAE422AB5F82CCDD17F75CC","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["# Practice\n"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"7C05B198BBC449B888819BE0C3BFC025","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["## Workflow"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"CA4D8ACF66FA4B41B8670988B0E2222B","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 
研究问题"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"41F7BA7522764747AC1D5483CF3D3577","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["还有研究发现个体创新行为可能与自尊水平有关。\n","SES_t:自尊水平;\n","EIB_t:创新行为\n","\n","试探究两者之间的关系。"]},{"cell_type":"code","execution_count":null,"metadata":{"collapsed":false,"id":"3E1441473A0044AB82E2AD07EECF6CD6","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["data['SES_t'] = (data['SES_t'] - data['SES_t'].mean()) / data['SES_t'].std() #将变量进行标准化\n","\n","data['EIB_t'] = (data['EIB_t'] - data['EIB_t'].mean()) / data['EIB_t'].std() #将变量进行标准化\n","\n","plt.scatter(data['SES_t'],data['EIB_t'])\n","plt.xlabel('SES_t')\n","plt.ylabel('EIB_t')"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"DCC7B6BBFB09450F80088400F0F96F2C","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 选择模型"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"BB2E6CBC0F7A48DEBD859F368A466CA7","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["尝试构建一个简单的线性模型\n","\n","线性模型可以用概率的形式进行表达\n","\n","\n","$\\alpha \\sim Normal(0,1)$ -> alpha $\\sim$ Normal(0,1)\n","\n","$\\beta \\sim Normal(0,1)$ -> beta $\\sim$ Normal(0,1)\n","\n","$\\sigma \\sim HalfNormal(1)$ -> sigma $\\sim$ HalfNormal(1)\n","\n","$\\mu_i = \\alpha + \\beta * x$ -> mu = alpha + beta*x \n","\n","$y \\sim Normal(\\mu_i,\\sigma)$ -> y $\\sim$ Normal(mu,sigma)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"8DB0F9B932D940B786C3617DD25C2EDF","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 
选择先验"]},{"cell_type":"code","execution_count":null,"metadata":{"collapsed":false,"id":"BD6BE331DE634FE785A6060175F429B3","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["########################################################\n","# 练习\n","# 请尝试用代码表达模型的先验\n","########################################################\n","with pm.Model() as linear_model:\n"," # 先验分布: alpha, beta, sigma这三个参数是随机变量\n"," alpha = ...\n"," beta = pm.Normal('beta',mu=0,sd=1) \n"," sigma = ...\n"," \n"," # 先验预测检查\n"," prior_checks = pm.sample_prior_predictive(samples=50)"]},{"cell_type":"code","execution_count":null,"metadata":{"collapsed":false,"id":"1ADDD4F654D048728CD0B7DA51418D33","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["fig = plt.figure()\n","x1 = np.linspace(-3, 3, 50) #生成从-3到3之间的50个假数据\n","\n","for a, b in zip(prior_checks[\"alpha\"], prior_checks[\"beta\"]):\n"," y = a + b * x1 \n"," plt.plot(x1,y)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"9F6272FA34C240859DDA51BC24A19A4C","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 模型拟合"]},{"cell_type":"code","execution_count":null,"metadata":{"collapsed":false,"id":"6C343DCE8B7040B784EA97742217B91F","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["########################################################\n","# 练习\n","# 请尝试表达自变量与预测值的线性关系,并将...替换为表达式\n","########################################################\n","linear_model = pm.Model()\n","with linear_model :\n"," #\n"," alpha = pm.Normal('alpha',mu=0,sd=1)\n"," beta = pm.Normal('beta',mu=0,sd=1)\n"," sigma = pm.HalfNormal('sigma',sd=1)\n"," #\n"," x 
= pm.Data(\"x\", data['SES_t'])\n"," #\n"," mu = pm.Deterministic(\"mu\", ...) \n"," #\n"," y_obs = pm.Normal('y_obs',mu=mu,sd=sigma,observed=data['EIB_t'])"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"79E27FECC9704144BB0266A0494AF5C5","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 采样过程评估"]},{"cell_type":"code","execution_count":null,"metadata":{"collapsed":false,"id":"E2C2D915ED1044C095AAA524FBF9D793","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["########################################################\n","# 练习\n","# 请尝试调用pm.sample()函数,用mcmc进行采样\n","########################################################\n","..."]},{"cell_type":"code","execution_count":null,"metadata":{"collapsed":false,"id":"F7D6B1475987489F9F7C1EC643A01B42","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["########################################################\n","# 练习\n","# 请尝试调用az.plot_trace()函数,检查模型收敛情况\n","########################################################\n","..."]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"05079881F3E940158D5E45BC4E71F278","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["## Part 3: MCMC Diagnostic"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"6628683531E94FAB84E167AFDD6E2B76","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["在前面我们已经了解到,如何通过概率统计语言 PyMC3 来帮我们实施 MCMC 算法。\n","```python\n","with model:\n"," trace = pm.sample(draws = 2000, 
chains=2)\n","```\n","比如在上述代码示例中,我们对贝叶斯模型的每条链采样2000个样本,并同时运行2条马尔科夫链。\n","\n","我们凭借这些样本来代表我们的目标分布(后验分布)。\n","\n","**而问题在于:**\n","- 这些样本真的能代表目标分布吗?\n","- 我们如何检测MCMC的样本是否存在问题?"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"2D5ACE0A00FF40FF89211B7271AD0B97","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["注意:\n","- PyMC 的结果 trace 中包含多个 MCMC chain\n","- 每个 chain 中可能包括多个参数\n","- 每个参数的采样都可以代表该参数的后验分布(目标分布)\n","\n","为了方便,接下来讨论的**单个chain**都只包含**单个参数**。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"346F2A34864F4437BCEBE4D049187B6F","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["### 什么是 MCMC diagnostic?"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"1FC4E15F61524D8FABB2F7D2DB347454","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["为了检测 MCMC 样本的质量与问题,我们需要一些诊断工具,这些诊断工具统称为:马尔可夫链蒙特卡罗诊断(Markov Chain Monte Carlo diagnostics),简称 MCMC diagnostic。\n","\n","MCMC diagnostic 可以用来检查用MCMC算法采样得到的样本是否足以提供目标分布的精确近似。\n","\n","MCMC diagnostic 的目的:\n","1. 发现 MCMC 样本存在的问题。\n","2. 
根据诊断对 MCMC 算法进行修正(fix)。\n","\n","MCMC diagnostic 检查的具体内容:\n","- MCMC样本的大部分采样是否来自目标分布(target distribution)。\n","- MCMC样本量是否足够近似目标分布。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"5DA3DAC5DA184FEDB5A80B505FB14CD4","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["### MCMC 样本的两大问题"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"9F02AC9DA9B547898DF15081B3A11A8F","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["要了解 MCMC diagnostic 检测到底是什么,首先我们需要知道 MCMC 样本最常见的问题。\n","\n","MCMC 样本最常见的两大问题:\n","1. bias: 样本是否代表整个后验分布?\n","2. precision: 是否有足够的样本来获得足够精确的统计推断?\n","\n","这两个问题涉及到 MCMC 诊断的两个专有概念:收敛性与有效样本量。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"1302A2D8C6874E7F8032E14C26960DDD","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["1. 收敛(convergence)问题。\n"," \n"," 由于马尔科夫链的性质,无论 MCMC 算法的起始值(initial or starting value)和起始分布为何,随着采样的进行,样本最终都应该接近目标分布。\n","\n"," 而由于各种问题(比如,代码错误,或者先验定义错误),样本无法接近目标分布,我们称为 MCMC chain **无法收敛**。见下图2。\n"," \n"," 即使 MCMC chain 可以收敛,如果起始值离目标分布太远,那么它收敛所需的时间也更长,见下图3。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"149D8AF248DC45A18BAD6C29E5029F3A","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["2. 
有效样本量(effective sample size)过于少的问题。\n","\n"," 可以想象,当样本量越多,样本分布越接近目标分布,但是当样本数量过少时,可能对目标分布产生错误的推断。那什么因素会导致采样样本的数量变少?\n"," \n"," 由于马尔科夫链的性质,当前采样 $\\theta_{t}$ 依赖于 上一次的采样 $\\theta_{t-1}$。因此,样本间的关系不是独立的,这违反了独立采样的假设。违反独立采样的后果可能就是 MCMC chain 无法收敛,如下图2。\n","\n"," 可以想象,如果 $\\theta_{t}$ 与 $\\theta_{t-n}$ 的间隔 n 越大,两个样本的相关性越低。\n","\n"," **通过每隔 n 保留一个采样可以减少样本相关性的影响,但代价就是有效样本量变少**。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"F8A772D4D6AC40EDACCD77D480D2814E","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["图1:参数收敛时MCMC采样的样本\n","\n","\n","![Image Name](https://cdn.kesci.com/upload/image/rkijyv4px4.png?imageView2/0/w/300/h/640)\n","\n","\n","图2:参数不收敛时MCMC采样的样本\n","\n","![Image Name](https://cdn.kesci.com/upload/image/rkijynzzao.png?imageView2/0/w/300/h/640)\n","\n","图3:起始值过于远离目标分布时的MCMC采样\n","\n","![Image Name](https://cdn.kesci.com/upload/image/rkijy7xre5.png?imageView2/0/w/300/h/640)\n","\n","source: https://statlect.com/fundamentals-of-statistics/Markov-Chain-Monte-Carlo-diagnostics"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"273493C6105540A68FEFE0D46A6919FB","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["### MCMC 视觉检查 (Visual Diagnostics)\n","\n","检查 MCMC 采样质量最直观的方式就是视觉检查 (Visual Diagnostics)。\n","\n","上面的采样轨迹图就是视觉检查的一种形式。所以我们对此并不陌生。\n","\n","常见的视觉检查有两类:\n","- 轨迹图 Trace plots。即上面的图。\n","- 自相关函数图 Autocorrelation function (ACF) plots。\n"," \n"," 前面提到样本间存在相关性,而这种自相关性会导致轨迹图“不收敛”。而不收敛的情况有很多,并不能从轨迹图中得知是因为自相关导致了不收敛,因此我们需要自相关函数图来诊断是否存在自相关的问题。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"EC5671E8ED8F40AEA5E68A4218E09813","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["接下来我们结合代码示例进行演示。\n","\n","以前使用过的 `arviz` 工具包就可以完成大部分的诊断工作。所以接下来的分析我们会依赖于 
arviz。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"ACCCCAC29CB645618E1ED48C5A4BFCF8","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["为了演示上述不同问题对于 MCMC 采样的影响,我们生成了三种 MCMC chain:\n","\n","- good_chains 包含两条 MCMC chain,生成自一个 $\\beta$ 分布,并且两次采样间是完全独立的。\n","- bad_chains0 在 good_chains 的基础上,加入了采样间的自相关。\n","- bad_chains1 在 good_chains 的基础上,在采样的局部加入高度自相关。bad_chains1 代表了一个非常常见的场景,采样器可以很好地解析参数空间的一个区域,但是很难对一个或多个区域进行采样。\n","\n","\n","> 代码源自:2.4 Diagnosing Numerical Inference in Martin, from O. A., Kumar, R., & Lao, J. (2021). Bayesian Modeling and Computation in Python (1st ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9781003019169\n"]},{"cell_type":"code","execution_count":70,"metadata":{"collapsed":false,"id":"1EE9394D835A4192A7CBF3183F831FCB","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["good_chains = stats.beta.rvs(2, 5,size=(2, 2000))\n","bad_chains0 = np.random.normal(np.sort(good_chains, axis=None), 0.05, size=4000).reshape(2, -1)\n","bad_chains1 = good_chains.copy()\n","for i in np.random.randint(1900, size=4):\n"," bad_chains1[i%2:,i:i+100] = np.random.beta(i, 950, size=100)\n","\n","chains = {\"good_chains\":good_chains,\n"," \"bad_chains0\":bad_chains0,\n"," \"bad_chains1\":bad_chains1}"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"C86B1786279149F380E500F097EFEA4F","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### Trace Plots"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"EF0ADBBEDF0641528AA2AFBD1CCBC6AE","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["轨迹图(trace plot)是贝叶斯中最常见的诊断图,横坐标为每个迭代步骤,纵坐标为采样值。\n","\n","从这些图中,我们能够看到 
\n","- MCMC chain 是否收敛。\n","- 同一个参数每条链是否收敛为相同的分布。\n","- 初步观察自相关程度。\n"," \n","在ArviZ中,通过调用函数az.plot_trace(.),可以在右侧得到轨迹图,在左侧得到参数分布图。"]},{"cell_type":"code","execution_count":71,"metadata":{"collapsed":false,"id":"96D4C029B1C44AB18FBF4D038AEF4EB5","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["az.plot_trace(chains)\n","plt.show()"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"E009BE84FE954264B8F656D1B16AA851","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["可以看到,\n","- good_chains 的表现最好,并且其中的两条 chain 几乎重合。\n","- bad_chains0 并不收敛,其中的两条 chain 并不趋向于同一个目标分布,并且两者有持续扩大的趋势。\n","- bad_chains1 虽然收敛,但是采样不均匀,即多个波峰(左图),表现为在局部过多采样(右图)。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"FD79A5834996434BA95F0722BF19486B","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["此外,如前面提到的,对于多条 MCMC 链,trace plot 还可以观测到不同初始值对采样的影响。\n","\n","![](https://bcss.org.my/tut/wp-content/uploads/2020/04/rw3.gif)\n","\n","source: https://bcss.org.my/tut/bayes-with-jags-a-tutorial-for-wildlife-researchers/mcmc-and-the-guts-of-jags/mcmc-diagnostics/\n","\n","比如,由于初始值过于极端,导致需要花费多次迭代才能达到平稳分布。\n","\n","![Image Name](https://cdn.kesci.com/upload/image/rkijy7xre5.png?imageView2/0/w/400/h/640)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"85036891737B422A80E275DB2FD0BB93","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### Autocorrelation Plots\n","\n","正如我们在讨论有效样本容量时所看到的,自相关减少了样本中包含的样本量。\n","\n","因此我们希望自相关尽量小的同时样本量足够多。\n","\n","我们可以用`az.plot_autocorr`绘制自相关图(autocorrelation plot)检查自相关性。"]},{"cell_type":"code","execution_count":72,"metadata":{"collapsed":false,"id":"030E1856D1A341C980ED89F2A7D7C50B","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["az.plot_autocorr(chains, combined=True) # combined=True表明将两条chains合并为1条\n","plt.show()"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"94D1AEE551D34A039D67CCCA266064F9","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["上图为自相关函数在100步间隔窗口上的条形图。\n","- 横坐标为 $\\theta_{t}$ 与 $\\theta_{t+n}$ 的间隔 n,该间隔越大,两个样本的相关性越低\n","- 纵轴为 $\\theta_{t}$ 与 $\\theta_{t+n}$ 的相关性。\n","- 灰色带表示所有相关系数形成的95%置信区间。\n","\n","该图结果显示:\n","- good_chains 的纵向高度接近于零(并且大部分在灰色带内),这表明自相关性非常低。\n","- bad_chains0 和 bad_chains1 中的高条表示自相关性值较大。\n","- bad_chains1 随着 n 的间隔增大,自相关减少。而 bad_chains0 不会随着 n 增大减少,说明该 MCMC chain的自相关问题非常严重。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"55C6613FD27B4C099F25A08F453C8A75","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["为了更好的理解自相关系数(y轴)如何计算,以及 间隔n对自相关的影响。\n","\n","我们可以自定义一个函数:\n","- 首先,取出 $\\theta_{t}$\n","- 然后,取出 $\\theta_{t+n}$\n","- 最后计算两者的相关系数\n","\n","需要注意,在当前的例子中,t的总长度为2000,每当 n 增加1,$\\theta_{t}$ 和 $\\theta_{t+n}$ 的长度就会减少1。因为间隔不可能超过2000。"]},{"cell_type":"code","execution_count":73,"metadata":{"collapsed":false,"id":"5D48AE1C7B2149C386030332AB70CAC0","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["def autocorr(lag, chain):\n"," n = lag # 定义间隔n的大小\n"," t = 2000 # 单条 chain的总长度\n"," theta_t = chain[0:(t-1-n)] # 提取 theta_t\n"," theta_tn = chain[n:(t-1)] # 提取 theta_t+n\n"," rho = np.corrcoef(theta_t,theta_tn) # 计算相关系数\n"," print(\"间隔为\", n, \"时,的相关系数\", rho.round(2)[0][1]) # 
输出结果"]},{"cell_type":"code","execution_count":74,"metadata":{"collapsed":false,"id":"14515840622A4EC08740F93190D2AB7E","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stdout","output_type":"stream","text":["间隔为 1 时,的相关系数 0.02\n","间隔为 1 时,的相关系数 0.64\n","间隔为 1 时,的相关系数 0.31\n"]}],"source":["# 间隔 n = 1时,三条链的自相关性\n","n = 1\n","autocorr(n, chains[\"good_chains\"][0])\n","autocorr(n, chains[\"bad_chains0\"][0])\n","autocorr(n, chains[\"bad_chains1\"][0])"]},{"cell_type":"code","execution_count":75,"metadata":{"collapsed":false,"id":"99760EF9D2EF473E8DFEDCE018F9F998","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stdout","output_type":"stream","text":["间隔为 2 时,的相关系数 -0.0\n","间隔为 2 时,的相关系数 0.65\n","间隔为 2 时,的相关系数 0.29\n"]}],"source":["# 间隔 n = 2时,三条链的自相关性\n","n = 2\n","autocorr(n, chains[\"good_chains\"][0])\n","autocorr(n, chains[\"bad_chains0\"][0])\n","autocorr(n, chains[\"bad_chains1\"][0])"]},{"cell_type":"code","execution_count":76,"metadata":{"collapsed":false,"id":"0A3B2F6F13EB4B9992DC9CCD3F4E432D","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stdout","output_type":"stream","text":["间隔为 3 时,的相关系数 0.01\n","间隔为 3 时,的相关系数 0.64\n","间隔为 3 时,的相关系数 0.29\n"]}],"source":["# 间隔 n = 3时,三条链的自相关性\n","n = 3\n","autocorr(n, chains[\"good_chains\"][0])\n","autocorr(n, chains[\"bad_chains0\"][0])\n","autocorr(n, chains[\"bad_chains1\"][0])"]},{"cell_type":"code","execution_count":77,"metadata":{"collapsed":false,"id":"CB96E3612D684E9A87B90714369E5FA8","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stdout","output_type":"stream","text":["间隔为 50 时,的相关系数 -0.02\n","间隔为 50 时,的相关系数 
0.63\n","间隔为 50 时,的相关系数 0.11\n"]}],"source":["# 间隔 n = 50时,三条链的自相关性\n","n = 50\n","autocorr(n, chains[\"good_chains\"][0])\n","autocorr(n, chains[\"bad_chains0\"][0])\n","autocorr(n, chains[\"bad_chains1\"][0])"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"2679BE052C3A4934B573E2D6E2B15305","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["可以看到,结果与 autocorrelation的图结果一致:\n","- good_chains 的结果始终在0附近,表明自相关低。\n","- bad_chains0 的结果显示,当间隔n扩大到50时,自相关系数仅从0.64下降到0.63。\n","- bad_chains1 的结果显示,当间隔n扩大到50时,自相关系数从0.31下降到0.11附近。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"A44FD4DE93764E119701A77FDAB54B82","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["上述计算过程可以通过 ArviZ 自带的 `autocorr`函数进行计算。"]},{"cell_type":"code","execution_count":78,"metadata":{"collapsed":false,"id":"DC9C496993384C6588F8A3B59C12A32E","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stdout","output_type":"stream","text":["good_chains的相关系数: [ 0.02 -0. 
0.01 -0.02]\n","bad_chains0的相关系数: [0.64 0.64 0.63 0.58]\n","bad_chains1的相关系数: [0.3 0.29 0.29 0.11]\n"]}],"source":["n = [1,2,3,50] # 定义间隔为 1,2,3 和 50\n","rhos = az.autocorr(chains[\"good_chains\"][0]).round(2)[n]\n","print(\"good_chains的相关系数:\",rhos)\n","rhos = az.autocorr(chains[\"bad_chains0\"][0]).round(2)[n]\n","print(\"bad_chains0的相关系数:\",rhos)\n","rhos = az.autocorr(chains[\"bad_chains1\"][0]).round(2)[n]\n","print(\"bad_chains1的相关系数:\",rhos)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"99C77C10115F478E8FC1881A7BE3D8F2","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["通过轨迹图和自相关图,我们可以诊断出 MCMC chain 中存在的不同问题。\n","\n","但自然而然能想到的两个问题是:\n","- 通过视觉方法判断 MCMC 的问题是否主观?能否通过客观的指标进行诊断?\n","- 发现了 MCMC 的问题,我们如何进行修正?"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"AD3956B1142043B48D16C9975E3DFD83","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["### MCMC 诊断指标"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"CD674A92366B400F8F48277FF7C5CBE1","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["通过绘图对MCMC进行检查的优势在于直观,并且能通过图推测问题在什么地方。\n","\n","但对于许多参数的模型来说,挨个检查图形是非常困难的。\n","\n","因此,通过数值进行判断一方面可以提高诊断的客观性,另一方面提高了诊断的效率。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"BCB9A10C863947548A3031FC4DAA4BAA","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["针对 MCMC 的两大问题,这里存在两个主要的指标:\n","1. 针对收敛问题的:Rhat,$\\hat{R}$,也称为“潜在规模缩减因子” (Potential Scale Reduction Factor, PSRF)。它将单个链样本的变异与混合了所有链样本的变异进行比较。\n","2. 
针对有效样本量问题的:effective sample size (ESS) or effective number of draws (n.eff)。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"41F018CE8BA840C69756E8789F3C638D","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### Convergence and Rhat"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"114B51D6E0654D329CB4530B46B5C56F","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["前面提到,收敛(convergence)描述了 MCMC 样本两个层面的问题:\n","1. MCMC 采样是否从初始值“收敛”到一个稳定的分布。\n","2. 即使链收敛到了一个稳定的分布,多条链形成的稳定分布是否相似。\n","\n","反面例子就是 bad_chains0: \n","\n","该采样既没有收敛到一个稳定的分布(右图中采样整体向上偏移),并且两个链形成的分布也不相同(左图)。\n","![Image Name](https://cdn.kesci.com/upload/image/rkimd3udb3.png?imageView2/0/w/640/h/640)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"331F0B70D62146A186AF06D0787BE05C","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["Rhat的作用就是通过一个指标反映上述两个层面的问题。\n","\n","具体逻辑为:\n","- 运行多条 MCMC 链,并且以不同的初始值开始采样。\n","- 比较每条链的相似性。\n","\n","数学逻辑为:\n","- 计算θ的所有样本的标准差,包括所有链的样本。\n","- 计算每条链的标准差,并通过平方平均数整合到一起。\n","- 比较两个标准差的大小。如果二者之比为1,说明两种变异相同,表明各链相似。\n","- 如果该值大于1.01,说明两种变异可能存在差异,即参数可能不收敛。\n","\n","source: \n","- https://www.r-bloggers.com/2020/01/applied-bayesian-statistics-using-stan-and-r/"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"F82C485B084E42E5A4F9E3047359842D","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["假设参数 $\\theta$ 有 m 条 chain,每条链有 n 个样本。\n","\n","1. 用 $s_m^2$ 表示每条链内部的方差。那么链内部的变异 within-chain variance $W$ 为所有链方差的均值 $W = \\frac{1}{M} \\, \\sum_{m=1}^M s_m^2$\n","2. 用 $\\bar{s}_m^2$ 表示由各链均值计算得到的方差。链间的变异 between-chain variance $B$ 为 $B = \\bar{s}_m^2$。\n","3. 
结合在一起得到 Rhat $\\hat{R} = \\sqrt{\\frac{W+B}{W}}$\n","\n","可见,原理和方差分析类似,如果链间变异B趋近于0,那么Rhat趋近于1,说明各链的均值几乎相同,以此反映它们的相似性。\n","\n","这里的数学公式有所简化,详细请看 https://mc-stan.org/docs/reference-manual/notation-for-samples-chains-and-draws.html。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"B39B75E35ECB42EF8C3701EBC9615758","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["Rhat计算的前提是有多条 MCMC 链,并且尽可能让每条链使用不同的起始值。\n","\n","我们可以通过 `az.rhat(.)` 简化计算过程。"]},{"cell_type":"code","execution_count":79,"metadata":{"collapsed":false,"id":"803C707461014C6A8F6A096D4ABBFB15","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":["
\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","
<xarray.Dataset>\n","Dimensions:      ()\n","Data variables:\n","    good_chains  float64 1.0\n","    bad_chains0  float64 2.406\n","    bad_chains1  float64 1.031
"],"text/plain":["<xarray.Dataset>\n","Dimensions: ()\n","Data variables:\n"," good_chains float64 1.0\n"," bad_chains0 float64 2.406\n"," bad_chains1 float64 1.031"]},"execution_count":79,"metadata":{},"output_type":"execute_result"}],"source":["az.rhat(chains)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"5F7D0041F530481882C769AC4BFA798A","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["从这个结果中我们可以看出:\n","- good_chains 的 Rhat 接近1,表示该参数的多条链都是收敛并且相似的。\n","- bad_chains0 的结果非常糟糕,这可以结合轨迹图得到佐证。\n","- bad_chains1 的结果相对较好,但其Rhat仍然大于1.01,说明该参数收敛不佳;这类问题有时难以通过轨迹图发现。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"60AAA67CF55A4753A4E96F071BC54C32","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["Rhat的最低要求是低于1.1,但现在大多数发表的文章要求小于1.01。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"B8B5604F23884ABDA58C1DFC82936340","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 有效样本量 effective sample size (ESS) 与自相关 autocorrelation"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"E266E5EF75AA45E3BB92525997BCF16E","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["前面通过自相关函数图发现,bad_chains1 的自相关会随着间隔 n 的增大而减少。\n","\n","但这样做的代价是可用的样本数量变少。\n","\n","\n","![Image Name](https://cdn.kesci.com/upload/image/rkimhrvf7h.png?imageView2/0/w/640/h/640)\n"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"06449FBAF2084318A211A85E700D98B7","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["因此,我们是否能通过一个指标反映:在满足最小自相关的条件下,即只保留每隔n个采样的样本时,可用的样本还剩下多少?\n","\n","这就是 有效样本量(effective sample 
size,ESS)指标的意义。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"09519EA6829E40FB863561B1A9C6D226","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["ESS计算的逻辑为:\n","- 在样本量为n的 MCMC链中,分别计算 1到n 采样间隔下的样本相关性 $\\rho$,这样可以得到 n 个 $\\rho_{t}$。\n","- 样本间自相关越大,$\\rho$ 越接近1;相反则接近0。因此,n 个 $\\rho_{t}$ 相加,其值越接近n,说明自相关越大。\n","- 结合这个想法,用 n 除以 $\\rho_{t}$ 之和即可得到有效样本量。\n","    \n","    $ESS = \\frac{N}{\\sum_{t = -\\infty}^{\\infty} \\rho_t}$\n","\n","    如果自相关特别大,那么 ESS 接近1;如果自相关特别小,那么 ESS 接近n。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"5F7892F715384BA1BBDC6218A481F9FA","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["我们可以通过 `az.ess(.)` 函数计算 ESS。"]},{"cell_type":"code","execution_count":80,"metadata":{"collapsed":false,"id":"24E130C91D1F4DE4B518EBE072306954","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":["
\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","
<xarray.Dataset>\n","Dimensions:      ()\n","Data variables:\n","    good_chains  float64 3.94e+03\n","    bad_chains0  float64 2.436\n","    bad_chains1  float64 89.97
"],"text/plain":["<xarray.Dataset>\n","Dimensions: ()\n","Data variables:\n"," good_chains float64 3.94e+03\n"," bad_chains0 float64 2.436\n"," bad_chains1 float64 89.97"]},"execution_count":80,"metadata":{},"output_type":"execute_result"}],"source":["az.ess(chains)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"18540CA289C9464BBE4FCE5E6CF091AC","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["可以看到:\n","- good_chains 的有效样本数量接近4000,说明样本的自相关问题少。\n","- bad_chains0 的有效样本量仅为2.4,说明该 MCMC 样本的质量非常差。\n","- bad_chains1 的结果虽然比 bad_chains0 多,但是仍然很少。\n","\n","一般建议 ESS 需要大于400,现在更多的要求是大于1000。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"34FDD10820C342C4B708A65203BFADB2","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["除了 ESS 外,另一个反映自相关的指标是 MCMC error,也称蒙特卡洛标准误(Monte Carlo standard error, MCSE)。\n","\n","它反映了 MCMC 链的变异(方差):自相关越大,变异也越大。\n","\n","我们可以通过 `az.mcse(.)` 函数计算 MCSE。"]},{"cell_type":"code","execution_count":81,"metadata":{"collapsed":false,"id":"1DF032834DB647C890A37415A0D35C9D","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":["
\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","
<xarray.Dataset>\n","Dimensions:      ()\n","Data variables:\n","    good_chains  float64 0.002507\n","    bad_chains0  float64 0.1067\n","    bad_chains1  float64 0.02095
"],"text/plain":["\n","Dimensions: ()\n","Data variables:\n"," good_chains float64 0.002507\n"," bad_chains0 float64 0.1067\n"," bad_chains1 float64 0.02095"]},"execution_count":81,"metadata":{},"output_type":"execute_result"}],"source":["az.mcse(chains)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"067901BE60974F4C8880374C4B2F4557","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["结果显示:good_chains 的 MCSE最小,bad_chains0的结果最差。\n","这与 ESS 的解释相符。"]},{"cell_type":"code","execution_count":82,"metadata":{"collapsed":false,"id":"2CD869272DC64DC88A635D52C5814E99","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":["
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
mcse_meanmcse_sdess_bulkess_tailr_hat
good_chains0.0030.0023940.03883.01.00
bad_chains00.1070.0872.011.02.41
bad_chains10.0210.01790.0182.01.03
\n","
"],"text/plain":[" mcse_mean mcse_sd ess_bulk ess_tail r_hat\n","good_chains 0.003 0.002 3940.0 3883.0 1.00\n","bad_chains0 0.107 0.087 2.0 11.0 2.41\n","bad_chains1 0.021 0.017 90.0 182.0 1.03"]},"execution_count":82,"metadata":{},"output_type":"execute_result"}],"source":["az.summary(chains, kind=\"diagnostics\")"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"3C9D9E22F78B45F0BC396AA2A392F6CC","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["### 如何改进 MCMC 采样过程"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"A001B56C468D43618A4D0F38F0C4CCCB","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["通过**视觉诊断**和**相应的指标**,我们了解了如何诊断 MCMC 样本可能出现的问题。\n","\n","在诊断出 MCMC 样本可能存在的问题后,下一步就是如何修复这些问题。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"3E1A5A8C21364A42989267DAA4A3EB26","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["常见的解决方法包括:\n","- 增加 MCMC 样本数量。\n","    \n","    比如将 `trace = pm.sample(draws = 2000)` 中的 draws 扩大为4000及以上。\n","\n","- 增加 MCMC链的数量。\n","    \n","    比如将 `trace = pm.sample(draws = 2000, chains=2)` 中的 chains 扩大为4。\n","\n","- 设置burn-in。\n","    \n","    比如设置 `trace = pm.sample(draws = 2000, tune = 1000)`,这会采样3000个样本,丢掉最开始的1000个样本。\n","\t\n","\t这样做的目的在于避免初始值太极端而导致花费太多迭代来达到平稳分布。\n","\t![Image Name](https://cdn.kesci.com/upload/image/rkijy7xre5.png?imageView2/0/w/300/h/640)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"D549DD485B654EDD8B9F15076BB27795","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["这些方法的原理非常简单,目的都是增加有效样本数量。\n","\n","但对于本身就不能收敛的模型,增加样本数量也是没有用的。\n","\n","需要注意的是,MCMC诊断只是 Bayesian workflow 的其中一环。\n","\n","当通过诊断不能发现问题,我们需要返回 
workflow中通过其他方法去发现潜在的问题。\n","\n","常见问题包括:\n","- 由于先验设置错误导致模型无法收敛。\n","- 变量单位不统一导致参数过大或过小。\n","- 数据量太少导致参数估计不准确。\n","- 模型设定错误,导致模型无法收敛。\n","- 编程代码错误,导致模型采样出现问题。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"F77016AFBEEE4165BFAD3D1DB8606F22","jupyter":{},"notebookId":"635b95ced773cf7314ca7a42","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["除了今天介绍的 MCMC 诊断方法和修复方法外,还有很多其他的方法,我们会在之后的实战中进行介绍。\n","\n","其他的诊断方法:\n","- Rank Plots \n","![Image Name](https://cdn.kesci.com/upload/image/rkina0ih68.png?imageView2/0/w/320/h/320)\n","- Parallel plots\n","- Separation plots\n","- Divergences (HMC特有的问题及诊断方法)\n","\n","source: https://arviz-devs.github.io/arviz/examples/index.html\n","\n","其他修复方法:\n","- 设置 `target_accept`。\n","- 参见 workflow中的设置先验,参数重整化,模型结构调整等。"]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.5.2"}},"nbformat":4,"nbformat_minor":0} diff --git a/Notebooks/Lecture9_update.ipynb b/Notebooks/Lecture9_update.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..1437f9a2fd583f970a0e188e1f354de27bf5f989 --- /dev/null +++ b/Notebooks/Lecture9_update.ipynb @@ -0,0 +1 @@ +{"cells":[{"cell_type":"markdown","metadata":{"collapsed":false,"id":"D89B0EAB542642F98900CCC4D4E49360","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["# Lecture 9: linear model and model diagnostics \n","\n","## Instructor: 胡传鹏(博士)[Dr. 
Hu Chuan-Peng]\n","\n","### 南京师范大学心理学院[School of Psychology, Nanjing Normal University]\n"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"D6CED1AC4FE848C38E69C72D1B2684FE","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["## Recap: PyMC3 & A Simple Bayesian Linear Regression Model\n","\n","* Probabilistic Programming Language & PyMC3\n","* A simple Bayesian linear regression model"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"38B25D21F42C45998BD273A3AD1F6A8E","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["假设我们将探究大学生的社交网站使用和创新自我效能感与个体创新行为的关系。\n","\n","`SNS_t`: 社交网站使用强度;\n","`CSES_t`: 创新自我效能感;\n","`EIB_t`: 创新行为\n","\n","数据来源:郑元瑞, 谢嘉敏, 李鹏. (2022) 社交网站使用强度对大学生创新行为的影响:一个有调节的中介模型. 心理技术与应用, 10(8), 483-491."]},{"cell_type":"code","execution_count":106,"metadata":{"collapsed":false,"id":"AFF5BF9CCB4342369F9B5E6E36A594D0","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["%matplotlib inline\n","import numpy as np \n","from scipy import stats\n","import matplotlib.pyplot as plt\n","import pandas as pd\n","import arviz as az\n","import pymc3 as pm\n","# mpl_toolkits.mplot3d是用于三维画图的,Axes3D用于画三维图\n","from mpl_toolkits.mplot3d import Axes3D"]},{"cell_type":"code","execution_count":107,"metadata":{"collapsed":false,"id":"9605159BD4814F2E8833134CA381C22E","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["np.random.seed(123) # 随机数种子,确保随后生成的随机数相同\n","data = pd.read_csv(\"/home/mw/input/data9464/clean.csv\") # 读取数据,该数据需要提前挂载\n","data['SNS_t'] = (data['SNS_t'] - data['SNS_t'].mean()) / data['SNS_t'].std() # 将变量进行标准化\n","data['CSES_t'] = (data['CSES_t'] - 
data['CSES_t'].mean()) / data['CSES_t'].std() # 将变量进行标准化\n","data['EIB_t'] = (data['EIB_t'] - data['EIB_t'].mean()) / data['EIB_t'].std() # 将变量进行标准化"]},{"cell_type":"code","execution_count":108,"metadata":{"collapsed":false,"id":"B83ADDBCBE4146269D21B23083461386","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stderr","output_type":"stream","text":["/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:16: MatplotlibDeprecationWarning: Axes3D(fig) adding itself to the figure is deprecated since 3.4. Pass the keyword argument auto_add_to_figure=False and use fig.add_axes(ax) to suppress this warning. The default value of auto_add_to_figure will change to False in mpl3.5 and True values will no longer work in 3.6. This is consistent with other Axes classes.\n"," app.launch_new_instance()\n"]},{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"},{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["#创建一个画布fig1,该画布有两幅图ax1,ax2,画布尺寸为(10,4)\n","fig1, ax1 = plt.subplots(1,2,figsize=(10,4))\n","\n","#在第一张图的第一个panel上画散点图\n","ax1[0].scatter(data['SNS_t'],data['EIB_t'] )\n","ax1[0].set_xlabel('SNS_t')\n","ax1[0].set_ylabel('EIB_t')\n","\n","#在第一张图的第二个panel上画散点图\n","ax1[1].scatter(data['CSES_t'],data['EIB_t'] )\n","ax1[1].set_xlabel('CSES_t')\n","ax1[1].set_ylabel('EIB_t')\n","\n","#在第二张图(三维空间)画散点图\n","fig2 = plt.figure(2)\n","ax2 = Axes3D(fig2)\n","ax2.scatter(data['SNS_t'],data['CSES_t'],data['EIB_t'] )\n","ax2.set_xlabel('SNS_t')\n","ax2.set_ylabel('CSES_t')\n","ax2.set_zlabel('EIB_t')\n","plt.show()"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"27D121C21C28458ABF0C226A81124513","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["在PyMC3中,一个简单的线性模型:\n","\n","1. 通过建立线性模型的概率表达形式来构建模型。\n","\n","2. 通过PyMC3对后验进行采样\n","\n","3. 
通过Arviz对结果进行展示,辅助统计推断"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"3BCD734597754FA295CB25508E37AB08","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["线性模型的一般形式:\n","\n","$y_i = a + b_1 x_1 + b_2 x_2 + \\epsilon$\n","\n","$y_i = \\hat{y} + \\epsilon$\n","\n","线性模型可以用概率的形式进行表达:\n","\n","$y_i \\sim Normal(\\mu_i, \\sigma)$ -> y $\\sim$ Normal(mu, sigma)\n","\n","$\\mu_i = \\alpha+\\beta_1 x_1 +\\beta_2 x_2$ -> mu = alpha + beta_1 * x_1 + beta_2 * x_2\n","\n","$\\alpha \\sim Normal(0,1)$ -> alpha $\\sim$ Normal(0, 1)\n","\n","$\\beta_{i} \\sim Normal(0,1)$ -> beta_i $\\sim$ Normal(0, 1)\n","\n","$\\sigma \\sim HalfNormal(1)$ -> sigma $\\sim$ HalfNormal(1)\n","\n","\n","\n"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"594A90953A08482E9017EC6EDAB47BB4","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 模型构建"]},{"cell_type":"code","execution_count":109,"metadata":{"collapsed":false,"id":"5AA60A18E48C496E86417AE5077DE23B","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["# 在pymc3中,pm.Model()定义了一个新的模型对象,这个对象是模型中随机变量的容器\n","# 在python中,容器是一种数据结构,是用来管理特殊数据的对象\n","# with语句定义了一个上下文管理器,以 linear_model 作为上下文,在这个上下文中定义的变量都被添加到这个模型\n","with pm.Model() as linear_model:\n","    # 先验分布: alpha, beta, sigma这三个参数是随机变量\n","    alpha = pm.Normal('alpha',mu=0,sd=1)\n","    beta = pm.Normal('beta',mu=0,sd=1, shape=2) # shape 用来表示同样的分布的参数的数量\n","    sigma = pm.HalfNormal('sigma',sd=1)\n","\n","    # x1,x2为自变量,是之前已经载入的数据\n","    x1 = pm.Data(\"x1\", data['SNS_t'])\n","    x2 = pm.Data(\"x2\", data['CSES_t'])\n","\n","    # 线性模型:mu是确定性随机变量,这个变量的值完全由右端值确定\n","    mu = pm.Deterministic(\"mu\", alpha + beta[0]*x1 + beta[1]*x2) \n","\n","    # Y的观测值,这是一个特殊的观测随机变量,表示模型数据的可能性。也可以表示模型的似然,通过 observed 
参数来告诉这个变量其值是已经被观测到了的,不会被拟合算法改变\n"," y_obs = pm.Normal('y_obs',mu=mu,sd=sigma,observed=data['EIB_t'] )\n"]},{"cell_type":"code","execution_count":110,"metadata":{"collapsed":false,"id":"BDBAE07F4A664DFB88CB24EA654DCEB2","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":[""],"text/plain":[""]},"execution_count":110,"metadata":{},"output_type":"execute_result"}],"source":["pm.model_to_graphviz(linear_model)"]},{"cell_type":"code","execution_count":111,"metadata":{"collapsed":false,"id":"8B541AFB45C04C4E8A3DDCB827C91A5B","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stderr","output_type":"stream","text":["Auto-assigning NUTS sampler...\n","Initializing NUTS using jitter+adapt_diag...\n","Multiprocess sampling (2 chains in 2 jobs)\n","NUTS: [sigma, beta, alpha]\n"]},{"data":{"text/html":["\n","\n"],"text/plain":[""]},"metadata":{},"output_type":"display_data"},{"name":"stderr","output_type":"stream","text":["Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 10 seconds.\n"]}],"source":["#采样过程仍在该容器中进行\n","with linear_model:\n"," # 使用mcmc方法进行采样,draws为采样次数,tune为调整采样策略的次数,这些次数将在采样结束后被丢弃,\n"," # target_accept为接受率, return_inferencedata=True为该函数返回的对象是arviz.InnferenceData对象\n"," # chains为我们采样的链数,cores为我们的调用的cpu数,多个链可以在多个cpu中并行计算,我们在和鲸中调用的cpu数为2\n"," trace = pm.sample(draws = 2000, tune=1000, target_accept=0.9,chains=2, cores= 2,return_inferencedata=True)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"086FCE65863642818F65F37CC1C2ED9A","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 
参数的后验分布\n","这里的模型分析结果展示了各参数的分布(后验)情况"]},{"cell_type":"code","execution_count":112,"metadata":{"collapsed":false,"id":"574470394C65497C8E033E23521352D4","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/plain":["array([[,\n"," ],\n"," [,\n"," ],\n"," [,\n"," ]], dtype=object)"]},"execution_count":112,"metadata":{},"output_type":"execute_result"},{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["# 绘制特定参数的采样情况,选取对象为trace,选取其中'alpha','beta','sigma'三个参数\n","az.plot_trace(trace,var_names=['alpha','beta','sigma'])"]},{"cell_type":"code","execution_count":113,"metadata":{"collapsed":false,"id":"CE023B55ABD54BF986FE8110902952C6","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/plain":["array([,\n"," ], dtype=object)"]},"execution_count":113,"metadata":{},"output_type":"execute_result"},{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["# 绘制特定参数的森林图,选取对象为trace,选取其中'alpha','beta','sigma'三个参数\n","az.plot_forest(trace,var_names=['alpha','beta','sigma'],r_hat=True)"]},{"cell_type":"code","execution_count":114,"metadata":{"collapsed":false,"id":"4048236000BE46C09AE141427572D4F2","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stdout","output_type":"stream","text":["beta0大于0的概率为0.99625\n","beta1大于0的概率为0.99525\n"]},{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["# 参数的后验分布图,选取对象为trace,选取其中'alpha', 'beta', 'sigma'三个参数\n","az.plot_posterior(trace,var_names=['alpha','beta','sigma'])\n","# beta1和beta2的后验分布值大于0的比例,若大于95%则可认为对应的自变量对因变量有影响\n","print(f\"beta0大于0的概率为{(trace.posterior.beta[0] > 0).mean().values}\")\n","print(f\"beta1大于0的概率为{(trace.posterior.beta[1] > 0).mean().values}\")"]},{"cell_type":"code","execution_count":115,"metadata":{"collapsed":false,"id":"4D880D7A54474800B6644B5D3BA88FFB","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":["
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
meansdhdi_3%hdi_97%mcse_meanmcse_sdess_bulkess_tailr_hat
alpha-0.0000.019-0.0370.0360.00.03917.02924.01.0
beta[0]0.0490.0200.0110.0860.00.02927.02456.01.0
beta[1]0.7750.0200.7360.8130.00.03210.02733.01.0
sigma0.6090.0140.5830.6340.00.03691.03031.01.0
\n","
"],"text/plain":[" mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk \\\n","alpha -0.000 0.019 -0.037 0.036 0.0 0.0 3917.0 \n","beta[0] 0.049 0.020 0.011 0.086 0.0 0.0 2927.0 \n","beta[1] 0.775 0.020 0.736 0.813 0.0 0.0 3210.0 \n","sigma 0.609 0.014 0.583 0.634 0.0 0.0 3691.0 \n","\n"," ess_tail r_hat \n","alpha 2924.0 1.0 \n","beta[0] 2456.0 1.0 \n","beta[1] 2733.0 1.0 \n","sigma 3031.0 1.0 "]},"execution_count":115,"metadata":{},"output_type":"execute_result"}],"source":["# 参数的后验分布图,选取对象为trace,选取其中'alpha','beta','sigma'三个参数\n","az.summary(trace,var_names=['alpha','beta','sigma'])"]},{"cell_type":"code","execution_count":116,"metadata":{"collapsed":false,"id":"01FB7D4CCD1F46A69B2A681269BEE308","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":["\n","\n"],"text/plain":[""]},"metadata":{},"output_type":"display_data"},{"data":{"text/html":["\n","
\n"," \n"," 100.00% [4000/4000 00:06<00:00]\n","
\n"," "],"text/plain":[""]},"metadata":{},"output_type":"update_display_data"},{"name":"stderr","output_type":"stream","text":["/opt/conda/lib/python3.7/site-packages/arviz/data/io_pymc3.py:100: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context.\n"," FutureWarning,\n"]}],"source":["# 后验预测分布的计算仍在容器中进行\n","with linear_model:\n"," # pm.sample_posterior_predictive()利用trace.posterior的后验分布计算后验预测分布\n"," ppc_y = pm.sample_posterior_predictive(trace.posterior) \n","#将ppc_y转化为InferenceData对象合并到trace中\n","az.concat(trace, az.from_pymc3(posterior_predictive=ppc_y), inplace=True)"]},{"cell_type":"code","execution_count":117,"metadata":{"collapsed":false,"id":"5D25E7EC4D824B2F97EA73538D1CD1EA","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/plain":[""]},"execution_count":117,"metadata":{},"output_type":"execute_result"},{"name":"stderr","output_type":"stream","text":["/opt/conda/lib/python3.7/site-packages/IPython/core/events.py:89: UserWarning: Creating legend with loc=\"best\" can be slow with large amounts of data.\n"," func(*args, **kwargs)\n"]},{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["# 绘制后验预测分布\n","az.plot_ppc(trace)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"1958F26F212A4B50907AC7C4159FB897","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["## Part 1: Workflow"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"FA6A7C7B83104BC48F42F8CE727E7454","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["## 从理论到问题解决\n","\n","## $p(\\theta|data) = \\frac{p(data|\\theta)p(\\theta)}{{p(data)}}$\n","\n","贝叶斯定理只是告诉我们一个解决问题框架,在实践中,我们需要一系列流程来帮助我们解决问题。\n","\n","这一系列的流程,就是贝叶斯的workflow"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"B608751E5EB2494587F7A7AD5B4C6040","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 什么是workflow\n","\n","贝叶斯推理是对条件概率或者概率密度的计算。\n","\n","贝叶斯的workflow除了计算模型的后验分布外还包括研究问题,选择模型,选择先验,模型拟合,采样过程评估,模型诊断,模型比较,统计推断,结果报告。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"E67B1D69223C4EC8ACC5F2A4563ADAB2","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 为什么需要workflow?\n","\n","一个完整的workflow是为了更好地解决问题。\n","\n","贝叶斯推断本身就是很复杂的推断过程,这一推断过程涉及到一系列的步骤。\n","\n","如果忽视这个过程中的一些环节,在推断的过程中就会出现遗漏,会对结果产生影响。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"86191C6973D847F18123084960BA8937","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### Workflow能带给我们什么\n","\n","\n","![Image Name](https://cdn.kesci.com/upload/image/rklz1ernq7.jpg?imageView2/0/w/960/h/960)\n","\n","\n","\n","* 贝叶斯推断中,我们在探索未知的参数空间,从数据到最终的结论,中间需要经过多个步骤,而每一步骤都有可能会出错;\n","* 
随着过去十年中贝叶斯推断的发展,研究者发现使用标准化的workflow能更好地保证我们进行高质量的探索。\n","\n"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"A8A55F55BB9048D78F598358F70E899F","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### Workflow包括什么步骤?"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"A468CCDE3A22462A9C6851929B064811","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["![Image Name](https://cdn.kesci.com/upload/image/rki1v9nkk5.png?imageView2/0/w/960/h/960)\n","\n","Gelman et al. (2020) 提出了一个比较完整的workflow,展示了模型分析过程中可能出现的步骤,出现的问题和相应的调整方法。\n","\n","Gelman等的workflow会涉及到对计算的检查,会通过一系列的方法来验证计算过程是否是有效的,比如模拟假数据。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"B6D377C2484649518912D85F50ED6378","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["Martin等(2021)在《Bayesian Modeling and Computation in Python》一书中也提出了其workflow。\n","\n","基于以上两个workflow,我们在本课程中采用一个更加简化的workflow,具体包括如下几个步骤:\n","\n","研究问题、选择模型、选择先验、模型拟合、采样过程评估、模型诊断、模型比较、统计推断、结果报告。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"5CE0698F4DA0446DABC37AAA63C7325B","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["\n","![Image Name](https://cdn.kesci.com/upload/image/rkm3pw954u.png?imageView2/0/w/960/h/960)\n"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"9E0AFEC0AD4A44858B7E5BB9C68D2560","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### (1)提出研究问题(driving question)\n","\n","任何一个数据分析都是为了解决一个研究问题,可能是理解的问题,可能是现实的问题,比如:\n","* 做出笑的表情是否会让人更高兴?\n","* 自我相关的信息是否会被优先处理?\n","* 团体辅导是否会是比较有效的干预手段?\n","* 广告的投放是否增加了销量?\n","* 
...\n","\n","从问题中提炼出统计问题,并收集能够(在某种程度上)检验和回答这些问题的数据,需要的是相关领域的专业知识。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"126E4D050D63401AB064A5C7592B4489","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["例子:\n","\n","研究问题:社交网站的使用是否与个体的创新行为有关系?可能的影响因素是什么?"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"F9C9E7C91CDB4B809484EB372103386B","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### (2)数据收集\n","略\n","\n","\n","例子:\n","\n","采用问卷测量大学生三种心理特质:社交网站使用、创新自我效能感、创新行为。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"A13FECCB2B3E4865959B968D142672D8","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### (3)选择模型\n","\n","根据对研究问题的背景知识和数据的特点,选择合适的模型来对数据进行解释。\n","\n","模型的选择并非一蹴而就,而是需要反复迭代。\n","\n","Workflow的许多步骤即是为了对模型进行诊断和调整。\n","\n","在贝叶斯workflow里,我们需要用概率的形式对模型进行表达。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"51B92657AB92490C801BBD6F83E04231","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["例子:\n","\n","研究问题中涉及到一个因变量和两个自变量,我们可以构建最简单的线性模型。\n","\n","线性模型的一般形式:\n","\n","$\\hat{y} = a + b_1 x_1 + b_2 x_2$\n","\n","$y_i = \\hat{y} + \\epsilon$\n","\n","线性模型可以用概率的形式进行表达:\n","\n","$\\mu_i = \\alpha+\\beta_1 x_1 +\\beta_2 x_2$\n","\n","$y_i \\sim Normal(\\mu_i, \\sigma)$ \n","\n","这就是模型的似然。\n","\n","模型涉及到4个需要估计的参数:$\\alpha$、$\\beta_1$、$\\beta_2$、$\\sigma$。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"5E98B54E6D0B42E481E0C244A5CB6CBD","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### (4)选择先验 \n","\n","$p(\\theta|data) \\propto 
p(data|\\theta)p(\\theta)$\n","\n","在贝叶斯推断中,模型的似然函数与参数的先验分布是进行推断的前提。\n","\n","* Lecture #5 中:先验会影响后验分布\n","* 先验分布的选择需要领域的相关知识;\n","* 先验的设定是可以调整的(例如,通过先验预测检查,prior predictive check)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"83A37C0E974A4F169484E7875019B096","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["例子:\n","\n","社交网站使用、创新自我效能感与创新行为的简单回归模型涉及到四个参数。\n","\n","由于所有的数据均已经进行标准化($Z$值),所以我们可以合理地认为其回归模型的参数也服从标准正态:\n","\n","$\\alpha \\sim Normal(0,1)$ \n","\n","$\\beta_{i} \\sim Normal(0,1)$ \n","\n","$\\sigma \\sim HalfNormal(1)$ \n"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"53A6B7B674E1490B8BC6D1B6702AC418","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["##### (4.1)先验预测检验(Prior Predictive Check)\n","\n","先验的设定是否合理?\n","\n","可以通过先验预测检验(Prior Predictive Check)来进行初步的判断。\n","\n","先验预测检验:利用模型和先验生成假数据,并利用这些假数据来评估先验是否合理的过程。"]},{"cell_type":"code","execution_count":118,"metadata":{"collapsed":false,"id":"29A45A9370D245DB9C39F20B569C34DC","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["with pm.Model() as linear_model:\n"," # 先验分布: alpha, beta, sigma这三个参数是随机变量\n"," alpha = pm.Normal('alpha',mu=0,sd=1)\n"," beta = pm.Normal('beta',mu=0,sd=1, shape=2) # shape 用来表示同样的分布的参数的数量\n"," sigma = pm.HalfNormal('sigma',sd=1)\n"," \n"," # 先验预测检查\n"," prior_checks = 
pm.sample_prior_predictive(samples=50)"]},{"cell_type":"code","execution_count":119,"metadata":{"collapsed":false,"id":"30296BFEEE5645A28874073EA0A0C4C1","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stderr","output_type":"stream","text":["/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2: MatplotlibDeprecationWarning: Axes3D(fig) adding itself to the figure is deprecated since 3.4. Pass the keyword argument auto_add_to_figure=False and use fig.add_axes(ax) to suppress this warning. The default value of auto_add_to_figure will change to False in mpl3.5 and True values will no longer work in 3.6. This is consistent with other Axes classes.\n"," \n"]},{"data":{"text/html":[""],"text/plain":["
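作为补充,下面用 numpy 直接模拟先验预测检验的思路(示意性草图:从先验中抽取参数、生成假数据、检查因变量的取值范围;`prior_predictive`、`n_draws` 等名称均为演示所设,并非 PyMC3 的接口):

```python
import numpy as np

rng = np.random.default_rng(123)

def prior_predictive(sd_prior, n_draws=1000, n_obs=50):
    # 从先验中抽取 (alpha, beta1, beta2, sigma)
    alpha = rng.normal(0, sd_prior, n_draws)
    beta = rng.normal(0, sd_prior, (n_draws, 2))
    sigma = np.abs(rng.normal(0, 1, n_draws))   # HalfNormal(1) 可视为正态的绝对值
    # 标准化后的自变量大致分布在 -3 到 3 之间
    x1 = rng.normal(0, 1, n_obs)
    x2 = rng.normal(0, 1, n_obs)
    mu = alpha[:, None] + beta[:, [0]] * x1 + beta[:, [1]] * x2
    return mu + rng.normal(0, 1, (n_draws, n_obs)) * sigma[:, None]

y_narrow = prior_predictive(sd_prior=1)    # 对应 Normal(0, 1) 先验
y_wide = prior_predictive(sd_prior=10)     # 对应 Normal(0, 10) 先验

print(np.percentile(y_narrow, [1, 99]))    # 与标准化数据的合理范围相近
print(np.percentile(y_wide, [1, 99]))      # 远超合理范围,提示先验过宽
```

这正是下文"窄先验可接受、宽先验需要调整"结论背后的数值直觉。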
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["fig = plt.figure()\n","ax = Axes3D(fig) # 将画布对象改为3D对象\n","x1 = np.linspace(-3, 3, 50) # 在-3到3之间生成50个等距的假数据\n","x2 = np.linspace(-3, 3, 50)\n","\n","for a, b in zip(prior_checks[\"alpha\"], prior_checks[\"beta\"]):\n"," y = a + b[0] * x1 + b[1]*x2 # 基于假数据生成预测值\n"," ax.plot(x1, x2, y)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"3EDECF0064FB45EC80391254620FED04","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["我们可以看到,基于参数的先验生成的因变量的取值范围在-4到6之间,与真实情况相差不大,可以接受。\n","\n","若我们改变一下先验,则会看到因变量取值范围在-100到100之间,这与真实情况相差较大,故需要调整。"]},{"cell_type":"code","execution_count":120,"metadata":{"collapsed":false,"id":"B525EE2480444110805F0EE804148DDD","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["with pm.Model() as linear_model:\n"," # 先验分布: alpha, beta, sigma这三个参数是随机变量\n"," alpha = pm.Normal('alpha', mu=0, sd=10)\n"," beta = pm.Normal('beta', mu=0, sd=10, shape=2) # shape 用来表示同样的分布的参数的数量\n"," sigma = pm.HalfNormal('sigma', sd=1)\n"," \n"," # 先验预测检查\n"," prior_checks = pm.sample_prior_predictive(samples=50)"]},{"cell_type":"code","execution_count":121,"metadata":{"collapsed":false,"id":"D351E620DF2C472D8F4A1C77757BDEA3","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stderr","output_type":"stream","text":["/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2: MatplotlibDeprecationWarning: Axes3D(fig) adding itself to the figure is deprecated since 3.4. Pass the keyword argument auto_add_to_figure=False and use fig.add_axes(ax) to suppress this warning. The default value of auto_add_to_figure will change to False in mpl3.5 and True values will no longer work in 3.6. 
This is consistent with other Axes classes.\n"," \n"]},{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["fig = plt.figure()\n","ax = Axes3D(fig) # 将画布对象改为3D对象\n","x1 = np.linspace(-3, 3, 50) # 在-3到3之间生成50个等距的假数据\n","x2 = np.linspace(-3, 3, 50)\n","\n","for a, b in zip(prior_checks[\"alpha\"], prior_checks[\"beta\"]):\n"," y = a + b[0] * x1 + b[1]*x2 # 基于假数据生成预测值\n"," ax.plot(x1,x2,y)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"F1D9765D062A49C991A5161E94AF7852","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### (5)拟合数据\n","\n","设置好模型及其先验后,即可对数据进行拟合,从而获得关于参数的后验。\n","\n","$y_i \\sim Normal(\\mu_i, \\sigma)$ -> y $\\sim$ Normal(mu, sigma)\n","\n","$\\mu_i = \\alpha + \\beta_1 x_{1i} + \\beta_2 x_{2i}$ -> mu = alpha + beta[0]*x1 + beta[1]*x2\n","\n","$\\alpha \\sim Normal(0,1)$ -> alpha $\\sim$ Normal(mu=0, sigma=1)\n","\n","$\\beta_{i} \\sim Normal(0,1)$ -> beta $\\sim$ Normal(mu=0, sigma=1)\n","\n","$\\sigma \\sim HalfNormal(1)$ -> sigma $\\sim$ HalfNormal(1)\n"]},{"cell_type":"code","execution_count":122,"metadata":{"collapsed":false,"id":"5044BDED17C449FD86EF96403AE20133","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["#首先,载入pymc3这个包,将其命名为pm\n","import pymc3 as pm\n","# 在pymc3中,pm.Model()定义了一个新的模型对象,这个对象是模型中随机变量的容器\n","# 在python中,容器是一种数据结构,是用来管理特殊数据的对象\n","# with语句定义了一个上下文管理器,以 linear_model 作为上下文,在这个上下文中定义的变量都被添加到这个模型\n","with pm.Model() as linear_model:\n"," # 先验分布: alpha, beta, sigma这三个参数是随机变量\n"," alpha = pm.Normal('alpha', mu=0, sd=1)\n"," beta = pm.Normal('beta', mu=0,sd=1, shape=2) # shape 用来表示同样的分布的参数的数量\n"," sigma = pm.HalfNormal('sigma', sd=1)\n","\n"," # x1,x2为自变量,是之前已经载入的数据\n"," x1 = pm.Data(\"x1\", data['SNS_t'])\n"," x2 = pm.Data(\"x2\", data['CSES_t'])\n","\n"," # 线性模型:mu是确定性随机变量,这个变量的值完全由右端值确定\n"," mu = pm.Deterministic(\"mu\", alpha + beta[0]*x1 + 
beta[1]*x2) \n","\n"," # Y的观测值,这是一个特殊的观测随机变量,表示模型的似然;通过 observed 参数告诉这个变量其值是已经被观测到的,不会被拟合算法改变\n"," y_obs = pm.Normal('y_obs', mu=mu, sd=sigma, observed=data['EIB_t'])"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"6E47BD0579234CCC934B06DFAD1A0E06","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### (6)评估/诊断MCMC\n","\n","如果使用MCMC对后验进行近似,则需要首先对MCMC过程进行评估:\n","\n","* 是否收敛;\n","* 是否接近真实的后验。\n","\n","对采样过程的评估,我们会采用目视检查或R-hat(rhat)指标。"]},{"cell_type":"code","execution_count":123,"metadata":{"collapsed":false,"id":"C76B6F098EAC4D62BF5D505636AA6F1B","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stderr","output_type":"stream","text":["Auto-assigning NUTS sampler...\n","Initializing NUTS using jitter+adapt_diag...\n","Multiprocess sampling (2 chains in 2 jobs)\n","NUTS: [sigma, beta, alpha]\n"]},{"data":{"text/html":["\n","\n"],"text/plain":[""]},"metadata":{},"output_type":"display_data"},{"name":"stderr","output_type":"stream","text":["Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 5 seconds.\n"]}],"source":["#采样过程仍在该容器中进行\n","with linear_model :\n"," # 使用mcmc方法进行采样,draws为采样次数,tune为调整采样策略的次数,这些次数将在采样结束后被丢弃,\n"," # target_accept为接受率, return_inferencedata=True表示该函数返回arviz.InferenceData对象\n"," # chains为我们采样的链数,cores为我们调用的cpu数,多个链可以在多个cpu中并行计算,我们在和鲸中调用的cpu数为2\n"," trace = pm.sample(draws = 2000, tune=1000, target_accept=0.9,chains=2, cores= 2,return_inferencedata=True)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"C1FD8EC84AB54403818131D461CFA822","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 
(7)模型诊断\n","\n","在MCMC有效的前提下,需要继续检验模型是否能够较好地拟合数据。\n","\n","我们会使用后验预测分布通过我们得到的参数生成一批模拟数据,并将其与真实数据进行对比。\n"]},{"cell_type":"code","execution_count":124,"metadata":{"collapsed":false,"id":"804F825F72A247E08A4BCBF28EB3F486","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":["\n","\n"],"text/plain":[""]},"metadata":{},"output_type":"display_data"},{"data":{"text/html":["\n","
\n"," \n"," 100.00% [4000/4000 00:06<00:00]\n","
\n"," "],"text/plain":[""]},"metadata":{},"output_type":"update_display_data"},{"name":"stderr","output_type":"stream","text":["/opt/conda/lib/python3.7/site-packages/arviz/data/io_pymc3.py:100: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context.\n"," FutureWarning,\n"]}],"source":["# 后验预测分布的计算仍在容器中进行\n","with linear_model:\n"," # pm.sample_posterior_predictive()利用trace.posterior的后验分布计算后验预测分布\n"," ppc_y = pm.sample_posterior_predictive(trace.posterior) \n","#将ppc_y转化为InferenceData对象合并到trace中\n","az.concat(trace, az.from_pymc3(posterior_predictive=ppc_y), inplace=True)"]},{"cell_type":"code","execution_count":125,"metadata":{"collapsed":false,"id":"D64E39385D5F4FA287F0F18BEBDCFF51","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/plain":[""]},"execution_count":125,"metadata":{},"output_type":"execute_result"},{"name":"stderr","output_type":"stream","text":["/opt/conda/lib/python3.7/site-packages/IPython/core/events.py:89: UserWarning: Creating legend with loc=\"best\" can be slow with large amounts of data.\n"," func(*args, **kwargs)\n"]},{"data":{"text/html":[""],"text/plain":["
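后验预测检验的核心逻辑可以用 numpy 勾勒如下。注意这是示意性草图:`y_obs`、`mu_post`、`sigma_post` 均为模拟值,实际分析中后验样本应取自 trace.posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

# 假想的观测数据与参数"后验样本"(实际分析中应取自 MCMC 的 trace.posterior)
y_obs = rng.normal(0, 1, 200)
mu_post = rng.normal(0.0, 0.05, 500)
sigma_post = np.abs(rng.normal(1.0, 0.05, 500))

# 对每组后验参数生成一批与观测同样大小的重复数据 y_rep
y_rep = mu_post[:, None] + sigma_post[:, None] * rng.normal(0, 1, (500, y_obs.size))

# 后验预测 p 值:重复数据的统计量大于观测统计量的比例;
# 接近 0 或 1 说明模型难以复现该统计量,0.5 附近说明拟合良好
p_mean = (y_rep.mean(axis=1) > y_obs.mean()).mean()
p_sd = (y_rep.std(axis=1) > y_obs.std()).mean()
print(p_mean, p_sd)
```

az.plot_ppc 所画的正是 y_rep 各批次的分布与观测数据分布的叠加对比。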
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["# 绘制后验预测分布\n","az.plot_ppc(trace)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"DF6ABD4C903843E58BCAA47C9F462669","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### (8)模型比较\n","\n","到目前为止,我们已经使用后验预测检查来独立评估已有的模型。\n","\n","这种类型的评估对于单独理解每个模型很有用。然而,当我们有多个模型时,这就引出了模型相对于彼此的性能如何的问题。\n","\n","模型比较可以进一步帮助我们了解一个模型在哪些区域可能表现良好。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"D6227F196049426D8D1CD781A1E5F577","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["首先,我们构建一个新的模型,假定该模型中其他条件均不变,观测值服从gumbel分布"]},{"cell_type":"code","execution_count":126,"metadata":{"collapsed":false,"id":"59E74B06B3FA45B6ADE05C127090DF41","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["with pm.Model() as linear_model_2:\n"," # 先验分布: alpha, beta, sigma这三个参数是随机变量\n"," alpha = pm.Normal('alpha',mu=0,sd=1)\n"," beta = pm.Normal('beta',mu=0,sd=1, shape=2) # shape 用来表示同样的分布的参数的数量\n"," sigma = pm.HalfNormal('sigma',sd=1)\n"," #先验预测检查\n"," prior_checks = pm.sample_prior_predictive(samples=50)"]},{"cell_type":"code","execution_count":127,"metadata":{"collapsed":false,"id":"92201AFA5EC745EF85B6C9586430E4EF","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["# 在pymc3中,pm.Model()定义了一个新的模型对象,这个对象是模型中随机变量的容器\n","# 在python中,容器是一种数据结构,是用来管理特殊数据的对象\n","# with语句定义了一个上下文管理器,以 linear_model 作为上下文,在这个上下文中定义的变量都被添加到这个模型\n","with pm.Model() as linear_model_2:\n"," # 先验分布: alpha, beta, sigma这三个参数是随机变量\n"," alpha = pm.Normal('alpha',mu=0,sd=1)\n"," beta = pm.Normal('beta',mu=0,sd=1, shape=2) # shape 用来表示同样的分布的参数的数量\n"," 
sigma = pm.HalfNormal('sigma',sd=1)\n","\n"," # x1,x2为自变量,是之前已经载入的数据\n"," x1 = pm.Data(\"x1\", data['SNS_t'])\n"," x2 = pm.Data(\"x2\", data['CSES_t'])\n","\n"," # 线性模型:mu是确定性随机变量,这个变量的值完全由右端值确定\n"," mu = pm.Deterministic(\"mu\", alpha + beta[0]*x1 + beta[1]*x2) \n","\n"," # Y的观测值:按照前文的设定,该模型的似然改用Gumbel分布(位置参数mu,尺度参数beta);observed 参数表示其值已被观测到,不会被拟合算法改变\n"," y_obs = pm.Gumbel('y_obs', mu=mu, beta=sigma, observed=data['EIB_t'])"]},{"cell_type":"code","execution_count":128,"metadata":{"collapsed":false,"id":"A589CB367D3B4836A3241805C1146F01","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":true,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stderr","output_type":"stream","text":["Auto-assigning NUTS sampler...\n","Initializing NUTS using jitter+adapt_diag...\n","Multiprocess sampling (2 chains in 2 jobs)\n","NUTS: [sigma, beta, alpha]\n"]},{"data":{"text/html":["\n","\n"],"text/plain":[""]},"metadata":{},"output_type":"display_data"},{"name":"stderr","output_type":"stream","text":["Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 6 seconds.\n"]}],"source":["#采样过程仍在该容器中进行\n","with linear_model_2 :\n"," # 使用mcmc方法进行采样,draws为采样次数,tune为调整采样策略的次数,这些次数将在采样结束后被丢弃,\n"," # target_accept为接受率, return_inferencedata=True表示该函数返回arviz.InferenceData对象\n"," # chains为我们采样的链数,cores为我们调用的cpu数,多个链可以在多个cpu中并行计算,我们在和鲸中调用的cpu数为2\n"," trace2 = pm.sample(draws = 2000, tune=1000, target_accept=0.9,chains=2, cores= 2,return_inferencedata=True)"]},{"cell_type":"code","execution_count":129,"metadata":{"collapsed":false,"id":"20865B22396941C9BE7D8F34CF5FED52","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":["\n","\n"],"text/plain":[""]},"metadata":{},"output_type":"display_data"},{"data":{"text/html":["\n","
\n"," \n"," 100.00% [4000/4000 00:06<00:00]\n","
\n"," "],"text/plain":[""]},"metadata":{},"output_type":"update_display_data"},{"name":"stderr","output_type":"stream","text":["/opt/conda/lib/python3.7/site-packages/arviz/data/io_pymc3.py:100: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context.\n"," FutureWarning,\n"]}],"source":["# 后验预测分布的计算仍在容器中进行\n","with linear_model_2:\n"," # pm.sample_posterior_predictive()利用trace.posterior的后验分布计算后验预测分布\n"," ppc_y = pm.sample_posterior_predictive(trace2.posterior) \n","#将ppc_y转化为InferenceData对象合并到trace中\n","az.concat(trace2, az.from_pymc3(posterior_predictive=ppc_y), inplace=True)"]},{"cell_type":"code","execution_count":130,"metadata":{"collapsed":false,"id":"587EA94AF48A4E0A9B40105776BF61BD","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/plain":[""]},"execution_count":130,"metadata":{},"output_type":"execute_result"},{"name":"stderr","output_type":"stream","text":["/opt/conda/lib/python3.7/site-packages/IPython/core/events.py:89: UserWarning: Creating legend with loc=\"best\" can be slow with large amounts of data.\n"," func(*args, **kwargs)\n"]},{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["# 绘制后验预测分布\n","az.plot_ppc(trace2)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"A4C90CCF7AC444F68B77371040279A81","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["对于两个模型哪个更好,仅从视觉检查来看似乎不能直接看出区别。我们可以使用 ArviZ 中的模型比较方法来定量检验这一观察。"]},{"cell_type":"code","execution_count":131,"metadata":{"collapsed":false,"id":"ED3A3BAB4A4946B68344345612128C90","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stderr","output_type":"stream","text":["/opt/conda/lib/python3.7/site-packages/arviz/stats/stats.py:146: UserWarning: The default method used to estimate the weights for each model,has changed from BB-pseudo-BMA to stacking\n"," \"The default method used to estimate the weights for each model,\"\n"]},{"data":{"text/html":["
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
rankloop_lood_looweightsedsewarningloo_scale
normal0-938.7138875.3035370.0000001.000000e+0030.3788140.000000Falselog
gumbel1-938.7357935.3324690.0219061.110223e-1630.3853010.027451Falselog
\n","
"],"text/plain":[" rank loo p_loo d_loo weight se \\\n","normal 0 -938.713887 5.303537 0.000000 1.000000e+00 30.378814 \n","gumbel 1 -938.735793 5.332469 0.021906 1.110223e-16 30.385301 \n","\n"," dse warning loo_scale \n","normal 0.000000 False log \n","gumbel 0.027451 False log "]},"execution_count":131,"metadata":{},"output_type":"execute_result"}],"source":["compare_dict = {\"normal\": trace,\"gumbel\": trace2}\n","comp = az.compare(compare_dict, ic=\"loo\")\n","comp"]},{"cell_type":"code","execution_count":132,"metadata":{"collapsed":false,"id":"A73CC6CFA6AE45D19D2C5C6CBE6E5B1E","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/plain":[""]},"execution_count":132,"metadata":{},"output_type":"execute_result"},{"data":{"text/html":[""],"text/plain":["
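上表中 loo、d_loo、dse 等列的含义可以用 numpy 示意如下(假设性的演示:`elpd_i_m1`、`elpd_i_m2` 为模拟的逐点期望对数预测密度,并非真实的 LOO 计算结果):

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs = 300

# 模拟两个模型在每个观测点上的期望对数预测密度 elpd_i(实际来自 LOO 计算)
elpd_i_m1 = rng.normal(-3.1, 0.5, n_obs)
elpd_i_m2 = elpd_i_m1 + rng.normal(-0.01, 0.05, n_obs)   # 模型2 平均略差

# loo 列:各模型逐点 elpd 之和(log 尺度,越高越好)
loo_m1, loo_m2 = elpd_i_m1.sum(), elpd_i_m2.sum()

# d_loo:两模型 elpd 之差;dse:逐点差值之和的标准误
diff_i = elpd_i_m1 - elpd_i_m2
d_loo = diff_i.sum()
dse = np.sqrt(n_obs) * diff_i.std(ddof=1)
print(loo_m1, loo_m2, d_loo, dse)
```

当 d_loo 远小于 dse 时(如上表中 0.02 对 0.03),两个模型的预测性能没有可区分的差异。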
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["az.plot_compare(comp)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"94C1B94B3EDB4FD49520D031B24FA8D7","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["上表的loo为log尺度,loo值越高(rank越靠前)的模型预测性能越好,我们一般选择它作为最佳模型;除了loo外还有AIC、BIC等指标,我们会在以后的课程中给大家讲解。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"16845DCCB046485C894289D6EEB1D028","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### (9)统计推断"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"7FDDA4F5AF2F4DB88DE828B84558DC09","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["在选择最佳模型之后,我们就会根据最佳模型进行统计推断。\n","\n"]},{"cell_type":"code","execution_count":133,"metadata":{"collapsed":false,"id":"D81CEF1C4A87484A870F070B02775D05","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stdout","output_type":"stream","text":["beta0大于0的概率为0.9975\n","beta1大于0的概率为0.99525\n"]},{"data":{"text/html":[""],"text/plain":["
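上面打印的"beta 大于 0 的概率"本质上就是后验样本中正值所占的比例,可以用 numpy 直接演示(示意:`beta_samples` 为模拟的后验样本,并非来自真实的 trace):

```python
import numpy as np

rng = np.random.default_rng(2)

# 模拟一个集中在正值区域的后验样本(实际分析中应取自 trace.posterior 中的 beta)
beta_samples = rng.normal(0.3, 0.1, 4000)

# "beta 大于 0 的概率"即后验样本中正值所占的比例
p_positive = (beta_samples > 0).mean()

# 辅以后验均值和 94% 等尾区间,可以更完整地描述后验
lower, upper = np.percentile(beta_samples, [3, 97])
print(p_positive, beta_samples.mean(), lower, upper)
```

当该比例远大于 95% 且区间不覆盖 0 时,我们才有较强的证据认为自变量对因变量有正向影响。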
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["# 参数的后验分布图,选取对象为trace,选取其中'alpha','beta','sigma'三个参数\n","az.plot_posterior(trace,var_names=['alpha','beta','sigma'])\n","# 两个回归系数的后验样本大于0的比例,若大于95%则可认为对应的自变量对因变量有影响\n","# 注意:beta的维度为(chain, draw, 2),用[..., i]选取第i个回归系数\n","print(f\"beta0大于0的概率为{(trace.posterior['beta'][..., 0] > 0).mean().values}\")\n","print(f\"beta1大于0的概率为{(trace.posterior['beta'][..., 1] > 0).mean().values}\")"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"C854CF377ADF43DA8C79BCF8030A6FAF","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["由于两个回归系数的采样值有超过99%都大于0,于是我们可以认为这两个自变量对因变量产生了影响。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"18EE46DC040E4DB68495C62825F21E8B","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### (10)报告结果"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"F887E783CEC9450B8A2C327D9526271D","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["结果的报告需要视对象而不同:\n","* 专业的读者(向学术期刊投稿)\n","* 大领域同行(学术会议上的报告)\n","* 非同行的其他研究者\n","* 非研究者(科普)\n","* 决策者"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"96A6C7070FAE422AB5F82CCDD17F75CC","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["# Practice\n"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"7C05B198BBC449B888819BE0C3BFC025","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["## 
Workflow"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"CA4D8ACF66FA4B41B8670988B0E2222B","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 研究问题"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"41F7BA7522764747AC1D5483CF3D3577","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["还有研究发现个体创新行为可能与自尊水平有关。\n","SES_t:自尊水平;\n","EIB_t:创新行为。\n","\n","试探究两者之间的关系。"]},{"cell_type":"code","execution_count":134,"metadata":{"collapsed":false,"id":"3E1441473A0044AB82E2AD07EECF6CD6","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/plain":["Text(0, 0.5, 'EIB_t')"]},"execution_count":134,"metadata":{},"output_type":"execute_result"},{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["data['SES_t'] = (data['SES_t'] - data['SES_t'].mean()) / data['SES_t'].std() # 将变量进行标准化\n","\n","data['EIB_t'] = (data['EIB_t'] - data['EIB_t'].mean()) / data['EIB_t'].std() # 将变量进行标准化\n","\n","plt.scatter(data['SES_t'],data['EIB_t'])\n","plt.xlabel('SES_t')\n","plt.ylabel('EIB_t')"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"DCC7B6BBFB09450F80088400F0F96F2C","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 选择模型"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"BB2E6CBC0F7A48DEBD859F368A466CA7","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["尝试构建一个简单的线性模型。\n","\n","线性模型可以用概率的形式进行表达:\n","\n","\n","$\\alpha \\sim Normal(0,1)$ -> alpha $\\sim$ Normal(mu=0, sigma=1)\n","\n","$\\beta \\sim Normal(0,1)$ -> beta $\\sim$ Normal(mu=0, sigma=1)\n","\n","$\\sigma \\sim HalfNormal(1)$ -> sigma $\\sim$ HalfNormal(1)\n","\n","$\\mu_i = \\alpha + \\beta x_i$ -> mu = alpha + beta*x \n","\n","$y_i \\sim Normal(\\mu_i, \\sigma)$ -> y $\\sim$ Normal(mu, sigma)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"8DB0F9B932D940B786C3617DD25C2EDF","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 选择先验"]},{"cell_type":"code","execution_count":135,"metadata":{"collapsed":false,"id":"BD6BE331DE634FE785A6060175F429B3","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["########################################################\n","# 练习\n","# 请尝试用代码表达模型的先验\n","with pm.Model() as linear_model:\n"," # 先验分布: alpha, beta, sigma这三个参数是随机变量\n"," alpha = pm.Normal('alpha',mu=0,sd=1)\n"," beta = 
pm.Normal('beta',mu=0,sd=1, shape=1) \n"," sigma = pm.HalfNormal('sigma',sd=1)\n"," \n"," # 先验预测检查\n"," prior_checks = pm.sample_prior_predictive(samples=50)"]},{"cell_type":"code","execution_count":136,"metadata":{"collapsed":false,"id":"1ADDD4F654D048728CD0B7DA51418D33","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["fig, ax = plt.subplots()\n","x1 = np.linspace(-3, 3, 50) #生成从-2,2之间的50个假数据\n","\n","for a, b in zip(prior_checks[\"alpha\"], prior_checks[\"beta\"]):\n"," y = a + b * x1 \n"," ax.plot(x1,y)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"9F6272FA34C240859DDA51BC24A19A4C","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 模型拟合"]},{"cell_type":"code","execution_count":137,"metadata":{"collapsed":false,"id":"6C343DCE8B7040B784EA97742217B91F","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["########################################################\n","# 练习\n","########################################################\n","linear_model = pm.Model()\n","with linear_model :\n"," #\n"," alpha = pm.Normal('alpha',mu=0,sd=1)\n"," beta = pm.Normal('beta',mu=0,sd=1,shape=1)\n"," sigma = pm.HalfNormal('sigma',sd=1)\n"," #\n"," x = pm.Data(\"x\", data['SES_t'])\n"," #\n"," mu = pm.Deterministic(\"mu\", alpha + beta*x) \n"," #\n"," y_obs = pm.Normal('y_obs',mu=mu,sd=sigma,observed=data['EIB_t'])"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"79E27FECC9704144BB0266A0494AF5C5","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 采样过程评估"]},{"cell_type":"code","execution_count":138,"metadata":{"collapsed":false,"id":"E2C2D915ED1044C095AAA524FBF9D793","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stderr","output_type":"stream","text":["Auto-assigning NUTS sampler...\n","Initializing NUTS using jitter+adapt_diag...\n","Multiprocess sampling (2 chains in 2 jobs)\n","NUTS: [sigma, 
beta, alpha]\n"]},{"data":{"text/html":["\n","\n"],"text/plain":[""]},"metadata":{},"output_type":"display_data"},{"name":"stderr","output_type":"stream","text":["Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 5 seconds.\n"]}],"source":["########################################################\n","# 练习\n","# 请尝试调用pm.sample()函数,用mcmc进行采样\n","########################################################\n","with linear_model :\n"," # 使用mcmc方法进行采样,draws为采样次数,tune为调整采样策略的次数,这些次数将在采样结束后被丢弃,\n"," # target_accept为接受率, return_inferencedata=True为该函数返回的对象是arviz.InnferenceData对象\n"," # chains为我们采样的链数,cores为我们的调用的cpu数,多个链可以在多个cpu中并行计算,我们在和鲸中调用的cpu数为2\n"," trace = pm.sample(draws = 2000, tune=1000, target_accept=0.9,chains=2, cores= 2,return_inferencedata=True)"]},{"cell_type":"code","execution_count":139,"metadata":{"collapsed":false,"id":"F7D6B1475987489F9F7C1EC643A01B42","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/plain":["array([[,\n"," ],\n"," [,\n"," ],\n"," [,\n"," ]], dtype=object)"]},"execution_count":139,"metadata":{},"output_type":"execute_result"},{"data":{"text/html":[""],"text/plain":["
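采样评估中常用的 R-hat 指标可以用 numpy 写一个简化版(split-R-hat 的示意实现,仅供理解;实际分析请使用 az.rhat):

```python
import numpy as np

def split_rhat(chains):
    """简化版 split-R-hat:chains 形状为 (n_chains, n_draws)。
    将每条链对半拆分后比较链间方差 B 与链内方差 W;
    收敛时 R-hat 接近 1,一般要求小于 1.01。"""
    n_chains, n_draws = chains.shape
    half = n_draws // 2
    split = chains[:, : half * 2].reshape(n_chains * 2, half)
    n = split.shape[1]
    chain_means = split.mean(axis=1)
    B = n * chain_means.var(ddof=1)        # 链间方差
    W = split.var(axis=1, ddof=1).mean()   # 链内方差
    var_hat = (n - 1) / n * W + B / n      # 对后验方差的混合估计
    return float(np.sqrt(var_hat / W))

rng = np.random.default_rng(3)
good = rng.normal(0, 1, (2, 2000))         # 两条混合良好的链
bad = good + np.array([[0.0], [2.0]])      # 两条链停留在不同的区域
print(split_rhat(good), split_rhat(bad))
```

混合良好的链 R-hat 接近 1;两条链停在不同区域时,链间方差远大于链内方差,R-hat 明显偏离 1。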
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["########################################################\n","# 练习\n","# 请尝试调用az.plot_trace()函数,检查模型收敛情况\n","########################################################\n","az.plot_trace(trace,var_names=['alpha','beta','sigma'])"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"05079881F3E940158D5E45BC4E71F278","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["## Part 2: MCMC Diagnostic"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"6628683531E94FAB84E167AFDD6E2B76","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["在贝叶斯推断中,通过概率统计语言 PyMC3 实现 MCMC 是关键的步骤。\n","```python\n","with model:\n"," trace = pm.sample(draws = 2000, chains=2)\n","```\n","比如在上述代码示例中,我们对贝叶斯模型采样2000个样本,并同时运行2条马尔科夫链。\n","\n","在贝叶斯的workflow中,也强调需要对MCMC进行检验,检验的核心在于: \n","\n","- 这2000个样本真的能代表目标分布吗?\n","- 我们如何检测MCMC的样本是否存在问题?"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"2D5ACE0A00FF40FF89211B7271AD0B97","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["注意:\n","- PyMC 的结果 traces 中包含多个 MCMC chains\n","- 每个 chain 中可能包括多个参数\n","- 每个参数的采样都可以代表为一个参数后验分布(目标分布)\n","\n","为了方便,接来的讨论的**单个chain**都只包含**单个参数**。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"346F2A34864F4437BCEBE4D049187B6F","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["### 什么是 MCMC diagnostic?"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"1FC4E15F61524D8FABB2F7D2DB347454","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["为了检验MCMC 
样本的质量与问题,我们需要一些诊断工具,这些诊断工具统称为:马尔可夫链蒙特卡罗诊断(Markov Chain Monte Carlo diagnostics),简称 MCMC diagnostics。\n","\n","MCMC diagnostic 可以用来检查用MCMC算法采样得到的样本是否足以提供目标分布的精确近似。\n","\n","MCMC diagnostic 的目的:\n","1. 发现 MCMC 样本存在的问题。\n","2. 根据诊断对 MCMC 算法进行修正(fix)。\n","\n","MCMC diagnostic 检查的具体内容:\n","- MCMC样本的大部分采样是否来自目标分布(target distribution)。\n","- MCMC样本量是否足够近似目标分布。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"5DA3DAC5DA184FEDB5A80B505FB14CD4","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["### MCMC 样本的两大问题"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"9F02AC9DA9B547898DF15081B3A11A8F","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["要了解 MCMC diagnostic 检测到底是什么,首先我们需要知道 MCMC 样本最常见的问题。\n","\n","MCMC 样本最常见的两大问题:\n","1. bias: 样本是否代表整个后验分布?\n","2. precision: 是否有足够的样本来获得足够精确的统计推断?\n","\n","这两个问题涉及到 MCMC 诊断的两个专有概念:收敛性与有效样本量。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"1302A2D8C6874E7F8032E14C26960DDD","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["1. 
收敛(convergence)问题。\n"," \n"," 由于马尔科夫链的性质,无论 MCMC 算法的起始值(initial or starting value)和起始分布为何,随着采样的进行,样本最终都应该接近目标分布。\n","\n"," 而由于各种问题(比如,代码错误,或者先验定义错误),样本无法接近目标分布,我们称为 MCMC chain **无法收敛**。见下图2。\n"," \n"," 即使 MCMC chain 可以收敛,如果起始值离目标分布太远,那么它收敛所需的时间也更长,见下图3。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"F8A772D4D6AC40EDACCD77D480D2814E","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["图1:参数收敛时MCMC采样的样本\n","\n","\n","![Image Name](https://cdn.kesci.com/upload/image/rkijyv4px4.png?imageView2/0/w/300/h/640)\n","\n","\n","图2:参数不收敛时MCMC采样的样本\n","\n","![Image Name](https://cdn.kesci.com/upload/image/rkijynzzao.png?imageView2/0/w/300/h/640)\n","\n","图3:起始值过于远离目标分布时的MCMC采样\n","\n","![Image Name](https://cdn.kesci.com/upload/image/rkijy7xre5.png?imageView2/0/w/300/h/640)\n","\n","source: https://statlect.com/fundamentals-of-statistics/Markov-Chain-Monte-Carlo-diagnostics"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"149D8AF248DC45A18BAD6C29E5029F3A","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["2. 
有效样本量(effective sample size)过于少的问题。\n","\n"," 在MCMC过程中,样本量越多,样本分布越接近目标分布;但是当样本数量过少时,可能对目标分布产生错误的推断。那什么因素会导致采样样本的数量变少?\n"," \n"," 由于马尔科夫链的性质,当前采样 $\\theta_{t}$ 依赖于 上一次的采样 $\\theta_{t-1}$。因此,样本间的关系不是独立的,这违反了独立采样的假设。违反独立采样的后果可能就是 MCMC chain 无法收敛,如下图2。\n","\n"," 如果 $\\theta_{t}$ 与 $\\theta_{t-n}$ 的间隔 n 越大,两个样本的相关性越低。\n","\n"," **通过每隔 n 保留一个采样可以减少样本相关性的影响,但代价就是有效样本量变少**。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"273493C6105540A68FEFE0D46A6919FB","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["### MCMC 视觉检查 (Visual Diagnostics)\n","\n","检查 MCMC 采样质量最直观的方式就是视觉检查 (Visual Diagnostics)。\n","\n","上面的采样轨迹图就是视觉检查的一种形式。所以我们对此并不陌生。\n","\n","常见的视觉检查有两类:\n","- 轨迹图 Trace plots。即上面的图。\n","- 自相关函数图 Autocorrelation function (ACF) plots。\n"," \n"," 前面提到样本间存在相关性,而这种自相关性会导致轨迹图“不收敛”。而不收敛的情况有很多,并不能从轨迹图中得知是因为自相关导致了不收敛,因此我们需要自相关函数图来诊断是否存在自相关的问题。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"EC5671E8ED8F40AEA5E68A4218E09813","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["接下来我们结合代码示例进行演示。\n","\n","以前使用过的 `ArviZ` 工具包就可以完成大部分的诊断工作。所以接下来的分析会依赖于 `ArviZ` 。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"ACCCCAC29CB645618E1ED48C5A4BFCF8","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["为了演示上述不同问题对于 MCMC 采样的影响,我们生成了三种 MCMC chain:\n","\n","- good_chains 包含两条 MCMC chain,生成自一个 $\\beta$ 分布,并且两次采样间是完全独立的。\n","- bad_chains0 在 good_chains 的基础上,加入了采样间的自相关。\n","- bad_chains1 在 good_chains 的基础上,在采样的局部加入高度自相关。bad_chains1 代表了一个非常常见的场景,采样器可以很好地解析参数空间的一个区域,但是很难对一个或多个区域进行采样。\n","\n","\n","> 代码源自:2.4 Diagnosing Numerical Inference in Martin, from O. A., Kumar, R., & Lao, J. (2021). Bayesian Modeling and Computation in Python (1st ed.). Chapman and Hall/CRC. 
https://doi.org/10.1201/9781003019169\n"]},{"cell_type":"code","execution_count":140,"metadata":{"collapsed":false,"id":"1EE9394D835A4192A7CBF3183F831FCB","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["good_chains = stats.beta.rvs(2, 5,size=(2, 2000))\n","bad_chains0 = np.random.normal(np.sort(good_chains, axis=None), 0.05, size=4000).reshape(2, -1)\n","bad_chains1 = good_chains.copy()\n","\n","for i in np.random.randint(1900, size=4):\n"," bad_chains1[i%2:, i:i+100] = np.random.beta(i, 950, size=100)\n","\n","chains = {\"good_chains\":good_chains,\n"," \"bad_chains0\":bad_chains0,\n"," \"bad_chains1\":bad_chains1}"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"C86B1786279149F380E500F097EFEA4F","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### Trace Plots"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"EF0ADBBEDF0641528AA2AFBD1CCBC6AE","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["轨迹图(trace plot)是贝叶斯中最常见的诊断图,横坐标为每个迭代步骤,纵坐标为采样值。\n","\n","从这些图中,我们能够看到 \n","- MCMC chain 是否收敛。\n","- 同一个参数每条链是否收敛为相同的分布。\n","- 初步观察自相关程度。\n"," \n","在ArviZ中,通过调用函数az.plot_trace(.),可以在右侧得到轨迹图,在左侧得到参数分布图。"]},{"cell_type":"code","execution_count":141,"metadata":{"collapsed":false,"id":"96D4C029B1C44AB18FBF4D038AEF4EB5","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["az.plot_trace(chains)\n","plt.show()"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"E009BE84FE954264B8F656D1B16AA851","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["可以看到,\n","- good_chains 的表现最好,并且其中的两条 chain 几乎重合。\n","- bad_chains0 并不收敛,其中的两条 chain 并不趋向于同一个目标分布,并且两者有持续扩大的趋势。\n","- bad_chains1 虽然收敛,但是采样不均匀,即多个波峰(左图),表现为在局部过多采样(右图)。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"FD79A5834996434BA95F0722BF19486B","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["此外,如前面提到的,对于多条 MCMC 链,trace plot 还可以观测到不同初始值对采样的影响。\n","\n","![](https://bcss.org.my/tut/wp-content/uploads/2020/04/rw3.gif)\n","\n","source: https://bcss.org.my/tut/bayes-with-jags-a-tutorial-for-wildlife-researchers/mcmc-and-the-guts-of-jags/mcmc-diagnostics/\n","\n","比如,由于初始值过于极端,导致需要花费多次迭代才能达到平稳分布。\n","\n","![Image Name](https://cdn.kesci.com/upload/image/rkijy7xre5.png?imageView2/0/w/400/h/640)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"85036891737B422A80E275DB2FD0BB93","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### Autocorrelation Plots\n","\n","正如我们在讨论有效样本容量时所看到的,自相关减少了样本中包含的样本量。\n","\n","因此,我们希望自相关尽量小的同时样本量足够多。\n","\n","可以用`az.plot_autocorr`绘制自相关图(autocorrelation plot)检查自相关性。"]},{"cell_type":"code","execution_count":142,"metadata":{"collapsed":false,"id":"030E1856D1A341C980ED89F2A7D7C50B","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":[""],"text/plain":["
"]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["az.plot_autocorr(chains, combined=True) # combined=True表明将两条chains合并为1条\n","plt.show()"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"94D1AEE551D34A039D67CCCA266064F9","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["上图为自相关函数在100步间隔窗口上的条形图。\n","- 横坐标为 $\\theta_{t}$ 与 $\\theta_{t+n}$ 的间隔 n,该间隔越大,两个样本的相关性越低\n","- 纵轴为 $\\theta_{t}$ 与 $\\theta_{t+n}$ 的相关性。\n","- 灰色带表示所有相关系数形成的95%置信区间。\n","\n","该图结果显示:\n","- good_chains 的纵向高度接近于零(并且大部分在灰色带内),这表明自相关性非常低。\n","- bad_chains0 和 bad_chains1 中的高条表示自相关性值较大。\n","- bad_chains1 随着 n 的间隔增大,自相关减少。而 bad_chains0 不会随着 n 增大减少,说明该 MCMC chain的自相关问题非常严重。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"55C6613FD27B4C099F25A08F453C8A75","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["为了更好的理解自相关系数(y轴)如何计算,以及 间隔n对自相关的影响。\n","\n","我们可以自定义一个函数:\n","- 首先,取出 $\\theta_{t}$\n","- 然后,取出 $\\theta_{t+n}$\n","- 最后计算两者的相关系数\n","\n","需要注意,在当前的例子中,t的总长度为2000,每当 n 增加1,$\\theta_{t}$ 和 $\\theta_{t+n}$ 的长度就会减少1。因为间隔不可能超过2000。"]},{"cell_type":"code","execution_count":143,"metadata":{"collapsed":false,"id":"5D48AE1C7B2149C386030332AB70CAC0","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[],"source":["def autocorr(lag, chain):\n"," n = lag # 定义间隔n的大小\n"," t = 2000 # 单条 chain的总长度\n"," theta_t = chain[0:(t-1-n)] # 提取 theta_t\n"," theta_tn = chain[n:(t-1)] # 提取 theta_t+n\n"," rho = np.corrcoef(theta_t,theta_tn) # 计算相关系数\n"," print(\"间隔为\", n, \"时,的相关系数\", rho.round(2)[0][1]) # 
输出结果"]},{"cell_type":"code","execution_count":144,"metadata":{"collapsed":false,"id":"14515840622A4EC08740F93190D2AB7E","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stdout","output_type":"stream","text":["间隔为 1 时,的相关系数 -0.02\n","间隔为 1 时,的相关系数 0.63\n","间隔为 1 时,的相关系数 -0.02\n"]}],"source":["# 间隔 n = 1时,三条链的自相关性\n","n = 1\n","autocorr(n, chains[\"good_chains\"][0])\n","autocorr(n, chains[\"bad_chains0\"][0])\n","autocorr(n, chains[\"bad_chains1\"][0])"]},{"cell_type":"code","execution_count":145,"metadata":{"collapsed":false,"id":"99760EF9D2EF473E8DFEDCE018F9F998","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stdout","output_type":"stream","text":["间隔为 2 时,的相关系数 0.02\n","间隔为 2 时,的相关系数 0.63\n","间隔为 2 时,的相关系数 0.02\n"]}],"source":["# 间隔 n = 2时,三条链的自相关性\n","n = 2\n","autocorr(n, chains[\"good_chains\"][0])\n","autocorr(n, chains[\"bad_chains0\"][0])\n","autocorr(n, chains[\"bad_chains1\"][0])"]},{"cell_type":"code","execution_count":146,"metadata":{"collapsed":false,"id":"0A3B2F6F13EB4B9992DC9CCD3F4E432D","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stdout","output_type":"stream","text":["间隔为 3 时,的相关系数 0.03\n","间隔为 3 时,的相关系数 0.63\n","间隔为 3 时,的相关系数 0.03\n"]}],"source":["# 间隔 n = 3时,三条链的自相关性\n","n = 3\n","autocorr(n, chains[\"good_chains\"][0])\n","autocorr(n, chains[\"bad_chains0\"][0])\n","autocorr(n, chains[\"bad_chains1\"][0])"]},{"cell_type":"code","execution_count":147,"metadata":{"collapsed":false,"id":"CB96E3612D684E9A87B90714369E5FA8","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stdout","output_type":"stream","text":["间隔为 50 时,的相关系数 -0.01\n","间隔为 50 
时,的相关系数 0.62\n","间隔为 50 时,的相关系数 -0.01\n"]}],"source":["# 间隔 n = 50时,三条链的自相关性\n","n = 50\n","autocorr(n, chains[\"good_chains\"][0])\n","autocorr(n, chains[\"bad_chains0\"][0])\n","autocorr(n, chains[\"bad_chains1\"][0])"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"2679BE052C3A4934B573E2D6E2B15305","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["可以看到,结果与自相关图的结果一致:\n","- good_chains 的结果始终在0附近,表明自相关低。\n","- bad_chains0 的结果显示,即使间隔n扩大到50,自相关系数也仅从0.63下降到0.62。\n","- bad_chains1 在自相关图中显示,当间隔n扩大到50时,自相关系数从0.15下降到0.08附近。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"A44FD4DE93764E119701A77FDAB54B82","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["上述计算过程也可以通过 ArviZ 自带的 `autocorr` 函数完成。"]},{"cell_type":"code","execution_count":148,"metadata":{"collapsed":false,"id":"DC9C496993384C6588F8A3B59C12A32E","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"name":"stdout","output_type":"stream","text":["good_chains的相关系数: [-0.02 0.02 0.02 -0.01]\n","bad_chains0的相关系数: [0.63 0.63 0.63 0.59]\n","bad_chains1的相关系数: [-0.02 0.02 0.02 -0.01]\n"]}],"source":["n = [1,2,3,50] # 定义间隔为 1,2,3 和 50\n","rhos = az.autocorr(chains[\"good_chains\"][0]).round(2)[n]\n","print(\"good_chains的相关系数:\",rhos)\n","rhos = az.autocorr(chains[\"bad_chains0\"][0]).round(2)[n]\n","print(\"bad_chains0的相关系数:\",rhos)\n","rhos = az.autocorr(chains[\"bad_chains1\"][0]).round(2)[n]\n","print(\"bad_chains1的相关系数:\",rhos)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"99C77C10115F478E8FC1881A7BE3D8F2","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["通过轨迹图和自相关图,我们可以诊断出 MCMC chain 
中存在的不同问题。\n","\n","但自然而然能想到的两个问题是:\n","- 通过视觉方法判断 MCMC 的问题是否主观?能否通过客观的指标进行诊断?\n","- 发现了 MCMC 的问题,我们如何进行修正?"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"AD3956B1142043B48D16C9975E3DFD83","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["### MCMC 诊断指标"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"CD674A92366B400F8F48277FF7C5CBE1","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["通过绘图对MCMC进行检查的优势在于直观,并且能通过图推测问题在什么地方。\n","\n","但对于许多参数的模型来说,挨个检查图形是非常困难的。\n","\n","因此,通过数值进行判断一方面可以提高诊断的客观性,另一方面提高了诊断的效率。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"BCB9A10C863947548A3031FC4DAA4BAA","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["针对 MCMC 的两大问题,这里存在两个主要的指标:\n","1. 针对收敛问题的:Rhat,$\\hat{R}$,也称为“潜在规模缩减因子” (Potential Scale Reduction Factor, PSRF)。它将单个链样本的变异与混合了所有链样本的变异进行比较。\n","2. 针对有效样本量问题的:effective sample size (ESS) or effective number of draws (n.eff)。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"41F018CE8BA840C69756E8789F3C638D","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### Convergence and Rhat"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"114B51D6E0654D329CB4530B46B5C56F","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["前面提到,收敛(convergence)描述了 MCMC 样本两个层面的问题:\n","1. MCMC 采样是否从初始值“收敛”到一个稳定的分布。\n","2. 
即使链收敛到了一个稳定的分布,多条链形成的稳定分布是否相似。\n","\n","反面例子就是 bad_chains0: \n","\n","该采样既没有收敛到一个稳定的分布(右图中采样整体向上偏移),并且两个链形成的分布也不相同(左图)。\n","![Image Name](https://cdn.kesci.com/upload/image/rkimd3udb3.png?imageView2/0/w/640/h/640)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"331F0B70D62146A186AF06D0787BE05C","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["Rhat的作用就是通过一个指标反映上述两个层面的问题。\n","\n","具体逻辑为:\n","- 运行多条 MCMC 链,并且以不同的初始值开始采样。\n","- 比较每条链的相似性。\n","\n","数学逻辑为:\n","- 计算θ的所有样本的标准差,包括所有链的样本\n","- 计算每条链标准偏差,并通过平方平均数整合到一起\n","- 比较两个标准差的大小。如果两者的比值为1,说明两种变异相同,表明各条链是相似的。\n","- 如果该值大于1.01,说明两种变异可能存在差异,即参数可能不收敛。\n","\n","source: \n","- https://www.r-bloggers.com/2020/01/applied-bayesian-statistics-using-stan-and-r/"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"F82C485B084E42E5A4F9E3047359842D","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["假设参数 $\\theta$ 有 M 条 chain,每条 chain 有 n 个样本。\n","\n","1. 用 $s_m^2$ 表示每条链内部的方差。那么链内部的变异 within-chain variance $W$ 为所有链方差的均值 $W = \\frac{1}{M} \\, \\sum_{m=1}^M s_m^2$\n","2. 用 $\\bar{s}_m^2$ 表示由各链的均值计算得到的方差。链间的变异 between-chain variance $B$ 为 $B = \\bar{s}_m^2$。\n","3. 
结合在一起得到 Rhat $\\hat{R} = \\sqrt{\\frac{W+B}{W}}$\n","\n","可见,原理和方差分析类似,如果链间变异B趋近于0,那么Rhat趋近于1,说明各链的均值几乎相同,以此反映各链的相似性。\n","\n","这里的数学公式有所简化,详细请看 https://mc-stan.org/docs/reference-manual/notation-for-samples-chains-and-draws.html。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"B39B75E35ECB42EF8C3701EBC9615758","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["Rhat计算的条件是有多条 MCMC 链,并且尽可能地让每条链使用不同的起始值。\n","\n","我们可以通过 `az.rhat(.)` 简化计算过程。"]},{"cell_type":"code","execution_count":149,"metadata":{"collapsed":false,"id":"803C707461014C6A8F6A096D4ABBFB15","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":["
\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","
<xarray.Dataset>\n","Dimensions:      ()\n","Data variables:\n","    good_chains  float64 1.0\n","    bad_chains0  float64 2.403\n","    bad_chains1  float64 1.02
"],"text/plain":["\n","Dimensions: ()\n","Data variables:\n"," good_chains float64 1.0\n"," bad_chains0 float64 2.403\n"," bad_chains1 float64 1.02"]},"execution_count":149,"metadata":{},"output_type":"execute_result"}],"source":["az.rhat(chains)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"5F7D0041F530481882C769AC4BFA798A","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["从这个结果中我们可以看出,\n","- good_chains 的 Rhat 接近1,表示该参数的多条链都是收敛并且相似的。\n","- bad_chains0 的结果非常糟糕,这可以结合轨迹图得到佐证。\n","- bad_chains1 的结果相对较好,但其Rhat仍然大于1.01,说明该参数的收敛程度差,这点通过轨迹图有时难以发现其存在的问题。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"60AAA67CF55A4753A4E96F071BC54C32","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["Rhat的最低要求是低于1.1,但现在大多数发表的文章要求小于1.01。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"B8B5604F23884ABDA58C1DFC82936340","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["#### 有效样本量 effective sample size (ESS) 与自相关 autocorrelation"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"E266E5EF75AA45E3BB92525997BCF16E","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["前面通过自相关函数图发现,bad_chains1 会随着 n 间隔增大而自相关减少。\n","\n","但这样做的代价是可用的样本数量变少。\n","\n","\n","![Image Name](https://cdn.kesci.com/upload/image/rkimhrvf7h.png?imageView2/0/w/640/h/640)\n"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"06449FBAF2084318A211A85E700D98B7","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["因此,我们是否能通过一个指标反应,在满足最小自相关的条件下,即只保留隔n个采样的样本时,可用的样本还剩下多少?\n","\n","这就是 有效样本量(effective sample 
size,ESS)指标的意义。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"09519EA6829E40FB863561B1A9C6D226","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["ESS计算的逻辑为:\n","- 在样本量为n的 MCMC链中,分别计算 1到n 采样间隔下的样本相关性 $\\rho$,这样可以得到 n 个 $\\rho_{t}$。\n","- 如果样本间自相关越大,$\\rho$ 越接近1,相反则接近0。因此,n个 $\\rho_{t}$ 相加,其值越接近n,说明自相关越大。\n","- 结合这个想法,用 n 比上 $\\rho_{t}$ 的求和即可得到有效样本量。\n"," \n"," $ESS = \\frac{n}{\\sum_{t = -\\infty}^{\\infty} \\rho_t}$\n","\n"," 如果自相关特别大,那么 ESS 接近1,如果自相关特别小,那么ESS接近n。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"5F7892F715384BA1BBDC6218A481F9FA","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["我们可以通过 `az.ess(.)` 函数计算 ESS。"]},{"cell_type":"code","execution_count":150,"metadata":{"collapsed":false,"id":"24E130C91D1F4DE4B518EBE072306954","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":["
\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","
<xarray.Dataset>\n","Dimensions:      ()\n","Data variables:\n","    good_chains  float64 3.965e+03\n","    bad_chains0  float64 2.438\n","    bad_chains1  float64 148.1
"],"text/plain":["\n","Dimensions: ()\n","Data variables:\n"," good_chains float64 3.965e+03\n"," bad_chains0 float64 2.438\n"," bad_chains1 float64 148.1"]},"execution_count":150,"metadata":{},"output_type":"execute_result"}],"source":["az.ess(chains)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"18540CA289C9464BBE4FCE5E6CF091AC","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["可以看到,\n","- good_chains 的有效样本数量接近4000,说明样本的自相关问题少。\n","- bad_chains0 的有效样本量为2.4,该结果说明该 MCMC 样本的结果非常差。\n","- bad_chains1 的结果虽然比 bad_chains0 多,但是仍然很少。\n","\n","一般建议 ESS 需要大于400,现在更多的要求是大于1000。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"34FDD10820C342C4B708A65203BFADB2","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["除了 ESS 外,另一个反应自相关的指标是 MCMC error,也叫 MCSE。\n","\n","反映了 MCMC链的变异(方差)。其中,自相关越大,变异也越大。\n","\n","我们可以通过 `az.mcse(.)` 函数计算 MCSE。"]},{"cell_type":"code","execution_count":151,"metadata":{"collapsed":false,"id":"1DF032834DB647C890A37415A0D35C9D","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":["
\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","
<xarray.Dataset>\n","Dimensions:      ()\n","Data variables:\n","    good_chains  float64 0.002537\n","    bad_chains0  float64 0.1084\n","    bad_chains1  float64 0.01348
"],"text/plain":["\n","Dimensions: ()\n","Data variables:\n"," good_chains float64 0.002537\n"," bad_chains0 float64 0.1084\n"," bad_chains1 float64 0.01348"]},"execution_count":151,"metadata":{},"output_type":"execute_result"}],"source":["az.mcse(chains)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"067901BE60974F4C8880374C4B2F4557","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["结果显示:good_chains 的 MCSE最小,bad_chains0的结果最差。\n","这与 ESS 的解释相符。"]},{"cell_type":"code","execution_count":152,"metadata":{"collapsed":false,"id":"2CD869272DC64DC88A635D52C5814E99","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"outputs":[{"data":{"text/html":["
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
mcse_meanmcse_sdess_bulkess_tailr_hat
good_chains0.0030.0023965.03963.01.00
bad_chains00.1080.0892.011.02.40
bad_chains10.0130.010148.061.01.02
\n","
"],"text/plain":[" mcse_mean mcse_sd ess_bulk ess_tail r_hat\n","good_chains 0.003 0.002 3965.0 3963.0 1.00\n","bad_chains0 0.108 0.089 2.0 11.0 2.40\n","bad_chains1 0.013 0.010 148.0 61.0 1.02"]},"execution_count":152,"metadata":{},"output_type":"execute_result"}],"source":["az.summary(chains, kind=\"diagnostics\")"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"3C9D9E22F78B45F0BC396AA2A392F6CC","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["### 如何改进 MCMC 采样过程"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"A001B56C468D43618A4D0F38F0C4CCCB","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["通过**视觉诊断**和**相应的指标**。我们了解了如何诊断 MCMC 样本可能出现的问题。\n","\n","在诊断出 MCMC 样本可能存在的问题后,下一步就是如何修复这些问题。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"3E1A5A8C21364A42989267DAA4A3EB26","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["常见的解决方法包括:\n","- 增加 MCMC 样本数量。\n"," \n"," 比如将 `trace = pm.sample(draws = 2000)` 中的 draws 扩大为4000及其以上。\n","\n","- 增加 MCMC链的数量。\n"," \n"," 比如将 `trace = pm.sample(draws = 2000, chains=2)` 中的 chains 扩大为4。\n","\n","- 设置burn-in。\n"," \n"," 比如设置 `trace = pm.sample(draws = 2000, tune = 1000)` 这会采样3000个样本,丢掉最开的1000个样本。\n","\t\n","\t这样做的目的在于避免初始值太极端而导致花费太多迭代来达到平稳分布。\n","\t![Image Name](https://cdn.kesci.com/upload/image/rkijy7xre5.png?imageView2/0/w/300/h/640)"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"D549DD485B654EDD8B9F15076BB27795","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["这些方法的原理非常简单,目的都是增加有效样本数量。\n","\n","但对于本身就不能收敛的模型,增加样本数量也是没有用的。\n","\n","需要注意的,MCMC诊断只是 Baysian workflow 的其中一环。\n","\n","当通过诊断不能发现问题,我们需要返回 
workflow中,通过其他方法去发现潜在的问题。\n","\n","常见问题包括:\n","- 由于先验设置错误导致模型无法收敛。\n","- 变量单位不统一导致参数过大或过小。\n","- 数据量太少导致参数估计不准确。\n","- 模型设定错误,导致模型无法收敛。\n","- 编程代码错误,导致模型采样出现问题。"]},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"F77016AFBEEE4165BFAD3D1DB8606F22","jupyter":{},"notebookId":"63664927d28b18529a403658","scrolled":false,"slideshow":{"slide_type":"slide"},"tags":[],"trusted":true},"source":["除了今天介绍的 MCMC 诊断方法和修复方法外,还有很多其他的方法,我们会在之后的实战中进行介绍。\n","\n","其他的诊断方法:\n","- Rank Plots \n","![Image Name](https://cdn.kesci.com/upload/image/rkina0ih68.png?imageView2/0/w/320/h/320)\n","- Parallel plots\n","- Separation plots\n","- Divergences (HMC特有的问题及诊断方法)\n","\n","source: https://arviz-devs.github.io/arviz/examples/index.html\n","\n","其他修复方法:\n","- 设置 `target_accept`。\n","- 参见 workflow中的设置先验,参数重整化,模型结构调整等。"]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.5.2"}},"nbformat":4,"nbformat_minor":0} diff --git a/Zheng/Yuanrui_10_0.ipynb b/Zheng/Yuanrui_10_0.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..322c9cd500e4ede0eee35eedc25c8772a02e5af1 --- /dev/null +++ b/Zheng/Yuanrui_10_0.ipynb @@ -0,0 +1 @@ +{"cells":[{"cell_type":"markdown","source":"# Lecture 10: linear model and model diagnostics \n\n## Instructor: 胡传鹏(博士)[Dr. 
Hu Chuan-Peng]\n\n### 南京师范大学心理学院[School of Psychology, Nanjing Normal University]","metadata":{"id":"B2E13C03C0E748919DA40A0B58CB4468","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true}},{"cell_type":"markdown","source":"## Recap: Linear Model and model diagnostics\n\n* Workflow\n* MCMC diagnostics","metadata":{"id":"543A644E8F7944B9BAFCD281B8818FDC","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true}},{"cell_type":"markdown","source":"我们在上节课中采用了一个简化的workflow,具体包括如下几个步骤:\n","metadata":{"id":"4F862DA11B554DB289B4E89D1E2922C4","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true}},{"cell_type":"markdown","source":"\n![Image Name](https://cdn.kesci.com/upload/image/rkvikqg9q6.png?imageView2/0/w/960/h/960)\n","metadata":{"id":"D1C645DC1627451E809A160B16451896","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true}},{"cell_type":"markdown","source":"现在,我们通过上节课的例子来简单回顾一下workflow。","metadata":{"id":"C0D7787982734EC384AD5C3B652E8A0D","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true}},{"cell_type":"markdown","source":"在本例中,涉及:研究问题、数据收集、选择模型、选择先验、模型拟合、采样过程评估、模型诊断。","metadata":{"id":"A2C3D390E20D44858DF0526DD185459C","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true}},{"cell_type":"markdown","source":"### (1) 
提出研究问题\n\n还有研究发现个体创新行为可能与自尊水平有关。\nSES_t:自尊水平;\nEIB_t:创新行为。\n\n试探究两者之间的关系。","metadata":{"id":"6FF8EC3420AA40608BCFA62EF53D10CB","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true}},{"cell_type":"markdown","source":"### (2) 数据收集","metadata":{"id":"67488C0B681C40AF8373C9C410AEF7F8","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true}},{"cell_type":"code","source":"#加载需要使用的库\n%matplotlib inline\nimport numpy as np \nfrom scipy import stats\nimport matplotlib.pyplot as plt\nimport pandas as pd\nimport arviz as az\nimport pymc3 as pm\n# mpl_toolkits.mplot3d是用于三维画图的,Axes3D用于画三维图\nfrom mpl_toolkits.mplot3d import Axes3D\n#数据预处理与可视化\nnp.random.seed(123) # 随机数种子,确保随后生成的随机数相同\ndata = pd.read_csv(\"/home/mw/input/data9464/clean.csv\") # 读取数据,该数据需要提前挂载\ndata['SES_t'] = (data['SES_t'] - data['SES_t'].mean()) / data['SES_t'].std()#将变量进行标准化\n\ndata['EIB_t'] = (data['EIB_t'] - data['EIB_t'].mean()) / data['EIB_t'].std()#将变量进行标准化\n\nplt.scatter(data['SES_t'],data['EIB_t'])\nplt.xlabel('SES_t')\nplt.ylabel('EIB_t')","metadata":{"id":"8DFB687F35914AFB8C05853D55E5D139","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true},"outputs":[{"output_type":"execute_result","data":{"text/plain":"Text(0, 0.5, 'EIB_t')"},"metadata":{},"execution_count":15},{"output_type":"display_data","data":{"text/plain":"
","text/html":""},"metadata":{"needs_background":"light"},"execution_count":null}],"execution_count":15},{"cell_type":"markdown","source":"### (3) 选择模型","metadata":{"id":"61C5CABC8CA34E878D153FBA69542021","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true}},{"cell_type":"markdown","source":"尝试构建一个简单的线性模型\n\n线性模型可以用概率的形式进行表达\n\n\n$\\alpha \\sim Normal(0,1)$ -> a $\\sim$ Normal(mu,sigma)\n\n$\\beta \\sim Normal(0,1)$ -> b $\\sim$ Normal(mu,sigma)\n\n$\\sigma \\sim HalfNormal(1)$ -> sigma $\\sim$ HalfNormal(1)\n\n$\\mu_i = \\alpha + \\beta *x$ -> mu = alpha + beta*x \n\n$y \\sim Normal(\\mu_i,sigma)$ -> y $\\sim$ Normal(mu,sigma)","metadata":{"id":"59B8B93E4D324F5DAB64DA0CEB5CCA31","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true}},{"cell_type":"markdown","source":"### (4)选择先验","metadata":{"id":"C24612EBE8DE46538C3C68A040CD7B57","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true}},{"cell_type":"code","source":"# 在pymc3中,pm.Model()定义了一个新的模型对象,这个对象是模型中随机变量的容器\n# 在python中,容器是一种数据结构,是用来管理特殊数据的对象\n# with语句定义了一个上下文管理器,以 linear_model 作为上下文,在这个上下文中定义的变量都被添加到这个模型\nwith pm.Model() as linear_model:\n # 先验分布: alpha, beta, sigma这三个参数是随机变量\n alpha = pm.Normal('alpha',mu=0,sd=1)\n beta = pm.Normal('beta',mu=0,sd=1, shape=1) \n sigma = pm.HalfNormal('sigma',sd=1)\n \n # 先验预测检查\n prior_checks = pm.sample_prior_predictive(samples=50)","metadata":{"id":"8E89C676474E4E2AA6E63B19D7FA8B90","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true},"outputs":[],"execution_count":16},{"cell_type":"code","source":"fig, ax = plt.subplots()\nx1 = np.linspace(-3, 3, 50) #生成从-2,2之间的50个假数据\n\nfor a, b in 
zip(prior_checks[\"alpha\"], prior_checks[\"beta\"]):\n y = a + b * x1 # 基于假数据生成预测值\n ax.plot(x1,y)","metadata":{"id":"47053D46ABF54C6FB352FD0DF7715D11","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true},"outputs":[{"output_type":"display_data","data":{"text/plain":"
","text/html":""},"metadata":{"needs_background":"light"},"execution_count":null}],"execution_count":17},{"cell_type":"markdown","source":"### (5) Fitting the data","metadata":{"id":"9DA3B2B9238B492D85420F56F0A3C620","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true}},{"cell_type":"code","source":"linear_model = pm.Model()\nwith linear_model:\n    # In PyMC3, pm.Model() defines a new model object that serves as a container for the model's random variables\n    # In Python, a container is a data structure used to manage a collection of objects\n    # The with statement opens a context manager with linear_model as the context; every variable defined inside it is added to this model\n    alpha = pm.Normal('alpha', mu=0, sd=1)\n    beta = pm.Normal('beta', mu=0, sd=1, shape=1)\n    sigma = pm.HalfNormal('sigma', sd=1)\n    # x is the predictor, the SNS_t column of the data loaded earlier\n    x = pm.Data(\"x\", data['SNS_t'])\n    # Linear model: mu is a deterministic variable whose value is fully determined by the right-hand side\n    mu = pm.Deterministic(\"mu\", alpha + beta*x)\n    # Observed values of Y: an observed random variable that defines the likelihood of the data under the model; the observed argument tells PyMC3 that these values are data and will not be changed by the sampler\n    y_obs = pm.Normal('y_obs', mu=mu, sd=sigma, observed=data['EIB_t'])","metadata":{"id":"8E7C0A6A910546508B4063B77A8C8B8B","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true},"outputs":[],"execution_count":18},{"cell_type":"markdown","source":"### (6) Diagnosing the sampling process\n\nIf MCMC is used to approximate the posterior, the sampling process itself must be evaluated first:\n\n* has it converged;\n* is it close to the true posterior.\n\nWe evaluate the sampling process by visual inspection (trace plots) and the rhat statistic.","metadata":{"id":"9DE229BD2025423B85C917275D6C3E63","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true}},{"cell_type":"code","source":"with linear_model:\n    # Sample with MCMC: draws is the number of posterior samples; tune is the number of steps used to adapt the sampler, discarded after sampling;\n    # target_accept is the target acceptance rate; return_inferencedata=True makes the function return an arviz.InferenceData object;\n    # chains is the number of chains and cores the number of CPUs; chains run in parallel across CPUs (2 CPUs are available on the Heywhale [和鲸] platform)\n    trace = pm.sample(draws=2000, tune=1000, target_accept=0.9, chains=2, cores=2, return_inferencedata=True)","metadata":{"id":"7776FCCFB2C5441893BE0B223D3C3B89","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true},"outputs":[{"output_type":"stream","name":"stderr","text":"Auto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [sigma, beta, alpha]\n"},{"output_type":"display_data","data":{"text/plain":"","text/html":"\n\n"},"metadata":{},"execution_count":null},{"output_type":"stream","name":"stderr","text":"Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 5 seconds.\n"}],"execution_count":19},{"cell_type":"code","source":"az.plot_trace(trace, var_names=['alpha','beta','sigma'])","metadata":{"id":"215CDFEF736E45A68BCFFCB03E74A7B6","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true},"outputs":[{"output_type":"execute_result","data":{"text/plain":"array of 3x2 AxesSubplot objects (trace plots for alpha, beta, sigma)"},"metadata":{},"execution_count":20},{"output_type":"display_data","data":{"text/plain":"
","text/html":""},"metadata":{"needs_background":"light"},"execution_count":null}],"execution_count":20},{"cell_type":"code","source":"az.summary(trace, var_names=['alpha','beta','sigma'], kind=\"diagnostics\")","metadata":{"id":"9C6692072E1C44D78988AF65F5E5A4ED","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true},"outputs":[{"output_type":"execute_result","data":{"text/plain":" mcse_mean mcse_sd ess_bulk ess_tail r_hat\nalpha 0.0 0.0 5018.0 2745.0 1.0\nbeta[0] 0.0 0.0 4503.0 3309.0 1.0\nsigma 0.0 0.0 4367.0 3085.0 1.0","text/html":"
"},"metadata":{},"execution_count":21}],"execution_count":21},{"cell_type":"markdown","source":"### (7) Model diagnostics\n\nGiven that MCMC has worked properly, we still need to check whether the model fits the data well.\n\nWe use the posterior predictive distribution to generate simulated data from the estimated parameters and compare them with the observed data.","metadata":{"id":"0C5F6A4325D84CF58C46E8E009EB0CA3","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true}},{"cell_type":"code","source":"# The posterior predictive computation still happens inside the model context\nwith linear_model:\n    # pm.sample_posterior_predictive() draws from the posterior predictive distribution using the posterior samples in trace.posterior\n    ppc_y = pm.sample_posterior_predictive(trace.posterior)\n# convert ppc_y into an InferenceData object and merge it into trace\naz.concat(trace, az.from_pymc3(posterior_predictive=ppc_y), inplace=True)","metadata":{"id":"08DBC6B10B134772855BE0F0CF010E27","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true},"outputs":[{"output_type":"display_data","data":{"text/plain":"","text/html":"\n\n"},"metadata":{},"execution_count":null},{"output_type":"update_display_data","data":{"text/plain":"","text/html":"\n
\n "},"metadata":{},"execution_count":null},{"output_type":"stream","name":"stderr","text":"/opt/conda/lib/python3.7/site-packages/arviz/data/io_pymc3.py:100: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context.\n  FutureWarning,\n"}],"execution_count":22},{"cell_type":"code","source":"# plot the posterior predictive distribution\naz.plot_ppc(trace)","metadata":{"id":"B3201FD32B234C05BD6942227CDE422E","notebookId":"63664e1ed28b18529a408653","jupyter":{},"collapsed":false,"scrolled":false,"tags":[],"slideshow":{"slide_type":"slide"},"trusted":true},"outputs":[{"output_type":"execute_result","data":{"text/plain":""},"metadata":{},"execution_count":23},{"output_type":"stream","name":"stderr","text":"/opt/conda/lib/python3.7/site-packages/IPython/core/events.py:89: UserWarning: Creating legend with loc=\"best\" can be slow with large amounts of data.\n  func(*args, **kwargs)\n"},{"output_type":"display_data","data":{"text/plain":"
","text/html":""},"metadata":{"needs_background":"light"},"execution_count":null}],"execution_count":23}],"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"name":"python","mimetype":"text/x-python","nbconvert_exporter":"python","file_extension":".py","version":"3.5.2","pygments_lexer":"ipython3"}},"nbformat":4,"nbformat_minor":0} \ No newline at end of file
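The notebook's section (6) leans on the `r_hat` column of `az.summary` to judge convergence. As a rough illustration of what that diagnostic measures, here is a minimal numpy sketch of the classic split R-hat statistic. It is simplified relative to ArviZ's implementation (ArviZ additionally rank-normalizes the draws), so its values will not match `az.summary` exactly; the chains below are synthetic, not the notebook's actual trace.

```python
import numpy as np

def split_rhat(chains):
    """Split R-hat over an array of shape (n_chains, n_draws).

    Each chain is split in half, so within-chain drift also
    inflates the statistic, not just disagreement between chains.
    """
    n_chains, n_draws = chains.shape
    half = n_draws // 2
    # Split every chain into two contiguous halves -> 2 * n_chains sub-chains
    sub = chains[:, : 2 * half].reshape(n_chains * 2, half)
    n = half
    chain_means = sub.mean(axis=1)
    chain_vars = sub.var(axis=1, ddof=1)
    W = chain_vars.mean()               # mean within-chain variance
    B = n * chain_means.var(ddof=1)     # between-chain variance
    var_hat = (n - 1) / n * W + B / n   # pooled estimate of posterior variance
    return float(np.sqrt(var_hat / W))

rng = np.random.default_rng(123)
# Two well-mixed chains sampling the same distribution: R-hat close to 1
good = rng.normal(0.0, 1.0, size=(2, 2000))
# Two chains stuck around different values: R-hat far above 1
bad = np.stack([rng.normal(0.0, 0.1, 2000), rng.normal(3.0, 0.1, 2000)])

print(split_rhat(good))      # close to 1.0
print(split_rhat(bad) > 2)   # True
```

A common rule of thumb (matching the notebook's `r_hat = 1.0` column) is to treat values above about 1.01 as a sign the chains have not yet converged.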
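Section (7)'s posterior predictive check (`pm.sample_posterior_predictive` plus `az.plot_ppc`) can also be written out by hand. The sketch below uses synthetic stand-ins — `x` and `y` mimic the standardized SNS_t and EIB_t columns, and the "posterior draws" of `alpha`, `beta`, `sigma` are fabricated for illustration, not taken from the notebook's actual posterior — to show the mechanics: simulate one replicated data set per posterior draw, then compare a test statistic between replications and observed data.

```python
import numpy as np

rng = np.random.default_rng(123)

# Synthetic stand-ins: x plays the role of standardized SNS_t, y of EIB_t
n_obs, n_draws = 500, 1000
x = rng.normal(size=n_obs)
y = 0.4 * x + rng.normal(scale=0.9, size=n_obs)

# Fabricated "posterior draws" of the regression parameters (illustration only)
alpha = rng.normal(0.0, 0.02, size=n_draws)
beta = rng.normal(0.4, 0.03, size=n_draws)
sigma = np.abs(rng.normal(0.9, 0.02, size=n_draws))

# One replicated data set per draw: y_rep has shape (n_draws, n_obs)
noise = rng.normal(size=(n_draws, n_obs))
y_rep = alpha[:, None] + beta[:, None] * x[None, :] + sigma[:, None] * noise

# Posterior predictive p-value for a chosen test statistic (here: the SD of y).
# Values near 0 or 1 flag misfit; mid-range values are consistent with the model.
ppp = float((y_rep.std(axis=1) >= y.std()).mean())
print(y_rep.shape, ppp)
```

`az.plot_ppc` does the graphical version of this comparison: it overlays the density of the observed `y` on the densities of many such `y_rep` data sets.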