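#I'm a data newbie (really - - ) who only started using DataCamp a few days ago, and it's quite fun. Almost all the exercises go from easy to hard, and most can be completed as long as you can follow the gist of the English.
#If I have time, I plan to write up some of what I learn.
#If anyone else is studying on DataCamp, feel free to leave a comment so we can learn together.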
#title
Dr. Semmelweis and the discovery of handwashing
##summary
Reanalyse the data behind one of the most important discoveries of modern medicine: Handwashing.
##skill
pandas foundations
The whole story is set against Ignaz Semmelweis's discovery in 1847:
In 1847 the Hungarian physician Ignaz Semmelweis makes a breakthrough discovery: He discovers handwashing. Contaminated hands were a major cause of childbed fever and by enforcing handwashing at his hospital he saved hundreds of lives.
The project is divided into 9 parts:
- 1.Meet Dr. Ignaz Semmelweis
- 2.The alarming number of deaths
- 3.Death at the clinics
- 4.The handwashing begins
- 5.The effect of handwashing
- 6.The effect of handwashing highlighted
- 7.More handwashing, fewer deaths?
- 8.A Bootstrap analysis of Semmelweis handwashing data
- 9.The fate of Dr. Semmelweis
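Below I annotate each part. It's all fairly basic (really = =), but I hope it helps other beginners who, like me, are new to data analysis in Python.
The old professor noticed that mothers who had just given birth often died of a disease called childbed fever, so he investigated and collected some data: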
    # importing modules
    # ... YOUR CODE FOR TASK 1 ...
    import pandas as pd  # import pandas under the alias pd
    import csv  # import csv (not actually used below)

    # Read datasets/yearly_deaths_by_clinic.csv into yearly
    yearly = pd.read_csv('datasets/yearly_deaths_by_clinic.csv')  # load the CSV file into the yearly variable with pd.read_csv

    # Print out yearly
    # ... YOUR CODE FOR TASK 1 ...
    print(yearly)  # print yearly to check it
output:
        year  births  deaths    clinic
    0   1841    3036     237  clinic 1
    1   1842    3287     518  clinic 1
    2   1843    3060     274  clinic 1
    3   1844    3157     260  clinic 1
    4   1845    3492     241  clinic 1
    5   1846    4010     459  clinic 1
    6   1841    2442      86  clinic 2
    7   1842    2659     202  clinic 2
    8   1843    2739     164  clinic 2
    9   1844    2956      68  clinic 2
    10  1845    3241      66  clinic 2
    11  1846    3754     105  clinic 2
After seeing the output above, the professor felt things weren't that simple:
    # Calculate proportion of deaths per no. births
    # ... YOUR CODE FOR TASK 2 ...
    yearly["proportion_deaths"] = yearly['deaths'] / yearly['births']  # add a proportion_deaths (death rate) column

    # Extract clinic 1 data into yearly1 and clinic 2 data into yearly2
    yearly1 = yearly.loc[yearly['clinic'] == 'clinic 1']  # select the clinic 1 rows using loc
    yearly2 = yearly.loc[yearly['clinic'] == 'clinic 2']  # select the clinic 2 rows

    # Print out yearly1
    # ... YOUR CODE FOR TASK 2 ...
    print(yearly1)
output:
       year  births  deaths    clinic  proportion_deaths
    0  1841    3036     237  clinic 1           0.078063
    1  1842    3287     518  clinic 1           0.157591
    2  1843    3060     274  clinic 1           0.089542
    3  1844    3157     260  clinic 1           0.082357
    4  1845    3492     241  clinic 1           0.069015
    5  1846    4010     459  clinic 1           0.114464
loc reference:
To select the rows whose column value equals a scalar some_value, use ==:
df.loc[df['column_name'] == some_value]
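As a minimal sketch of this pattern, the toy frame below (made-up values, not the project data) shows that the comparison first builds a boolean Series, and loc then keeps only the rows where it is True. The names `df`, `mask`, and `ward_a` are hypothetical and chosen just for this illustration.

```python
import pandas as pd

# Toy data for illustration only (not from the Semmelweis dataset)
df = pd.DataFrame({
    "ward": ["A", "A", "B", "B"],
    "births": [100, 120, 90, 110],
})

# The comparison yields a boolean Series aligned with df's index
mask = df["ward"] == "A"

# loc keeps the rows where the mask is True; the original index labels are preserved
ward_a = df.loc[mask]
print(ward_a)
```

Because the original index labels survive the selection, the filtered rows can still be traced back to their positions in the full table.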
Plotted as a chart, the difference becomes even more obvious:
    # This makes plots appear in the notebook
    %matplotlib inline
    # (an IPython magic command)

    # Plot yearly proportion of deaths at the two clinics
    # ... YOUR CODE FOR TASK 3 ...
    # Plot with year on the x-axis and death rate on the y-axis; label sets the legend entry.
    # The returned axes is kept as ax so that yearly1 and yearly2 can share one plot.
    ax = yearly1.plot(x="year", y="proportion_deaths", label="clinic1")
    yearly2.plot(x="year", y="proportion_deaths", label="clinic2", ax=ax)  # ax=ax draws on the same axes
    ax.set_ylabel("Proportion deaths")  # set the y-axis label with set_ylabel("name")
output:
plot reference:
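The shared-axes trick can be tried outside a notebook with the rough sketch below. It assumes matplotlib is installed and uses the non-interactive "Agg" backend in place of the %matplotlib inline magic; the frames `first` and `second` hold made-up numbers purely for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook
import pandas as pd

# Toy data for illustration only
first = pd.DataFrame({"year": [1, 2, 3], "rate": [0.10, 0.12, 0.08]})
second = pd.DataFrame({"year": [1, 2, 3], "rate": [0.04, 0.05, 0.03]})

# The first call creates the axes and returns it
ax = first.plot(x="year", y="rate", label="first")
# Passing ax=ax draws the second line on the same axes instead of a new figure
second.plot(x="year", y="rate", label="second", ax=ax)
ax.set_ylabel("rate")

# Number of lines drawn on the shared axes
print(len(ax.get_lines()))
```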
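From the analysis above we can see that the death rate at clinic 1 is higher than at clinic 2. Why is that (scratches head)? It turns out that the students delivering babies at clinic 1 were also working on corpses. So the professor ordered that from then on everyone must wash their hands after working on corpses!!! He then collected data from 1841 to 1849: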
    # Read datasets/monthly_deaths.csv into monthly
    # parse_dates makes the date column datetime values rather than strings,
    # so the rows have a time order and dates can be compared
    monthly = pd.read_csv("datasets/monthly_deaths.csv", parse_dates=["date"])

    # Calculate proportion of deaths per no. births
    # ... YOUR CODE FOR TASK 4 ...
    monthly["proportion_deaths"] = monthly["deaths"] / monthly["births"]

    # Print out the first rows in monthly
    # ... YOUR CODE FOR TASK 4 ...
    print(monthly.head(3))  # print the first three rows of monthly
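To see what parse_dates changes, here is a small sketch that reads the same made-up CSV text twice, once without and once with parse_dates. The CSV content and the names `csv_text`, `as_text`, `as_dates`, and `cutoff` are assumptions for illustration and are not taken from monthly_deaths.csv.

```python
import io
import pandas as pd

# Made-up CSV text for illustration only (not the real monthly_deaths.csv)
csv_text = "date,births\n1847-05-01,10\n1847-06-01,12\n1847-07-01,11\n"

as_text = pd.read_csv(io.StringIO(csv_text))
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Without parse_dates the column holds plain strings; with it, datetime values
print(as_text["date"].dtype)
print(as_dates["date"].dtype)

# A datetime column can be compared against a timestamp
cutoff = pd.to_datetime("1847-06-01")
print(as_dates.loc[as_dates["date"] >= cutoff])
```

The same kind of comparison is what makes it possible to split the monthly data at the date handwashing became mandatory.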
Did washing hands have any effect?
    # Plot monthly proportion of deaths
    # ... YOUR CODE FOR TASK 5 ...
    ax = monthly.plot(x="date", y="proportion_deaths", label="deaths after handwashing")
    ax.set_ylabel("Proportion deaths")
output:
Whoa, the line does seem to drop, though it's not very obvious:
    # Date when handwashing was made mandatory
    import pandas as pd
    handwashing_start = pd.to_datetime('1847-06-01')  # mark when the "handwashing incident" began

    # Split monthly into before and after handwashing_start
    before_washing = monthly.loc[monthly['date'] < handwashing_start]
    after_washing = monthly.loc[monthly['date'] >= handwashing_start]

    # Plot monthly proportion of deaths before and after handwashing
    # ... YOUR CODE FOR TASK 6 ...
    ax = before_washing.plot(x='date', y='proportion_deaths', label='before washing')
    after_washing.plot(x='date', y='proportion_deaths', label='after washing', ax=ax)
    ax.set_ylabel("Proportion deaths")
output:
Now that's impressive: the picture is crystal clear. But is handwashing really related to the drop in the death rate? Let's compare the means:
    # Difference in mean monthly proportion of deaths due to handwashing
    before_proportion = before_washing['proportion_deaths']
    after_proportion = after_washing['proportion_deaths']
    mean_diff = after_proportion.mean() - before_proportion.mean()
    mean_diff
output:
-0.08395660751183336
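As you can see, the death rate did fall by about 8 percentage points, so handwashing really does seem to work. But the data scientist felt the matter wasn't settled yet and went on to use bootstrap analysis (the "bootstrap method"?).
Reference: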
(too long no see)
    # A bootstrap analysis of the reduction of deaths due to handwashing
    boot_mean_diff = []  # start with an empty list
    for i in range(3000):  # repeat the experiment 3000 times
        # frac=1 with replace=True: draw as many values as the original, with replacement
        boot_before = before_proportion.sample(frac=1, replace=True)
        boot_after = after_proportion.sample(frac=1, replace=True)
        # compute one difference in means and append it to boot_mean_diff
        boot_mean_diff.append(boot_after.mean() - boot_before.mean())

    # Calculating a 95% confidence interval from boot_mean_diff
    confidence_interval = pd.Series(boot_mean_diff).quantile([0.025, 0.975])
    confidence_interval
One thing puzzled me here: if each of the 3000 runs merely shuffles before_proportion and after_proportion and then takes the difference of the means, why is boot_after.mean() - boot_before.mean() different every time? (Shouldn't it be identical? With all the samples present, the mean doesn't depend on the order they are arranged in.)
print(boot_mean_diff[0:20])
[-0.07787261202620424, -0.07424799825364967, -0.09358312005955502, -0.08894556209810614, -0.08087685009098905, -0.06429190709356139, -0.08068023440789948, -0.07240951438539092, -0.06750565112365006, -0.0676633601804324, -0.08713968457785505, -0.08382681590118775, -0.08280812612089627, -0.08059191110129257, -0.09227479693648963, -0.07786725171910112, -0.08150749269654012, -0.08903607701866195, -0.061659787819670464, -0.0809971940784796]
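To make the quantile step concrete, the sketch below applies the same quantile([0.025, 0.975]) call to just the 20 values printed above. This is only an illustration of how the interval is read off the bootstrap distribution; with 20 values instead of 3000 the result is rough and will not match the interval from the full run. The names `first_20`, `lo`, and `hi` are hypothetical.

```python
import pandas as pd

# The 20 bootstrap differences printed above, copied as-is
first_20 = [
    -0.07787261202620424, -0.07424799825364967, -0.09358312005955502,
    -0.08894556209810614, -0.08087685009098905, -0.06429190709356139,
    -0.08068023440789948, -0.07240951438539092, -0.06750565112365006,
    -0.0676633601804324, -0.08713968457785505, -0.08382681590118775,
    -0.08280812612089627, -0.08059191110129257, -0.09227479693648963,
    -0.07786725171910112, -0.08150749269654012, -0.08903607701866195,
    -0.061659787819670464, -0.0809971940784796,
]

# The 2.5% and 97.5% quantiles bound the middle 95% of the resampled differences
lo, hi = pd.Series(first_20).quantile([0.025, 0.975]).tolist()
print(lo, hi)
```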
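Answering my own question (shoegazing - - ):
Because replace=True, every value in each sample is drawn from the full original set, so a sample may or may not contain repeated values.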
test:
    import pandas as pd
    a = pd.read_csv('C:/Users/chenchutong/Desktop/1.csv')  # a short slice of some data, picked arbitrarily for the experiment
    b = a['births']
    print(b)
    print('---------')
    print(b.sample(frac=1, replace=True))
output:
    0    254
    1    239
    2    277
    3    255
    Name: births, dtype: int64
    ---------
    0    254
    1    239
    1    239
    2    277
    Name: births, dtype: int64
As you can see, sampling with replacement can pick the same value more than once (and leave others out), which is why the means differ.
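The contrast can be checked directly with a small sketch built on the four births values from the test above. Fixing random_state is an assumption added here only to make the runs reproducible; the original code does not set it.

```python
import pandas as pd

# The four births values from the test above
b = pd.Series([254, 239, 277, 255], name="births")

# Shuffling without replacement keeps every value exactly once, so the mean is unchanged
shuffled = b.sample(frac=1, replace=False, random_state=0)
print(shuffled.mean() == b.mean())

# Resampling with replacement can repeat or drop values, so the mean varies between draws
means = {b.sample(frac=1, replace=True, random_state=seed).mean() for seed in range(50)}
print(len(means) > 1)
```

So a pure reshuffle would indeed give the same mean every time, as the question assumed; it is the replacement that makes each bootstrap sample different.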
So handwashing reduced the proportion of deaths by between 6.7 and 10 percentage points, according to a 95% confidence interval. All in all, it would seem that Semmelweis had solid evidence that handwashing was a simple but highly effective procedure that could save many lives.
The tragedy is that, despite the evidence, Semmelweis' theory — that childbed fever was caused by some "substance" (what we today know as bacteria) from autopsy room corpses — was ridiculed by contemporary scientists. The medical community largely rejected his discovery and in 1849 he was forced to leave the Vienna General Hospital for good.
One reason for this was that statistics and statistical arguments were uncommon in medical science in the 1800s. Semmelweis published his data only as long tables of raw numbers, without any graphs or confidence intervals. Had he had access to the analysis we've just put together, he might have been more successful in getting the Viennese doctors to wash their hands.
    # The data Semmelweis collected points to that:
    doctors_should_wash_their_hands = True
that's all thank you~~~
dataset: