[python case] The DataCamp project "Dr. Semmelweis and the discovery of handwashing"
Published: 2019-06-19


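#I'm a data newbie (for real - - ). I started using DataCamp a few days ago and find it quite fun. Nearly all the exercises go from easy to hard, and most can be completed as long as you can follow the gist of the English.

#If I have time, I plan to write up some of what I learn along the way.

#If you are also learning on DataCamp, feel free to leave a comment so we can study together.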

 

 

#title

Dr. Semmelweis and the discovery of handwashing

##summary

Reanalyse the data behind one of the most important discoveries of modern medicine: Handwashing.

##skill

pandas foundations

 

The whole story is set against Ignaz Semmelweis's discovery in 1847:

In 1847 the Hungarian physician Ignaz Semmelweis makes a breakthrough discovery: He discovers handwashing. Contaminated hands were a major cause of childbed fever, and by enforcing handwashing at his hospital he saved hundreds of lives.

The project is split into 9 parts:

  • 1.Meet Dr. Ignaz Semmelweis
  • 2.The alarming number of deaths
  • 3.Death at the clinics
  • 4.The handwashing begins
  • 5.The effect of handwashing
  • 6.The effect of handwashing highlighted
  • 7.More handwashing, fewer deaths?
  • 8.A Bootstrap analysis of Semmelweis handwashing data
  • 9.The fate of Dr. Semmelweis

Below I annotate each part. It is all fairly basic (for real = =), but I still hope it helps people who, like me, are outsiders to data analysis with Python.

1. Meet Dr. Ignaz Semmelweis

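The old professor noticed that mothers who had just given birth often died of an illness called childbed fever, so he investigated and collected some data:

# importing modules
# ... YOUR CODE FOR TASK 1 ...
import pandas as pd  # import pandas under the alias pd
import csv  # import csv (not used below)

# Read datasets/yearly_deaths_by_clinic.csv into yearly
yearly = pd.read_csv('datasets/yearly_deaths_by_clinic.csv')  # load the CSV file into the yearly variable with pd.read_csv

# Print out yearly
# ... YOUR CODE FOR TASK 1 ...
print(yearly)  # print yearly to check the variable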

output:

    year  births  deaths    clinic
0   1841    3036     237  clinic 1
1   1842    3287     518  clinic 1
2   1843    3060     274  clinic 1
3   1844    3157     260  clinic 1
4   1845    3492     241  clinic 1
5   1846    4010     459  clinic 1
6   1841    2442      86  clinic 2
7   1842    2659     202  clinic 2
8   1843    2739     164  clinic 2
9   1844    2956      68  clinic 2
10  1845    3241      66  clinic 2
11  1846    3754     105  clinic 2

2. The alarming number of deaths

Looking at the output above, the old professor sensed that things were not so simple:

# Calculate proportion of deaths per no. births
# ... YOUR CODE FOR TASK 2 ...
yearly["proportion_deaths"] = yearly['deaths'] / yearly['births']  # add a proportion_deaths (death rate) column

# Extract clinic 1 data into yearly1 and clinic 2 data into yearly2
yearly1 = yearly.loc[yearly['clinic'] == 'clinic 1']  # select the clinic 1 rows using loc
yearly2 = yearly.loc[yearly['clinic'] == 'clinic 2']  # select the clinic 2 rows

# Print out yearly1
# ... YOUR CODE FOR TASK 2 ...
print(yearly1)

output:

   year  births  deaths    clinic  proportion_deaths
0  1841    3036     237  clinic 1           0.078063
1  1842    3287     518  clinic 1           0.157591
2  1843    3060     274  clinic 1           0.089542
3  1844    3157     260  clinic 1           0.082357
4  1845    3492     241  clinic 1           0.069015
5  1846    4010     459  clinic 1           0.114464

 

Reference for loc:

To select the rows whose column value equals the scalar some_value, use ==:

df.loc[df['column_name'] == some_value]
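A minimal sketch of that pattern, using three rows copied from the yearly output. The names yearly_sample, is_clinic_1 and clinic_1_rows are only illustrative and are not part of the DataCamp project.

```python
import pandas as pd

# Three rows copied from the yearly output earlier in this post
yearly_sample = pd.DataFrame({
    "year": [1841, 1842, 1841],
    "births": [3036, 3287, 2442],
    "deaths": [237, 518, 86],
    "clinic": ["clinic 1", "clinic 1", "clinic 2"],
})

# The comparison yields a boolean Series, one True/False per row
is_clinic_1 = yearly_sample["clinic"] == "clinic 1"

# .loc keeps only the rows where the mask is True
clinic_1_rows = yearly_sample.loc[is_clinic_1]
print(clinic_1_rows)
```

Writing the mask on its own line makes it easier to see that loc is simply filtering by True/False values.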

3. Death at the clinics

Once plotted, it becomes much more intuitive and obvious:

# This makes plots appear in the notebook
%matplotlib inline
# (the line above is a notebook magic command)

# Plot yearly proportion of deaths at the two clinics
# ... YOUR CODE FOR TASK 3 ...
ax = yearly1.plot(x="year", y="proportion_deaths", label="clinic1")  # plot with year on the x-axis and death rate on the y-axis; label sets the legend entry; the axes are saved as ax so that yearly1 and yearly2 can share one plot
yearly2.plot(x="year", y="proportion_deaths", label="clinic2", ax=ax)  # ax=ax draws onto the same axes
ax.set_ylabel("Proportion deaths")  # set the y-axis label with set_ylabel("name")

output:

Reference for plot:
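A minimal sketch of drawing two DataFrames on the same axes, using a few values taken from the yearly data. The Agg backend line is my own assumption so that the snippet also runs outside a notebook, and the names c1 and c2 are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import pandas as pd

# Clinic 1 proportions copied from the yearly1 output;
# clinic 2 proportions computed from deaths/births in the yearly output
c1 = pd.DataFrame({"year": [1841, 1842, 1843],
                   "proportion_deaths": [0.078063, 0.157591, 0.089542]})
c2 = pd.DataFrame({"year": [1841, 1842, 1843],
                   "proportion_deaths": [86 / 2442, 202 / 2659, 164 / 2739]})

# The first plot call creates the axes and returns them
ax = c1.plot(x="year", y="proportion_deaths", label="clinic1")
# Passing ax=ax draws the second line onto those same axes
c2.plot(x="year", y="proportion_deaths", label="clinic2", ax=ax)
ax.set_ylabel("Proportion deaths")

print(len(ax.get_lines()))  # how many lines ended up on the shared axes
```

Without ax=ax, the second plot call would open a new figure instead of adding to the first one.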

 

4. The handwashing begins

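From all the analysis above we can tell that the death rate at clinic 1 is higher than at clinic 2. Why is that (scratches head)? It turns out that the people delivering babies at clinic 1 were also studying corpses on the side. So the professor ordered that from then on everyone must wash their hands after working on corpses!!! He then collected data covering 1841 to 1849:

# Read datasets/monthly_deaths.csv into monthly
monthly = pd.read_csv("datasets/monthly_deaths.csv", parse_dates=["date"])  # load the new CSV into monthly; parse_dates makes the date column datetime values (not strings), so the dates are ordered and can be compared

# Calculate proportion of deaths per no. births
# ... YOUR CODE FOR TASK 4 ...
monthly["proportion_deaths"] = monthly["deaths"] / monthly["births"]

# Print out the first rows in monthly
# ... YOUR CODE FOR TASK 4 ...
print(monthly.head(3))  # print the first three rows of monthly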
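A small sketch of what parse_dates changes. The CSV text below is made up for illustration (it is not the real monthly_deaths.csv), io.StringIO stands in for a file on disk, and the names as_text, as_dates and cutoff are my own.

```python
import io
import pandas as pd

# Made-up rows for illustration only; not the real monthly_deaths.csv
csv_text = "date,births,deaths\n1847-05-01,100,10\n1847-06-01,100,5\n1847-07-01,100,2\n"

as_text = pd.read_csv(io.StringIO(csv_text))                         # date stays a text column
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])  # date is parsed into datetimes

print(pd.api.types.is_datetime64_any_dtype(as_text["date"]))
print(pd.api.types.is_datetime64_any_dtype(as_dates["date"]))

# With real datetimes, rows can be compared against a timestamp
cutoff = pd.to_datetime("1847-06-01")
before = as_dates.loc[as_dates["date"] < cutoff]
after = as_dates.loc[as_dates["date"] >= cutoff]
print(len(before), len(after))
```

Using < for one side and >= for the other means every row lands in exactly one of the two groups.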

5. The effect of handwashing

So did the handwashing have any effect?

# Plot monthly proportion of deaths
# ... YOUR CODE FOR TASK 5 ...
ax = monthly.plot(x="date", y="proportion_deaths", label="deaths after handwashing")
ax.set_ylabel("Proportion deaths")

output:

6. The effect of handwashing highlighted

Whoa, the line really does seem to drop, though it is not very obvious:

# Date when handwashing was made mandatory
import pandas as pd
handwashing_start = pd.to_datetime('1847-06-01')  # mark the date the 'handwashing incident' began

# Split monthly into before and after handwashing_start
before_washing = monthly.loc[monthly['date'] < handwashing_start]
after_washing = monthly.loc[monthly['date'] >= handwashing_start]

# Plot monthly proportion of deaths before and after handwashing
# ... YOUR CODE FOR TASK 6 ...
ax = before_washing.plot(x='date', y='proportion_deaths', label='before washing')
after_washing.plot(x='date', y='proportion_deaths', label='after washing', ax=ax)
ax.set_ylabel("Proportion deaths")

output:

7. More handwashing, fewer deaths?

Now that's awesome: it looks clear at a glance. But is handwashing really related to the drop in the death rate? Let's compute the means and see:

# Difference in mean monthly proportion of deaths due to handwashing
before_proportion = before_washing['proportion_deaths']
after_proportion = after_washing['proportion_deaths']
mean_diff = after_proportion.mean() - before_proportion.mean()
mean_diff

output:

-0.08395660751183336

 

8. A Bootstrap analysis of Semmelweis handwashing data

We can see that the death rate did fall by roughly 8 percentage points, so handwashing really does seem to work. But the data scientist felt the story was not over yet, and went on to use a bootstrap analysis ("自助法" in Chinese?):

Reference:

 (too long no see)

# A bootstrap analysis of the reduction of deaths due to handwashing
boot_mean_diff = []  # start with an empty list
for i in range(3000):  # repeat the experiment 3000 times
    boot_before = before_proportion.sample(frac=1, replace=True)  # frac=1 -> draw as many values as the original has, with replacement
    boot_after = after_proportion.sample(frac=1, replace=True)
    boot_mean_diff.append(boot_after.mean() - boot_before.mean())  # compute one difference in means and append it to boot_mean_diff

# Calculating a 95% confidence interval from boot_mean_diff
confidence_interval = pd.Series(boot_mean_diff).quantile([0.025, 0.975])
confidence_interval

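There was one thing here I did not understand at first: if each of the 3000 runs only shuffles before_proportion and after_proportion and then takes the difference of the means, why is the result of boot_after.mean() - boot_before.mean() different every time? (Shouldn't it always be the same? All the samples are right there, and the mean does not depend on the order they are arranged in.)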

print(boot_mean_diff[0:20])

 

[-0.07787261202620424, -0.07424799825364967, -0.09358312005955502, -0.08894556209810614, -0.08087685009098905, -0.06429190709356139, -0.08068023440789948, -0.07240951438539092, -0.06750565112365006, -0.0676633601804324, -0.08713968457785505, -0.08382681590118775, -0.08280812612089627, -0.08059191110129257, -0.09227479693648963, -0.07786725171910112, -0.08150749269654012, -0.08903607701866195, -0.061659787819670464, -0.0809971940784796]

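Answering my own question (shoegazing - - ):

Because replace=True, every draw is made from the full set (always the original sample), so a value may be picked more than once, or not at all.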

test:

import pandas as pd
a = pd.read_csv('C:/Users/chenchutong/Desktop/1.csv')  # a short slice of some data, picked at random for the experiment
b = a['births']
print(b)
print('---------')
print(b.sample(frac=1, replace=True))

output:

0    254
1    239
2    277
3    255
Name: births, dtype: int64
---------
0    254
1    239
1    239
2    277
Name: births, dtype: int64

As you can see, when sampling with replacement the drawn values can repeat, and that is what makes the means differ.
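To make the contrast explicit, here is a small sketch that puts a pure shuffle (replace=False) next to a resample (replace=True), using the four births values from the test output. The random_state argument is my addition so the run is repeatable, and the names shuffled and resampled are illustrative.

```python
import pandas as pd

# The four births values from the test output above
b = pd.Series([254, 239, 277, 255], name="births")

# Without replacement, frac=1 only shuffles: same values, so the same mean
shuffled = b.sample(frac=1, replace=False, random_state=0)

# With replacement, some values can appear twice and others not at all,
# so the mean is free to move
resampled = b.sample(frac=1, replace=True, random_state=0)

print(b.mean(), shuffled.mean(), resampled.mean())
```

So the intuition in my question was right for a plain shuffle; it just does not apply once replace=True is set.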

 

 

9. The fate of Dr. Semmelweis

So handwashing reduced the proportion of deaths by between 6.7 and 10 percentage points, according to a 95% confidence interval. All in all, it would seem that Semmelweis had solid evidence that handwashing was a simple but highly effective procedure that could save many lives.
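How such an interval is read off the bootstrap values can be sketched as follows. This is only an illustration under my own assumptions: the two input Series are made-up numbers rather than the Semmelweis data, bootstrap_ci is a hypothetical helper name, and NumPy's random generator is swapped in for Series.sample so the run can be seeded.

```python
import numpy as np
import pandas as pd

def bootstrap_ci(before, after, n_boot=3000, level=0.95, seed=0):
    """Percentile bootstrap interval for mean(after) - mean(before).

    Hypothetical helper for illustration; seed makes the run repeatable.
    """
    rng = np.random.default_rng(seed)
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement, keeping its original size
        boot_before = rng.choice(before, size=before.size, replace=True)
        boot_after = rng.choice(after, size=after.size, replace=True)
        diffs[i] = boot_after.mean() - boot_before.mean()
    tail = (1 - level) / 2
    # The interval is the middle `level` share of the bootstrap differences
    return np.quantile(diffs, [tail, 1 - tail])

# Made-up proportions for illustration only
before_demo = pd.Series([0.10, 0.12, 0.09, 0.15, 0.11, 0.13])
after_demo = pd.Series([0.02, 0.03, 0.01, 0.04, 0.02, 0.03])

low, high = bootstrap_ci(before_demo, after_demo)
print(low, high)
```

If both ends of the interval are below zero, as they are for these made-up numbers, the resampled differences consistently point to a lower proportion after the change.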

The tragedy is that, despite the evidence, Semmelweis' theory — that childbed fever was caused by some "substance" (what we today know as bacteria) from autopsy room corpses — was ridiculed by contemporary scientists. The medical community largely rejected his discovery and in 1849 he was forced to leave the Vienna General Hospital for good.

One reason for this was that statistics and statistical arguments were uncommon in medical science in the 1800s. Semmelweis only published his data as long tables of raw data, and he didn't show any graphs or confidence intervals. If he had had access to the analysis we've just put together, he might have been more successful in getting the Viennese doctors to wash their hands.

# The data Semmelweis collected points to that:
doctors_should_wash_their_hands = True

 

that's all thank you~~~

dataset:

 
