使用sed和awk对出租车数据进行清洗

拿到出租车数据,要做大量的数据清洗工作。里面经常有各种奇怪的错误。本文记录了使用sed和awk对出租车数据进行清洗所解决的几个问题。

看第一天的文本数据,感觉还比较正常,

head 20100501.txt 

20100430,235958,QS,11087,121.569343,31.264883,33.6,150,1;,2010-04-30 23:59:59

20100430,235958,QS,11186,121.450367,31.340275,0,250,1;,2010-04-30 23:59:59

20100430,235957,QS,16569,121.434792,31.243405,8.4,38,1;,2010-04-30 23:59:59

20100430,235958,QS,11573,121.646227,31.267015,73.9,77,1;,2010-04-30 23:59:59

20100430,235958,QS,11199,121.357743,31.296647,49.3,252,1;,2010-04-30 23:59:59

20100430,235955,QS,10349,121.451343,31.247853,15.7,108,1;,2010-04-30 23:59:59

20100430,235958,QS,11277,121.463202,31.199197,0,353,1;,2010-04-30 23:59:59

20100430,235958,QS,11194,121.399487,31.245877,0,336,1;,2010-04-30 23:59:59

20100430,235958,QS,11790,121.42379,31.137098,0,245,1;,2010-04-30 23:59:59

20100430,235958,QS,18378,121.285398,31.32229,22.9,155,1;,2010-04-30 23:59:59

这里存在的问题比较明显,第一列和第二列明显是个日期和时间,但和最后一列的表达形式完全不一致,倒数第二列还多出一个分号。

1. 使用sed解决第一个问题:

可以使用sed将行首的数字提取出来,并将其格式类型转换为最后一列的格式,即:

 

sed 's%\(^2010\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\)\([0-9][0-9]\)%\1-\2-\3 \4:\5:\6%'

用头10行数据试验,发现可以得到正确结果:

head 20100501.txt | sed 's%\(^2010\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\)\([0-9][0-9]\)%\1-\2-\3 \4:\5:\6%'

2010-04-30 23:59:58,QS,11087,121.569343,31.264883,33.6,150,1;,2010-04-30 23:59:59

2010-04-30 23:59:58,QS,11186,121.450367,31.340275,0,250,1;,2010-04-30 23:59:59

2010-04-30 23:59:57,QS,16569,121.434792,31.243405,8.4,38,1;,2010-04-30 23:59:59

2010-04-30 23:59:58,QS,11573,121.646227,31.267015,73.9,77,1;,2010-04-30 23:59:59

2010-04-30 23:59:58,QS,11199,121.357743,31.296647,49.3,252,1;,2010-04-30 23:59:59

2010-04-30 23:59:55,QS,10349,121.451343,31.247853,15.7,108,1;,2010-04-30 23:59:59

2010-04-30 23:59:58,QS,11277,121.463202,31.199197,0,353,1;,2010-04-30 23:59:59

2010-04-30 23:59:58,QS,11194,121.399487,31.245877,0,336,1;,2010-04-30 23:59:59

2010-04-30 23:59:58,QS,11790,121.42379,31.137098,0,245,1;,2010-04-30 23:59:59

2010-04-30 23:59:58,QS,18378,121.285398,31.32229,22.9,155,1;,2010-04-30 23:59:59

2.  使用awk处理多出来的分号:

awk -F, 'BEGIN{OFS=","} gsub(",1;,",",1,"){print $3,$8,$6,$7,$4,$5,$1,$9}'

这样可以将分号处理掉,并将每行的数据按照自己想要的顺序进行排序。

继续试验,发现可以得到想要的结果:

head 20100501.txt | sed 's%\(^2010\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\)\([0-9][0-9]\)%\1-\2-\3 \4:\5:\6%' |  awk -F, 'BEGIN{OFS=","} gsub(",1;,",",1,"){print $3,$8,$6,$7,$4,$5,$1,$9}'

11087,1,33.6,150,121.569343,31.264883,2010-04-30 23:59:58,2010-04-30 23:59:59

11186,1,0,250,121.450367,31.340275,2010-04-30 23:59:58,2010-04-30 23:59:59

16569,1,8.4,38,121.434792,31.243405,2010-04-30 23:59:57,2010-04-30 23:59:59

11573,1,73.9,77,121.646227,31.267015,2010-04-30 23:59:58,2010-04-30 23:59:59

11199,1,49.3,252,121.357743,31.296647,2010-04-30 23:59:58,2010-04-30 23:59:59

10349,1,15.7,108,121.451343,31.247853,2010-04-30 23:59:55,2010-04-30 23:59:59

11277,1,0,353,121.463202,31.199197,2010-04-30 23:59:58,2010-04-30 23:59:59

11194,1,0,336,121.399487,31.245877,2010-04-30 23:59:58,2010-04-30 23:59:59

11790,1,0,245,121.42379,31.137098,2010-04-30 23:59:58,2010-04-30 23:59:59

18378,1,22.9,155,121.285398,31.32229,2010-04-30 23:59:58,2010-04-30 23:59:59

但对全文进行处理时,发现了其他的错误,有的数据行不是以2010打头的,有的数据行第二列不是6位数字,比如这些数据:

20112248,49,QS,11768,121.501867,31.207212,62.9,292,1;,2010-05-01 22:48:50

20150122,5108,QS,11646,121.377197,31.18925,8.3,196,1;,2010-05-01 22:51:10

20150122,5122,QS,10181,121.497255,31.241982,0,231,1;,2010-05-01 22:51:23

20150122,5419,QS,13711,121.494252,31.228403,45.5,143,1;,2010-05-01 22:54:20

20150122,5644,QS,10454,121.610453,31.180535,89.5,62,1;,2010-05-01 22:56:44

20192452,,QS,17153,121.474945,31.228778,0,153,1;,2010-05-01 09:24:53

20100501,2238,QS,15257,121.451717,31.33714,0,341,1;,2010-05-01 22:38:29

20100501,2244,QS,17624,121.413983,31.232202,0,88,1;,2010-05-01 22:41:55

20100501,2247,QS,18334,121.452265,31.117727,0,255,1;,2010-05-01 22:44:18

20100501,2246,QS,16304,121.432092,31.236767,23.4,239,1;,2010-05-01 22:45:08

20100501,2243,QS,17050,121.441003,31.226472,0,184,1;,2010-05-01 22:46:24

 

发现其中2015打头的数据还有规律可循,其他年份的数据规律不太明显(决定丢弃),第二列变为4位数字的看起来也有规律(只是秒数数据丢失,对研究而言不重要,决定用00填充为秒数)。

对于2015打头的数据,可以使用结合sed处理:

sed 's%\(^2015\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\)%2010-05-\2 \3:\4:\5%'

而对于第二列为4位数字的其他数据,可以结合sed和awk处理:

awk -F, '{if((length($2) == 4)) print $0}' | sed 's%\(^2010\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\),%\1-\2-\3 \4:\5:00,%'

最终定制了一个bash脚本进行处理,输入参数为要处理的日期:

#! /bin/bash

todate=$1

grep -v ^2015 $todate.txt | awk -F, '{if((length($2) == 6)) print $0}' | sed 's%\(^2010\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\)\([0-9][0-9]\),%\1-\2-\3 \4:\5:\6,%' |  sed 's%;%%' | awk -F, 'BEGIN{OFS=","} {print $3,$8,$6,$7,$4,$5,$1,$9,$10}' > $todate.csv

grep -v ^2015 $todate.txt | awk -F, '{if((length($2) == 4)) print $0}' | sed 's%\(^2010\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\),%\1-\2-\3 \4:\5:00,%' |  awk -F, 'BEGIN{OFS=","} gsub(",1;,",",1,"){print $3,$8,$6,$7,$4,$5,$1,$9,$10}' >> $todate.csv

grep ^2015 $todate.txt | sed 's%\(^2015\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\)%2010-05-\2 \3:\4:\5%' |  awk -F, 'BEGIN{OFS=","} gsub(",1;,",",1,"){print $3,$8,$6,$7,$4,$5,$1,$9,$10}' >> $todate.csv

grep -v QS $todate.csv > $todate-1.csv

mv $todate-1.csv $todate.csv

 


已发布

分类

来自

标签:

评论

《 “使用sed和awk对出租车数据进行清洗” 》 有 13 条评论

  1. purchase androxal cheap next day delivery

    discount androxal american express canada

  2. ordering enclomiphene cheap fast shipping

    ordering enclomiphene purchase online from india

  3. buy cheap rifaximin australia discount

    how to buy rifaximin ireland over the counter

  4. cheap xifaxan uk cheap purchase buy

    canada cheap xifaxan

  5. buy cheap staxyn cheap genuine

    buy cheap staxyn australia online no prescription

  6. how to buy avodart australia cheap

    purchase avodart purchase prescription

  7. discount dutasteride generic in us

    get dutasteride uk cheapest

  8. discount flexeril cyclobenzaprine lowest price

    cheap flexeril cyclobenzaprine no prescription online

  9. is generic gabapentin available

    buy cheap gabapentin cheap in canada

  10. online order fildena buy online uk

    cheapest buy fildena generic online uk

  11. how to order itraconazole australia over the counter

    how to order itraconazole cost of tablet

  12. kamagra s doruト稿nテュm v sobotu

    online kamagra

  13. generique kamagra prescrire eu medicament pharmacie

    kamagra de pharmaciens canadiens

回复 how to order enclomiphene new zealand buy online 取消回复

您的邮箱地址不会被公开。 必填项已用 * 标注