使用sed和awk对出租车数据进行清洗

拿到出租车数据,要做大量的数据清洗工作。里面经常有各种奇怪的错误。本文记录了使用sed和awk对出租车数据进行清洗所解决的几个问题。

看第一天的文本数据,感觉还比较正常,

head 20100501.txt 

20100430,235958,QS,11087,121.569343,31.264883,33.6,150,1;,2010-04-30 23:59:59

20100430,235958,QS,11186,121.450367,31.340275,0,250,1;,2010-04-30 23:59:59

20100430,235957,QS,16569,121.434792,31.243405,8.4,38,1;,2010-04-30 23:59:59

20100430,235958,QS,11573,121.646227,31.267015,73.9,77,1;,2010-04-30 23:59:59

20100430,235958,QS,11199,121.357743,31.296647,49.3,252,1;,2010-04-30 23:59:59

20100430,235955,QS,10349,121.451343,31.247853,15.7,108,1;,2010-04-30 23:59:59

20100430,235958,QS,11277,121.463202,31.199197,0,353,1;,2010-04-30 23:59:59

20100430,235958,QS,11194,121.399487,31.245877,0,336,1;,2010-04-30 23:59:59

20100430,235958,QS,11790,121.42379,31.137098,0,245,1;,2010-04-30 23:59:59

20100430,235958,QS,18378,121.285398,31.32229,22.9,155,1;,2010-04-30 23:59:59

这里存在的问题比较明显,第一列和第二列明显是个日期和时间,但和最后一列的表达形式完全不一致,倒数第二列还多出一个分号。

1. 使用sed解决第一个问题:

可以使用sed将行首的数字提取出来,并将其格式类型转换为最后一列的格式,即:

 

sed 's%\(^2010\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\)\([0-9][0-9]\)%\1-\2-\3 \4:\5:\6%'

用头10行数据试验,发现可以得到正确结果:

head 20100501.txt | sed 's%\(^2010\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\)\([0-9][0-9]\)%\1-\2-\3 \4:\5:\6%'

2010-04-30 23:59:58,QS,11087,121.569343,31.264883,33.6,150,1;,2010-04-30 23:59:59

2010-04-30 23:59:58,QS,11186,121.450367,31.340275,0,250,1;,2010-04-30 23:59:59

2010-04-30 23:59:57,QS,16569,121.434792,31.243405,8.4,38,1;,2010-04-30 23:59:59

2010-04-30 23:59:58,QS,11573,121.646227,31.267015,73.9,77,1;,2010-04-30 23:59:59

2010-04-30 23:59:58,QS,11199,121.357743,31.296647,49.3,252,1;,2010-04-30 23:59:59

2010-04-30 23:59:55,QS,10349,121.451343,31.247853,15.7,108,1;,2010-04-30 23:59:59

2010-04-30 23:59:58,QS,11277,121.463202,31.199197,0,353,1;,2010-04-30 23:59:59

2010-04-30 23:59:58,QS,11194,121.399487,31.245877,0,336,1;,2010-04-30 23:59:59

2010-04-30 23:59:58,QS,11790,121.42379,31.137098,0,245,1;,2010-04-30 23:59:59

2010-04-30 23:59:58,QS,18378,121.285398,31.32229,22.9,155,1;,2010-04-30 23:59:59

2.  使用awk处理多出来的分号:

awk -F, 'BEGIN{OFS=","} gsub(",1;,",",1,"){print $3,$8,$6,$7,$4,$5,$1,$9}'

这样可以将分号处理掉,并将每行的数据按照自己想要的顺序进行排序。

继续试验,发现可以得到想要的结果:

head 20100501.txt | sed 's%\(^2010\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\)\([0-9][0-9]\)%\1-\2-\3 \4:\5:\6%' |  awk -F, 'BEGIN{OFS=","} gsub(",1;,",",1,"){print $3,$8,$6,$7,$4,$5,$1,$9}'

11087,1,33.6,150,121.569343,31.264883,2010-04-30 23:59:58,2010-04-30 23:59:59

11186,1,0,250,121.450367,31.340275,2010-04-30 23:59:58,2010-04-30 23:59:59

16569,1,8.4,38,121.434792,31.243405,2010-04-30 23:59:57,2010-04-30 23:59:59

11573,1,73.9,77,121.646227,31.267015,2010-04-30 23:59:58,2010-04-30 23:59:59

11199,1,49.3,252,121.357743,31.296647,2010-04-30 23:59:58,2010-04-30 23:59:59

10349,1,15.7,108,121.451343,31.247853,2010-04-30 23:59:55,2010-04-30 23:59:59

11277,1,0,353,121.463202,31.199197,2010-04-30 23:59:58,2010-04-30 23:59:59

11194,1,0,336,121.399487,31.245877,2010-04-30 23:59:58,2010-04-30 23:59:59

11790,1,0,245,121.42379,31.137098,2010-04-30 23:59:58,2010-04-30 23:59:59

18378,1,22.9,155,121.285398,31.32229,2010-04-30 23:59:58,2010-04-30 23:59:59

但对全文进行处理时,发现了其他的错误,有的数据行不是以2010打头的,有的数据行第二列不是6位数字,比如这些数据:

20112248,49,QS,11768,121.501867,31.207212,62.9,292,1;,2010-05-01 22:48:50

20150122,5108,QS,11646,121.377197,31.18925,8.3,196,1;,2010-05-01 22:51:10

20150122,5122,QS,10181,121.497255,31.241982,0,231,1;,2010-05-01 22:51:23

20150122,5419,QS,13711,121.494252,31.228403,45.5,143,1;,2010-05-01 22:54:20

20150122,5644,QS,10454,121.610453,31.180535,89.5,62,1;,2010-05-01 22:56:44

20192452,,QS,17153,121.474945,31.228778,0,153,1;,2010-05-01 09:24:53

20100501,2238,QS,15257,121.451717,31.33714,0,341,1;,2010-05-01 22:38:29

20100501,2244,QS,17624,121.413983,31.232202,0,88,1;,2010-05-01 22:41:55

20100501,2247,QS,18334,121.452265,31.117727,0,255,1;,2010-05-01 22:44:18

20100501,2246,QS,16304,121.432092,31.236767,23.4,239,1;,2010-05-01 22:45:08

20100501,2243,QS,17050,121.441003,31.226472,0,184,1;,2010-05-01 22:46:24

 

发现其中2015打头的数据还有规律可循,其他年份的数据规律不太明显(决定丢弃),第二列变为4位数字的看起来也有规律(只是秒数数据丢失,对研究而言不重要,决定用00填充为秒数)。

对于2015打头的数据,可以使用结合sed处理:

sed 's%\(^2015\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\)%2010-05-\2 \3:\4:\5%'

而对于第二列为4位数字的其他数据,可以结合sed和awk处理:

awk -F, '{if((length($2) == 4)) print $0}' | sed 's%\(^2010\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\),%\1-\2-\3 \4:\5:00,%'

最终定制了一个bash脚本进行处理,输入参数为要处理的日期:

#! /bin/bash

todate=$1

grep -v ^2015 $todate.txt | awk -F, '{if((length($2) == 6)) print $0}' | sed 's%\(^2010\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\)\([0-9][0-9]\),%\1-\2-\3 \4:\5:\6,%' |  sed 's%;%%' | awk -F, 'BEGIN{OFS=","} {print $3,$8,$6,$7,$4,$5,$1,$9,$10}' > $todate.csv

grep -v ^2015 $todate.txt | awk -F, '{if((length($2) == 4)) print $0}' | sed 's%\(^2010\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\),%\1-\2-\3 \4:\5:00,%' |  awk -F, 'BEGIN{OFS=","} gsub(",1;,",",1,"){print $3,$8,$6,$7,$4,$5,$1,$9,$10}' >> $todate.csv

grep ^2015 $todate.txt | sed 's%\(^2015\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\)%2010-05-\2 \3:\4:\5%' |  awk -F, 'BEGIN{OFS=","} gsub(",1;,",",1,"){print $3,$8,$6,$7,$4,$5,$1,$9,$10}' >> $todate.csv

grep -v QS $todate.csv > $todate-1.csv

mv $todate-1.csv $todate.csv

 


已发布

分类

来自

标签:

评论

发表回复

您的电子邮箱地址不会被公开。 必填项已用 * 标注