使用sed和awk对出租车数据进行清洗

拿到出租车数据，要做大量的数据清洗工作。里面经常有各种奇怪的错误。本文记录了使用sed和awk对出租车数据进行清洗所解决的几个问题。

看第一天的文本数据，感觉还比较正常，

head 20100501.txt

20100430,235958,QS,11087,121.569343,31.264883,33.6,150,1;,2010-04-30 23:59:59

20100430,235958,QS,11186,121.450367,31.340275,0,250,1;,2010-04-30 23:59:59

20100430,235957,QS,16569,121.434792,31.243405,8.4,38,1;,2010-04-30 23:59:59

20100430,235958,QS,11573,121.646227,31.267015,73.9,77,1;,2010-04-30 23:59:59

20100430,235958,QS,11199,121.357743,31.296647,49.3,252,1;,2010-04-30 23:59:59

20100430,235955,QS,10349,121.451343,31.247853,15.7,108,1;,2010-04-30 23:59:59

20100430,235958,QS,11277,121.463202,31.199197,0,353,1;,2010-04-30 23:59:59

20100430,235958,QS,11194,121.399487,31.245877,0,336,1;,2010-04-30 23:59:59

20100430,235958,QS,11790,121.42379,31.137098,0,245,1;,2010-04-30 23:59:59

20100430,235958,QS,18378,121.285398,31.32229,22.9,155,1;,2010-04-30 23:59:59

这里存在的问题比较明显，第一列和第二列明显是个日期和时间，但和最后一列的表达形式完全不一致，倒数第二列还多出一个分号。

1. 使用sed解决第一个问题：

可以使用sed将行首的数字提取出来，并将其格式类型转换为最后一列的格式，即：

sed 's%\(^2010\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\)\([0-9][0-9]\)%\1-\2-\3 \4:\5:\6%'

用头10行数据试验，发现可以得到正确结果：

head 20100501.txt | sed 's%\(^2010\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\)\([0-9][0-9]\)%\1-\2-\3 \4:\5:\6%'

2010-04-30 23:59:58,QS,11087,121.569343,31.264883,33.6,150,1;,2010-04-30 23:59:59

2010-04-30 23:59:58,QS,11186,121.450367,31.340275,0,250,1;,2010-04-30 23:59:59

2010-04-30 23:59:57,QS,16569,121.434792,31.243405,8.4,38,1;,2010-04-30 23:59:59

2010-04-30 23:59:58,QS,11573,121.646227,31.267015,73.9,77,1;,2010-04-30 23:59:59

2010-04-30 23:59:58,QS,11199,121.357743,31.296647,49.3,252,1;,2010-04-30 23:59:59

2010-04-30 23:59:55,QS,10349,121.451343,31.247853,15.7,108,1;,2010-04-30 23:59:59

2010-04-30 23:59:58,QS,11277,121.463202,31.199197,0,353,1;,2010-04-30 23:59:59

2010-04-30 23:59:58,QS,11194,121.399487,31.245877,0,336,1;,2010-04-30 23:59:59

2010-04-30 23:59:58,QS,11790,121.42379,31.137098,0,245,1;,2010-04-30 23:59:59

2010-04-30 23:59:58,QS,18378,121.285398,31.32229,22.9,155,1;,2010-04-30 23:59:59

2. 使用awk处理多出来的分号：

awk -F, 'BEGIN{OFS=","} gsub(",1;,",",1,"){print $3,$8,$6,$7,$4,$5,$1,$9}'

这样可以将分号处理掉，并将每行的数据按照自己想要的顺序进行排序。

继续试验，发现可以得到想要的结果：

head 20100501.txt | sed 's%\(^2010\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\)\([0-9][0-9]\)%\1-\2-\3 \4:\5:\6%' |  awk -F, 'BEGIN{OFS=","} gsub(",1;,",",1,"){print $3,$8,$6,$7,$4,$5,$1,$9}'

11087,1,33.6,150,121.569343,31.264883,2010-04-30 23:59:58,2010-04-30 23:59:59

11186,1,0,250,121.450367,31.340275,2010-04-30 23:59:58,2010-04-30 23:59:59

16569,1,8.4,38,121.434792,31.243405,2010-04-30 23:59:57,2010-04-30 23:59:59

11573,1,73.9,77,121.646227,31.267015,2010-04-30 23:59:58,2010-04-30 23:59:59

11199,1,49.3,252,121.357743,31.296647,2010-04-30 23:59:58,2010-04-30 23:59:59

10349,1,15.7,108,121.451343,31.247853,2010-04-30 23:59:55,2010-04-30 23:59:59

11277,1,0,353,121.463202,31.199197,2010-04-30 23:59:58,2010-04-30 23:59:59

11194,1,0,336,121.399487,31.245877,2010-04-30 23:59:58,2010-04-30 23:59:59

11790,1,0,245,121.42379,31.137098,2010-04-30 23:59:58,2010-04-30 23:59:59

18378,1,22.9,155,121.285398,31.32229,2010-04-30 23:59:58,2010-04-30 23:59:59

但对全文进行处理时，发现了其他的错误，有的数据行不是以2010打头的，有的数据行第二列不是6位数字，比如这些数据：

20112248,49,QS,11768,121.501867,31.207212,62.9,292,1;,2010-05-01 22:48:50

20150122,5108,QS,11646,121.377197,31.18925,8.3,196,1;,2010-05-01 22:51:10

20150122,5122,QS,10181,121.497255,31.241982,0,231,1;,2010-05-01 22:51:23

20150122,5419,QS,13711,121.494252,31.228403,45.5,143,1;,2010-05-01 22:54:20

20150122,5644,QS,10454,121.610453,31.180535,89.5,62,1;,2010-05-01 22:56:44

20192452,,QS,17153,121.474945,31.228778,0,153,1;,2010-05-01 09:24:53

20100501,2238,QS,15257,121.451717,31.33714,0,341,1;,2010-05-01 22:38:29

20100501,2244,QS,17624,121.413983,31.232202,0,88,1;,2010-05-01 22:41:55

20100501,2247,QS,18334,121.452265,31.117727,0,255,1;,2010-05-01 22:44:18

20100501,2246,QS,16304,121.432092,31.236767,23.4,239,1;,2010-05-01 22:45:08

20100501,2243,QS,17050,121.441003,31.226472,0,184,1;,2010-05-01 22:46:24

发现其中2015打头的数据还有规律可循，其他年份的数据规律不太明显（决定丢弃），第二列变为4位数字的看起来也有规律（只是秒数数据丢失，对研究而言不重要，决定用00填充为秒数）。

对于2015打头的数据，可以使用结合sed处理：

sed 's%\(^2015\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\)%2010-05-\2 \3:\4:\5%'

而对于第二列为4位数字的其他数据，可以结合sed和awk处理：

awk -F, '{if((length($2) == 4)) print $0}' | sed 's%\(^2010\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\),%\1-\2-\3 \4:\5:00,%'

最终定制了一个bash脚本进行处理，输入参数为要处理的日期：

#! /bin/bash

todate=$1

grep -v ^2015 $todate.txt | awk -F, '{if((length($2) == 6)) print $0}' | sed 's%\(^2010\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\)\([0-9][0-9]\),%\1-\2-\3 \4:\5:\6,%' |  sed 's%;%%' | awk -F, 'BEGIN{OFS=","} {print $3,$8,$6,$7,$4,$5,$1,$9,$10}' > $todate.csv

grep -v ^2015 $todate.txt | awk -F, '{if((length($2) == 4)) print $0}' | sed 's%\(^2010\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\),%\1-\2-\3 \4:\5:00,%' |  awk -F, 'BEGIN{OFS=","} gsub(",1;,",",1,"){print $3,$8,$6,$7,$4,$5,$1,$9,$10}' >> $todate.csv

grep ^2015 $todate.txt | sed 's%\(^2015\)\([0-9][0-9]\)\([0-9][0-9]\),\([0-9][0-9]\)\([0-9][0-9]\)%2010-05-\2 \3:\4:\5%' |  awk -F, 'BEGIN{OFS=","} gsub(",1;,",",1,"){print $3,$8,$6,$7,$4,$5,$1,$9,$10}' >> $todate.csv

grep -v QS $todate.csv > $todate-1.csv

mv $todate-1.csv $todate.csv

已发布

2018-06-10

分类

Linux

来自

wlx

标签：

awk, sed, taxi

《 “使用sed和awk对出租车数据进行清洗” 》有 13 条评论

Non presciption androxal

2025-08-17

purchase androxal cheap next day delivery

discount androxal american express canada

回复
how to order enclomiphene new zealand buy online

2025-08-17

ordering enclomiphene cheap fast shipping

ordering enclomiphene purchase online from india

回复
how to buy rifaximin generic ireland

2025-08-17

buy cheap rifaximin australia discount

how to buy rifaximin ireland over the counter

回复
best place to buy generic xifaxan

2025-08-17

cheap xifaxan uk cheap purchase buy

canada cheap xifaxan

回复
ordering staxyn price from cvs

2025-08-17

buy cheap staxyn cheap genuine

buy cheap staxyn australia online no prescription

回复
ordering avodart usa suppliers

2025-08-17

how to buy avodart australia cheap

purchase avodart purchase prescription

回复
buy cheap dutasteride cheap uk buy purchase

2025-08-17

discount dutasteride generic in us

get dutasteride uk cheapest

回复
order flexeril cyclobenzaprine generic online cheapest

2025-08-17

discount flexeril cyclobenzaprine lowest price

cheap flexeril cyclobenzaprine no prescription online

回复
cheapest buy gabapentin price south africa

2025-08-17

is generic gabapentin available

buy cheap gabapentin cheap in canada

回复
how to buy fildena generic low price

2025-08-17

online order fildena buy online uk

cheapest buy fildena generic online uk

回复
online order itraconazole generic best price

2025-08-17

how to order itraconazole australia over the counter

how to order itraconazole cost of tablet

回复
koupit kamagra online kanada

2025-08-17

kamagra s doruﾄ稿nﾃｭm v sobotu

online kamagra

回复
commander générique kamagra

2025-08-17

generique kamagra prescrire eu medicament pharmacie

kamagra de pharmaciens canadiens

回复

使用sed和awk对出租车数据进行清洗

评论

《 “使用sed和awk对出租车数据进行清洗” 》 有 13 条评论

回复 how to order enclomiphene new zealand buy online 取消回复

《 “使用sed和awk对出租车数据进行清洗” 》有 13 条评论