在我们的数据库中,灌入了上海出租车数据,大概有36亿条数据,开始主要的目的是测试MPP数据库性能,也顺手开展了一些出租车问题的研究。
在从出租车点数据转化为线数据过程中,遇到了一个奇怪的sql优化问题。
点数据表为taxi_point:
ius=# \d taxi_point Append-Only Columnar Table "public.taxi_point" Column | Type | Modifiers ---------+-----------------------------+----------- tid | integer | not null isempty | smallint | lon | real | lat | real | gpstime | timestamp without time zone | Checksum: t Distributed randomly
包含每段行程的表taxi_trip:
ius=# \d taxi_trip Table "public.taxi_trip" Column | Type | Modifiers -------------+-----------------------------+----------- tid | integer | not null isempty | smallint | not null pickup_lon | real | not null pickup_lat | real | not null dropoff_lon | real | not null dropoff_lat | real | not null distance | real | pickup_ts | timestamp without time zone | not null dropoff_ts | timestamp without time zone | not null geom | geometry(LineString,4326) | Indexes: "taxi_trip_pkey" PRIMARY KEY, btree (tid, isempty, pickup_ts) Distributed by: (tid, isempty, pickup_ts)
两个表对应的记录数量大致为千万级和十亿级别:
ius=# select count(*) from taxi_point; count ------------ 1480896950 (1 row) ius=# select count(*) from taxi_trip; count ---------- 25110980 (1 row)
现在,想从taxi_point表中更新每段行程的线数据到taxi_trip表中的geom字段,直观上写出来SQL语句:
update taxi_trip tp set geom=ST_SetSRID(ST_MakeLine(array(select ST_MakePoint(lon,lat) from taxi_point t where t.tid=tp.tid and t.gpstime>=tp.pickup_ts and t.gpstime<=tp.dropoff_ts order by gpstime asc)),4326)
没想到,这个sql执行非常慢,三天都没有执行完。于是就想进一步优化,看看能不能提高速度。首先在一个点上的数据进行测试:
update taxi_trip tp set geom=ST_SetSRID(ST_MakeLine(array(select ST_MakePoint(lon,lat) from taxi_point t where t.tid=tp.tid and t.gpstime>=tp.pickup_ts and t.gpstime<=tp.dropoff_ts order by gpstime asc)),4326) where tid=30050;
发现这个语句也很慢,搜索了一下tid为30050的记录数:
ius=# select count(*) from taxi_trip where tid=30050; count ------- 1042 (1 row) ius=# select count(*) from taxi_point where tid=30050; count ------- 92090 (1 row)
两个表中的相关数据都不大,为啥执行速度慢呢?再进行测试,将限定条件前置,看看是不是array select导致的问题:
update taxi_trip tp set geom=ST_SetSRID(ST_MakeLine(array(select ST_MakePoint(lon,lat) from taxi_point t where t.tid=30050 and t.tid=tp.tid and t.gpstime>=tp.pickup_ts and t.gpstime<=tp.dropoff_ts order by gpstime asc)),4326) where tid=30050;
有点诧异的是,这个SQL语句执行的速度很快!说明最外层的限定条件在查询优化时并没有传递到array select子查询语句中。那要解决此问题,就需要使用自定义函数来实现了,又遇到了坑:
create or replace function update_trip_geom() returns void as $$ declare curtid integer; usql text; begin for curtid in select distinct tid from taxi_trip order by tid loop raise notice 'Dealing with tid: %',curtid; update taxi_trip tp set geom=ST_SetSRID(ST_MakeLine(array(select ST_MakePoint(lon,lat) from taxi_point t where t.tid=curtid and t.gpstime>=tp.pickup_ts and t.gpstime<=tp.dropoff_ts order by gpstime asc)),4326) where tid=curtid; end loop; end; $$ LANGUAGE plpgsql;
执行这个函数,发现又回归了龟速级别,说明tid参数并没有进行查询优化,于是进一步修改,尝试使用文本sql来代替执行:
create or replace function update_trip_geom() returns void as $$ declare curtid integer; usql text; begin for curtid in select distinct tid from taxi_trip order by tid loop raise notice 'Dealing with tid: %',curtid; usql := 'update taxi_trip tp set geom=ST_SetSRID(ST_MakeLine(array(select ST_MakePoint(lon,lat) from taxi_point t where t.tid='||curtid||' and t.gpstime>=tp.pickup_ts and t.gpstime<=tp.dropoff_ts order by gpstime asc)),4326) where tid='||curtid||';'; execute usql; end loop; end; $$ LANGUAGE plpgsql;
终于回归到正常速度了。
发表回复