SQL问题：array select子查询优化

在我们的数据库中，灌入了上海出租车数据，大概有36亿条数据，开始主要的目的是测试MPP数据库性能，也顺手开展了一些出租车问题的研究。

在从出租车点数据转化为线数据过程中，遇到了一个奇怪的sql优化问题。

点数据表为taxi_point：

ius=# \d taxi_point
  Append-Only Columnar Table "public.taxi_point"
 Column  |            Type             | Modifiers 
---------+-----------------------------+-----------
 tid     | integer                     | not null
 isempty | smallint                    | 
 lon     | real                        | 
 lat     | real                        | 
 gpstime | timestamp without time zone | 
Checksum: t
Distributed randomly

包含每段行程的表taxi_trip：

ius=# \d taxi_trip
               Table "public.taxi_trip"
   Column    |            Type             | Modifiers 
-------------+-----------------------------+-----------
 tid         | integer                     | not null
 isempty     | smallint                    | not null
 pickup_lon  | real                        | not null
 pickup_lat  | real                        | not null
 dropoff_lon | real                        | not null
 dropoff_lat | real                        | not null
 distance    | real                        | 
 pickup_ts   | timestamp without time zone | not null
 dropoff_ts  | timestamp without time zone | not null
 geom        | geometry(LineString,4326)   | 
Indexes:
    "taxi_trip_pkey" PRIMARY KEY, btree (tid, isempty, pickup_ts)
Distributed by: (tid, isempty, pickup_ts)

两个表对应的记录数量大致为千万级和十亿级别：

ius=# select count(*) from taxi_point;
   count    
------------
 1480896950
(1 row)

ius=# select count(*) from taxi_trip;
  count   
----------
 25110980
(1 row)

现在，想从taxi_point表中更新每段行程的线数据到taxi_trip表中的geom字段，直观上写出来SQL语句：

update taxi_trip tp set geom=ST_SetSRID(ST_MakeLine(array(select ST_MakePoint(lon,lat) from taxi_point t where t.tid=tp.tid and t.gpstime>=tp.pickup_ts and t.gpstime<=tp.dropoff_ts order by gpstime asc)),4326)

没想到，这个sql执行非常慢，三天都没有执行完。于是就想进一步优化，看看能不能提高速度。首先在一个点上的数据进行测试：

update taxi_trip tp set geom=ST_SetSRID(ST_MakeLine(array(select ST_MakePoint(lon,lat) from taxi_point t where t.tid=tp.tid and t.gpstime>=tp.pickup_ts and t.gpstime<=tp.dropoff_ts order by gpstime asc)),4326) where tid=30050;

发现这个语句也很慢，搜索了一下tid为30050的记录数：

ius=# select count(*) from taxi_trip where tid=30050;
 count 
-------
  1042
(1 row)
ius=# select count(*) from taxi_point where tid=30050;
 count 
-------
 92090
(1 row)

两个表中的相关数据都不大，为啥执行速度慢呢？再进行测试，将限定条件前置，看看是不是array select导致的问题：

update taxi_trip tp set geom=ST_SetSRID(ST_MakeLine(array(select ST_MakePoint(lon,lat) from taxi_point t where t.tid=30050 and t.tid=tp.tid and t.gpstime>=tp.pickup_ts and t.gpstime<=tp.dropoff_ts order by gpstime asc)),4326) where tid=30050;

有点诧异的是，这个SQL语句执行的速度很快！说明最外层的限定条件在查询优化时并没有传递到array select子查询语句中。那要解决此问题，就需要使用自定义函数来实现了，又遇到了坑：

create or replace function update_trip_geom() returns void as $$
declare
curtid integer;
usql text;
begin
  for curtid in select distinct tid from taxi_trip order by tid 
  loop
     raise notice 'Dealing with tid: %',curtid;
     update taxi_trip tp set geom=ST_SetSRID(ST_MakeLine(array(select ST_MakePoint(lon,lat) from taxi_point t where t.tid=curtid and t.gpstime>=tp.pickup_ts and t.gpstime<=tp.dropoff_ts order by gpstime asc)),4326) where tid=curtid;
     
  end loop;
end;
$$ LANGUAGE plpgsql;

执行这个函数，发现又回归了龟速级别，说明tid参数并没有进行查询优化，于是进一步修改，尝试使用文本sql来代替执行：

create or replace function update_trip_geom() returns void as $$
declare
curtid integer;
usql text;
begin
  for curtid in select distinct tid from taxi_trip order by tid 
  loop
     raise notice 'Dealing with tid: %',curtid;
     usql := 'update taxi_trip tp set geom=ST_SetSRID(ST_MakeLine(array(select ST_MakePoint(lon,lat) from taxi_point t where t.tid='||curtid||' and t.gpstime>=tp.pickup_ts and t.gpstime<=tp.dropoff_ts order by gpstime asc)),4326) where tid='||curtid||';'; 
     execute usql; 
  end loop;
end;
$$ LANGUAGE plpgsql;

终于回归到正常速度了。

SQL问题：array select子查询优化

评论

发表回复取消回复

SQL问题：array select子查询优化

评论

发表回复 取消回复

发表回复取消回复