SQL问题:array select子查询优化

在我们的数据库中,灌入了上海出租车数据,大概有36亿条数据,开始主要的目的是测试MPP数据库性能,也顺手开展了一些出租车问题的研究。

在从出租车点数据转化为线数据过程中,遇到了一个奇怪的sql优化问题。

点数据表为taxi_point:

ius=# \d taxi_point
  Append-Only Columnar Table "public.taxi_point"
 Column  |            Type             | Modifiers 
---------+-----------------------------+-----------
 tid     | integer                     | not null
 isempty | smallint                    | 
 lon     | real                        | 
 lat     | real                        | 
 gpstime | timestamp without time zone | 
Checksum: t
Distributed randomly

包含每段行程的表taxi_trip:

ius=# \d taxi_trip
               Table "public.taxi_trip"
   Column    |            Type             | Modifiers 
-------------+-----------------------------+-----------
 tid         | integer                     | not null
 isempty     | smallint                    | not null
 pickup_lon  | real                        | not null
 pickup_lat  | real                        | not null
 dropoff_lon | real                        | not null
 dropoff_lat | real                        | not null
 distance    | real                        | 
 pickup_ts   | timestamp without time zone | not null
 dropoff_ts  | timestamp without time zone | not null
 geom        | geometry(LineString,4326)   | 
Indexes:
    "taxi_trip_pkey" PRIMARY KEY, btree (tid, isempty, pickup_ts)
Distributed by: (tid, isempty, pickup_ts)

两个表对应的记录数量大致为千万级和十亿级别:

ius=# select count(*) from taxi_point;
   count    
------------
 1480896950
(1 row)

ius=# select count(*) from taxi_trip;
  count   
----------
 25110980
(1 row)

现在,想从taxi_point表中更新每段行程的线数据到taxi_trip表中的geom字段,直观上写出来SQL语句:

update taxi_trip tp set geom=ST_SetSRID(ST_MakeLine(array(select ST_MakePoint(lon,lat) from taxi_point t where t.tid=tp.tid and t.gpstime>=tp.pickup_ts and t.gpstime<=tp.dropoff_ts order by gpstime asc)),4326)

没想到,这个sql执行非常慢,三天都没有执行完。于是就想进一步优化,看看能不能提高速度。首先在一个点上的数据进行测试:

update taxi_trip tp set geom=ST_SetSRID(ST_MakeLine(array(select ST_MakePoint(lon,lat) from taxi_point t where t.tid=tp.tid and t.gpstime>=tp.pickup_ts and t.gpstime<=tp.dropoff_ts order by gpstime asc)),4326) where tid=30050;

发现这个语句也很慢,搜索了一下tid为30050的记录数:

ius=# select count(*) from taxi_trip where tid=30050;
 count 
-------
  1042
(1 row)
ius=# select count(*) from taxi_point where tid=30050;
 count 
-------
 92090
(1 row)

两个表中的相关数据都不大,为啥执行速度慢呢?再进行测试,将限定条件前置,看看是不是array select导致的问题:

update taxi_trip tp set geom=ST_SetSRID(ST_MakeLine(array(select ST_MakePoint(lon,lat) from taxi_point t where t.tid=30050 and t.tid=tp.tid and t.gpstime>=tp.pickup_ts and t.gpstime<=tp.dropoff_ts order by gpstime asc)),4326) where tid=30050;

有点诧异的是,这个SQL语句执行的速度很快!说明最外层的限定条件在查询优化时并没有传递到array select子查询语句中。那要解决此问题,就需要使用自定义函数来实现了,又遇到了坑:

create or replace function update_trip_geom() returns void as $$
declare
curtid integer;
usql text;
begin
  for curtid in select distinct tid from taxi_trip order by tid 
  loop
     raise notice 'Dealing with tid: %',curtid;
     update taxi_trip tp set geom=ST_SetSRID(ST_MakeLine(array(select ST_MakePoint(lon,lat) from taxi_point t where t.tid=curtid and t.gpstime>=tp.pickup_ts and t.gpstime<=tp.dropoff_ts order by gpstime asc)),4326) where tid=curtid;
     
  end loop;
end;
$$ LANGUAGE plpgsql;

执行这个函数,发现又回归了龟速级别,说明tid参数并没有进行查询优化,于是进一步修改,尝试使用文本sql来代替执行:

create or replace function update_trip_geom() returns void as $$
declare
curtid integer;
usql text;
begin
  for curtid in select distinct tid from taxi_trip order by tid 
  loop
     raise notice 'Dealing with tid: %',curtid;
     usql := 'update taxi_trip tp set geom=ST_SetSRID(ST_MakeLine(array(select ST_MakePoint(lon,lat) from taxi_point t where t.tid='||curtid||' and t.gpstime>=tp.pickup_ts and t.gpstime<=tp.dropoff_ts order by gpstime asc)),4326) where tid='||curtid||';'; 
     execute usql; 
  end loop;
end;
$$ LANGUAGE plpgsql;

终于回归到正常速度了。


已发布

分类

,

来自

标签:

评论

发表回复

您的电子邮箱地址不会被公开。 必填项已用 * 标注