Re: strange sql behavior

On 2016-02-01 12:35:35 -0600, Yu Nie wrote:
> Recently I have been working with a large amount of taxi GIS data and have
> encountered some weird performance issues.  I am hoping someone in this
> community can help me figure it out.
> 
> The taxi data were loaded in 5-minute blocks into a table.  I have two separate
> such tables: one stores a month of data with about 700 million rows, the other
> stores about 10 days of data with about 300 million rows.  The two tables have
> exactly the same schema and indexes. There are two indexes: one on taxiid
> (text), and the other on the time stamp (date time).  In order to process the
> data, I need to get all points for a single taxi; to do that, I use something
> like:
>  select * from table1 where taxiid = 'SZB00S41' order by time;
> What puzzled me greatly is that this query runs consistently much faster on
> the large table than on the small table, which seems to contradict
> intuition.
[...]

> Results for the small table: it took 141 seconds to finish.  The planning time
> is 85256.31
> 
> "Sort  (cost=85201.05..85256.31 rows=22101 width=55) (actual time=
> 141419.499..141420.025 rows=20288 loops=1)"
> "  Sort Key: "time""
> "  Sort Method: quicksort  Memory: 3622kB"
> "  Buffers: shared hit=92 read=19816"
> "  ->  Bitmap Heap Scan on data2013_01w  (cost=515.86..83606.27 rows=22101
> width=55) (actual time=50.762..141374.777 rows=20288 loops=1)"
> "        Recheck Cond: ((taxiid)::text = 'SZB00S41'::text)"
> "        Heap Blocks: exact=19826"
> "        Buffers: shared hit=92 read=19816"
                                  ^^^^^^^^^^
[...]
> 
> Results for the large table: it took 5 seconds to finish.  The planning time is
> 252077.10
> "Sort  (cost=251913.32..252077.10 rows=65512 width=55) (actual time=
> 5038.571..5039.765 rows=44204 loops=1)"
> "  Sort Key: "time""
> "  Sort Method: quicksort  Memory: 7753kB"
> "  Buffers: shared hit=2 read=7543"
> "  ->  Bitmap Heap Scan on data2011_01  (cost=1520.29..246672.53 rows=65512
> width=55) (actual time=36.935..5017.463 rows=44204 loops=1)"
> "        Recheck Cond: ((taxiid)::text = 'SZB00S41'::text)"
> "        Heap Blocks: exact=7372"
> "        Buffers: shared hit=2 read=7543"
                                 ^^^^^^^^^
[....]

The obvious difference is that the query on the smaller table has to
read about 2.6 times as many blocks from disk (19816 vs. 7543), even
though it returns fewer than half as many records (20288 vs. 44204).
This suggests that the data for a single taxi is more localized in the
larger table. In addition, the average read time per block on the
smaller table is about 7 ms, which is a typical random seek time for a
disk. So the blocks are probably scattered randomly through the table.
For the larger table, the average is well below 1 ms, so there are
probably many consecutive blocks to read.
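
The per-block figures above can be reproduced directly from the numbers
in the two quoted plans. This is only a worked calculation, run as a
plain SELECT with no table access; the constants are the "actual time"
and "read=" values of the two Bitmap Heap Scan nodes.

```sql
-- Constants copied from the quoted EXPLAIN output.
SELECT round(141374.777 / 19816, 2) AS ms_per_block_small,  -- about 7.13
       round(5017.463 / 7543, 2)    AS ms_per_block_large,  -- about 0.67
       round(19816.0 / 7543, 2)     AS block_read_ratio;    -- about 2.63
```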

There are 2880 5-minute intervals in 10 days. You have about 22k records
per taxi, so there are about 7.7 records for each taxi per interval.
This is close to the number of records per block in your second query
(65512/7372 = 8.9 using the estimated row count; the actual row count
gives 44204/7372 = 6.0). I suspect that the records in your larger
table are sorted by taxiid within each interval.
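
One way to check this suspicion is to count how many distinct heap
blocks hold the rows of one taxi. The following is a sketch, not
something from the original thread: it relies on the system column ctid
(block number, tuple index) and on the commonly used cast through text
and point to extract the block number. Run it against each table in
turn. Note that it has to read the same blocks as the original query,
so it will be just as slow on a cold cache.

```sql
-- Rows and distinct heap blocks for one taxi; a high rows_per_block
-- means the taxi's rows are packed together physically.
SELECT count(*)                               AS n_rows,
       count(DISTINCT (ctid::text::point)[0]) AS n_blocks,
       round(count(*)::numeric
             / count(DISTINCT (ctid::text::point)[0]), 1) AS rows_per_block
FROM   data2011_01
WHERE  taxiid = 'SZB00S41';
```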

You can almost certainly get a similar speedup by sorting each 5 minute
interval by taxi id before appending it to the table.
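
A minimal sketch of such a sorted append, assuming each 5-minute block
is first loaded into a staging table with the same columns as the
target. The name staging_block is made up for illustration; your load
process may look quite different.

```sql
-- staging_block is a hypothetical table holding one freshly loaded
-- 5-minute block. Inserting in taxiid order places each taxi's rows
-- for that interval next to each other in the heap.
INSERT INTO data2013_01w
SELECT *
FROM   staging_block
ORDER  BY taxiid, "time";

TRUNCATE staging_block;  -- ready for the next block
```

This only helps as long as new rows are appended at the end of the
table; if the table has free space from deleted rows, new rows may be
placed there instead.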

If querying by taxiid is typical and your table is static, you should
consider clustering the table by taxiid. If your table is updated every
5 minutes, you could partition it by day and cluster each partition as
soon as it is no longer being written to.
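
For the static case, clustering could look like the sketch below. The
index name is an assumption, since the thread does not give the real
one; use whatever your index on taxiid is called.

```sql
-- Rewrites the table in taxiid index order. CLUSTER takes an
-- ACCESS EXCLUSIVE lock and needs free disk space for a full copy
-- of the table while it runs.
CLUSTER data2013_01w USING data2013_01w_taxiid_idx;  -- index name assumed

-- Refresh planner statistics after the rewrite.
ANALYZE data2013_01w;
```

CLUSTER is a one-time operation: rows inserted afterwards are not kept
in index order, which is why it fits a static table or a partition that
is no longer written to.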

	hp

-- 
   _  | Peter J. Holzer    | I want to forget all about both belts and
|_|_) |                    | suspenders; instead, I want to buy pants 
| |   | hjp@xxxxxx         | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/
