Re: indexing for distinct search in timestamp based table

"Scott Carey" <scott@xxxxxxxxxxxxxxxxx> · Thu, 28 Aug 2008 16:01:58 -0700

Another suggestion is to partition the table by date ranges.  If most of the range queries occur on particular batches of time, this will make all queries more efficient, and improve locality and efficiency of all indexes on the table.

This is more work than simply a table CLUSTER, especially in maintenance overhead, but it will generally help a lot in cases like these.  Additionally, if these don't change much after some period of time the tables older than the modification window can be vacuumed, clustered, and reindexed if needed to make them as efficient as possible and maintenance free after that point (other than backups and archives).

Another benefit of clustering is in backup / restore.  You can incrementally back up only the index partitions that have changed -- for large databases this reduces pg_dump and pg_restore times substantially.  To do this you combine regular expressions with the pg_dump "exclude tables" or "include tables" flags.

On Thu, Aug 28, 2008 at 3:48 PM, David Rowley <dgrowley@xxxxxxxxx> wrote:

I once also had a similar performance problem when looking for all matching

rows between two timestamps. In fact that's why I'm here today. The problem

was with MySQL. I had some tables of around 10 million rows and all my

searching was timestamp based. MySQL didn't do what I wanted. I found that

using a CLUSTERED index with postgresql to be lightning quick. Yet mostly

the matching rows I was working with was not much over the 100k mark. I'm

wondering if clustering the table on ad_log_start_time will help cut down on

random reads.

That's if you can afford to block the users while postgresql clusters the

table.

If you're inserting in order of the start_time column (which I was) then the

cluster should almost maintain itself (I think), providing you're not

updating or deleting anyway, I'd assume that since it looks like a log

table.

David.

-----Original Message-----

From: pgsql-performance-owner@xxxxxxxxxxxxxx

[mailto:pgsql-performance-owner@xxxxxxxxxxxxxx] On Behalf Of Rainer Mager

Sent: 28 August 2008 09:06

To: pgsql-performance@xxxxxxxxxxxxxx

Subject: [PERFORM] indexing for distinct search in timestamp based table

I'm looking for some help in speeding up searches. My table is pretty simple

(see below), but somewhat large, and continuously growing. Currently it has

about 50 million rows.

The table is (I know I have excessive indexes, I'm trying to get the

appropriate ones and drop the extras):

                                          Table "public.ad_log"

    Column    |            Type             |

Modifiers

--------------+-----------------------------+-------------------------------

-----------------------------

 ad_log_id    | integer                     | not null default

nextval('ad_log_ad_log_id_seq'::regclass)

 channel_name | text                        | not null

 player_name  | text                        | not null

 ad_name      | text                        | not null

 start_time   | timestamp without time zone | not null

 end_time     | timestamp without time zone | not null

Indexes:

    "ad_log_pkey" PRIMARY KEY, btree (ad_log_id)

    "ad_log_channel_name_key" UNIQUE, btree (channel_name, player_name,

ad_name, start_time, end_time)

    "ad_log_ad_and_start" btree (ad_name, start_time)

    "ad_log_ad_name" btree (ad_name)

    "ad_log_all" btree (channel_name, player_name, start_time, ad_name)

    "ad_log_channel_name" btree (channel_name)

    "ad_log_end_time" btree (end_time)

    "ad_log_player_and_start" btree (player_name, start_time)

    "ad_log_player_name" btree (player_name)

    "ad_log_start_time" btree (start_time)

The query I'm trying to speed up is below. In it the <field> tag can be one

of channel_name, player_name, or ad_name. I'm actually trying to return the

distinct values and I found GROUP BY to be slightly faster than using

DISTINCT. Also, any of those fields may be unspecified in the WHERE clauses

in which case we use '%', but it seems Postgres optimizes that pretty well.

SELECT <field> FROM ad_log

        WHERE channel_name LIKE :channel_name

        AND player_name LIKE :player_name

        AND ad_name LIKE :ad_name

        AND start_time BETWEEN :start_date AND (date(:end_date) + 1)

        GROUP BY <field> ORDER BY <field>

A typical query is:

explain analyze SELECT channel_name FROM ad_log WHERE channel_name LIKE '%'

AND ad_name LIKE '%' AND start_time BETWEEN '2008-07-01' AND

(date('2008-07-28') + 1) GROUP BY channel_name ORDER BY channel_name;

with the result being:

QUERY PLAN

----------------------------------------------------------------------------

----------------------------------------------------------------------------

-------

 Sort  (cost=1163169.02..1163169.03 rows=5 width=10) (actual

time=75460.187..75460.192 rows=15 loops=1)

   Sort Key: channel_name

   Sort Method:  quicksort  Memory: 17kB

   ->  HashAggregate  (cost=1163168.91..1163168.96 rows=5 width=10) (actual

time=75460.107..75460.114 rows=15 loops=1)

         ->  Bitmap Heap Scan on ad_log  (cost=285064.30..1129582.84

rows=13434427 width=10) (actual time=8506.250..65771.597 rows=13701296

loops=1)

               Recheck Cond: ((start_time >= '2008-07-01

00:00:00'::timestamp without time zone) AND (start_time <=

'2008-07-29'::date))

               Filter: ((channel_name ~~ '%'::text) AND (ad_name ~~

'%'::text))

               ->  Bitmap Index Scan on ad_log_start_time

(cost=0.00..281705.70 rows=13434427 width=0) (actual time=8488.443..8488.443

rows=13701296 loops=1)

                     Index Cond: ((start_time >= '2008-07-01

00:00:00'::timestamp without time zone) AND (start_time <=

'2008-07-29'::date))

 Total runtime: 75460.361 ms

It seems to me there should be some way to create an index to speed this up,

but the various ones I've tried so far haven't helped. Any suggestions would

be greatly appreciated.

--

Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)

To make changes to your subscription:

http://www.postgresql.org/mailpref/pgsql-performance

--

Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)

To make changes to your subscription:

http://www.postgresql.org/mailpref/pgsql-performance