Thanks for the suggestions.
David’s assumption is correct in that this is a log table
with no updates or deletes. I tried CLUSTERing it in our test
environment, but it didn’t seem to make any difference for my query. It
did take about 6 hours to do the transformation, so it would be difficult to
find the time to do it in production, but I’m sure we could work something
out if it really looked beneficial. Unfortunately, as I said, initial tests don’t
seem to indicate any benefit.
I believe my performance difficulty comes from the need to return
DISTINCT (or GROUP BY) values. That is, the normal start_time index seems fine
for limiting the date range, but when I need to select DISTINCT values within
that range it seems that Postgres still has to scan every row the range covers.
Unfortunately, we support arbitrary date range queries on this
table, so I don’t think the partitioning idea is an option for us.
What I’m playing with now is creating separate lookup tables to
hold the channel_name, ad_name, and player_name values with PRIMARY KEY ids
(see the sketch below). Since there are very few of these compared to the
number of rows in the main table, this will give me a quick way to get the
DISTINCT values over the entire data set. My problem then will be reducing
that to the DISTINCT values for a limited date range.
As a side benefit I expect the database to shrink
considerably, since these text fields, although not that long (roughly 20 to 50
characters), are certainly longer than a simple foreign key reference.
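Roughly, the layout I’m experimenting with looks like this (the table and
column names below are just placeholders, not our real schema):

-- Small lookup tables, one per text field (names are illustrative only).
CREATE TABLE channels (
    channel_id   serial PRIMARY KEY,
    channel_name text NOT NULL UNIQUE
);

CREATE TABLE players (
    player_id   serial PRIMARY KEY,
    player_name text NOT NULL UNIQUE
);

CREATE TABLE ads (
    ad_id   serial PRIMARY KEY,
    ad_name text NOT NULL UNIQUE
);

-- The log rows then carry small integer references instead of repeating
-- the 20-50 character strings on every one of the ~50 million rows.
CREATE TABLE ad_log_normalized (
    ad_log_id  serial PRIMARY KEY,
    channel_id integer NOT NULL REFERENCES channels,
    player_id  integer NOT NULL REFERENCES players,
    ad_id      integer NOT NULL REFERENCES ads,
    start_time timestamp without time zone NOT NULL,
    end_time   timestamp without time zone NOT NULL
);

-- DISTINCT over the whole data set becomes a scan of a tiny table:
SELECT channel_name FROM channels ORDER BY channel_name;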
--Rainer
From: scott@xxxxxxxxxx
[mailto:scott@xxxxxxxxxx] On Behalf Of Scott Carey
Sent: Friday, August 29, 2008 8:02 AM
To: David Rowley
Cc: Rainer Mager; pgsql-performance@xxxxxxxxxxxxxx
Subject: Re: [PERFORM] indexing for distinct search in timestamp based
table
Another suggestion is to
partition the table by date ranges. If most of the range queries occur on
particular batches of time, this will make those queries more efficient, and
improve the locality and efficiency of all indexes on the table.
This is more work than simply a table CLUSTER, especially in maintenance
overhead, but it will generally help a lot in cases like these.
Additionally, if rows don't change much after some period of time, the
partitions older than the modification window can be vacuumed, clustered, and
reindexed as needed to make them as efficient as possible and maintenance-free
after that point (other than backups and archives).
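For example, with the inheritance-based partitioning available in 8.x, a
sketch might look like the following (partition names, boundaries, and the
setting shown are only illustrative, and routing inserts into the correct
child table -- via a trigger, a rule, or the loading application -- is left
out):

-- Children of ad_log hold the rows, one month per partition.
CREATE TABLE ad_log_2008_07 (
    CHECK (start_time >= '2008-07-01' AND start_time < '2008-08-01')
) INHERITS (ad_log);

CREATE TABLE ad_log_2008_08 (
    CHECK (start_time >= '2008-08-01' AND start_time < '2008-09-01')
) INHERITS (ad_log);

-- Each partition gets its own, much smaller, indexes.
CREATE INDEX ad_log_2008_07_start_time ON ad_log_2008_07 (start_time);
CREATE INDEX ad_log_2008_08_start_time ON ad_log_2008_08 (start_time);

-- With constraint exclusion enabled, a bounded date-range query only
-- touches the partitions whose CHECK constraints overlap the range.
SET constraint_exclusion = on;
SELECT DISTINCT channel_name
FROM ad_log
WHERE start_time BETWEEN '2008-07-01' AND '2008-07-29';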
Another benefit of partitioning is in backup / restore. You can
incrementally back up only the partitions that have changed -- for large
databases this reduces pg_dump and pg_restore times substantially. To do
this you combine table name patterns with the pg_dump --table or
--exclude-table flags.
On Thu, Aug 28, 2008 at 3:48 PM, David Rowley <dgrowley@xxxxxxxxx> wrote:
I once also had a similar performance problem when looking for all matching
rows between two timestamps. In fact that's why I'm here today. The problem
was with MySQL. I had some tables of around 10 million rows and all my
searching was timestamp based. MySQL didn't do what I wanted. I found that
using a CLUSTERED index with PostgreSQL was lightning quick, yet the matching
row sets I was working with were mostly not much over the 100k mark. I'm
wondering if clustering the table on ad_log_start_time will help cut down on
random reads.
That's if you can afford to block the users while PostgreSQL clusters the
table.
If you're inserting in order of the start_time column (which I was) then the
cluster should almost maintain itself (I think), providing you're not
updating or deleting; I'd assume you aren't, since it looks like a log
table.
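If you want to try the CLUSTER, something along these lines (8.3 syntax; on
older releases the form is CLUSTER indexname ON tablename):

-- Rewrites the table in ad_log_start_time order. CLUSTER takes an
-- exclusive lock on the table for the duration, so it needs a
-- maintenance window.
CLUSTER ad_log USING ad_log_start_time;
ANALYZE ad_log;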
David.
-----Original Message-----
From: pgsql-performance-owner@xxxxxxxxxxxxxx
[mailto:pgsql-performance-owner@xxxxxxxxxxxxxx]
On Behalf Of Rainer Mager
Sent: 28 August 2008 09:06
To: pgsql-performance@xxxxxxxxxxxxxx
Subject: [PERFORM] indexing for distinct search in timestamp based table
I'm looking for some help in speeding up searches. My table is pretty simple
(see below), but somewhat large, and continuously growing. Currently it has
about 50 million rows.
The table is as follows (I know I have excessive indexes; I'm trying to work
out which are the appropriate ones and drop the extras):
                                  Table "public.ad_log"
    Column    |            Type             |                         Modifiers
--------------+-----------------------------+------------------------------------------------------------
 ad_log_id    | integer                     | not null default nextval('ad_log_ad_log_id_seq'::regclass)
 channel_name | text                        | not null
 player_name  | text                        | not null
 ad_name      | text                        | not null
 start_time   | timestamp without time zone | not null
 end_time     | timestamp without time zone | not null
Indexes:
    "ad_log_pkey" PRIMARY KEY, btree (ad_log_id)
    "ad_log_channel_name_key" UNIQUE, btree (channel_name, player_name, ad_name, start_time, end_time)
    "ad_log_ad_and_start" btree (ad_name, start_time)
    "ad_log_ad_name" btree (ad_name)
    "ad_log_all" btree (channel_name, player_name, start_time, ad_name)
    "ad_log_channel_name" btree (channel_name)
    "ad_log_end_time" btree (end_time)
    "ad_log_player_and_start" btree (player_name, start_time)
    "ad_log_player_name" btree (player_name)
    "ad_log_start_time" btree (start_time)
The query I'm trying to speed up is below. In it the <field> placeholder can
be one of channel_name, player_name, or ad_name. I'm actually trying to return
the distinct values, and I found GROUP BY to be slightly faster than using
DISTINCT. Also, any of those fields may be unspecified in the WHERE clause,
in which case we use '%', but it seems Postgres optimizes that pretty well.
SELECT <field> FROM ad_log
    WHERE channel_name LIKE :channel_name
      AND player_name LIKE :player_name
      AND ad_name LIKE :ad_name
      AND start_time BETWEEN :start_date AND (date(:end_date) + 1)
    GROUP BY <field> ORDER BY <field>
A typical query is:
explain analyze SELECT channel_name FROM ad_log
    WHERE channel_name LIKE '%' AND ad_name LIKE '%'
      AND start_time BETWEEN '2008-07-01' AND (date('2008-07-28') + 1)
    GROUP BY channel_name ORDER BY channel_name;
with the result being:
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1163169.02..1163169.03 rows=5 width=10) (actual time=75460.187..75460.192 rows=15 loops=1)
   Sort Key: channel_name
   Sort Method:  quicksort  Memory: 17kB
   ->  HashAggregate  (cost=1163168.91..1163168.96 rows=5 width=10) (actual time=75460.107..75460.114 rows=15 loops=1)
         ->  Bitmap Heap Scan on ad_log  (cost=285064.30..1129582.84 rows=13434427 width=10) (actual time=8506.250..65771.597 rows=13701296 loops=1)
               Recheck Cond: ((start_time >= '2008-07-01 00:00:00'::timestamp without time zone) AND (start_time <= '2008-07-29'::date))
               Filter: ((channel_name ~~ '%'::text) AND (ad_name ~~ '%'::text))
               ->  Bitmap Index Scan on ad_log_start_time  (cost=0.00..281705.70 rows=13434427 width=0) (actual time=8488.443..8488.443 rows=13701296 loops=1)
                     Index Cond: ((start_time >= '2008-07-01 00:00:00'::timestamp without time zone) AND (start_time <= '2008-07-29'::date))
 Total runtime: 75460.361 ms
It seems to me there should be some way to create an index to speed this up,
but the various ones I've tried so far haven't helped. Any suggestions would
be greatly appreciated.
--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance