Performance on large, append-only tables

David Yeu <david.yeu@xxxxxxxxx> · Wed, 8 Feb 2012 18:03:04 +0000

Hi there,

We've got a pretty large table that sees millions of new rows a day, and
we're trying our best to optimize queries against it. We're hoping to find
some guidance on this list.

Thankfully, the types of queries that we perform against this table are
pretty constrained. We never update rows and we never join against other
tables. The table essentially looks like this:

| id | group_id | created_at | everything elseŠ

Where `id' is the primary key, auto-incrementing, `group_id' is the
foreign key that we always scope against, and `created_at' is the
insertion time. We have indices against the primary key and the group_id.
Our queries essentially fall into the following cases:

 * Š WHERE group_id = ? ORDER BY created_at DESC LIMIT 20;
 * Š WHERE group_id = ? AND id > ? ORDER BY created_at DESC;
 * Š WHERE group_id = ? AND id < ? ORDER BY created_at DESC LIMIT 20;
 * Š WHERE group_id = ? ORDER BY created_at DESC LIMIT 20 OFFSET ?;

In human words, we're looking for:

 * The most recent (20) rows.
 * The most recent rows after a given `id'.
 * Twenty rows before a given `id'.
 * Pages of twenty rows.

Originally, this table was part of our primary database, but recently we
saw queries take upwards of thirty seconds or more to complete. Since
we're serving web requests, that's basically unacceptable, and caused a
lot of requests to backup. Our interim solution has been to simply carve
out a new database that hosts only this table, and that has worked to some
degree. We aren't seeing thirty seconds plus database response times
anymore, but some queries still take many seconds and the cost of spinning
up a new master-slave configuration hasn't been cheap.

In the meantime, we're hoping to investigate other ways to optimize this
table and the queries against it. Heroku's data team has suggested balling
up these rows into arrays, where a single row would represent a group_id,
and the data would occupy a single column as an array. We don't have any
experience with this and were wondering if anyone here has tried it.

And finally, we're also trying out alternative stores, since it seems like
this data and its retrieval could be well suited to document-oriented
backends. Redis and DynamoDB are currently the best contenders.

Thanks in advance for any help,

Regards,

Dave Yeu & Neil Sarkar
GroupMe

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance