Alvaro Herrera wrote:
Performance analysis of strange queries is useful, but the input queries
have to be meaningful as well. Otherwise you end up optimizing bizarre
and useless cases.
I had a similar case a few weeks ago. I was doing some batch processing over a
bunch of documents and discovered PostgreSQL was faster if I let it
process just 1000 documents at a time, instead of all 45000 at once. But
a batch of 1000 was still faster than running the query 1000 times for a
single document.
So I started with a query like:
SELECT docid, (SELECT work to be done for each document)
FROM documents
ORDER BY docid
LIMIT 1000
OFFSET ?
And I noticed the 44th iteration was much slower than the first.
Rewriting it to something like this made the last iteration about as
fast as the first:
SELECT docid, (SELECT work to be done for each document)
FROM documents
WHERE docid IN (SELECT docid FROM documents
                ORDER BY docid
                LIMIT 1000
                OFFSET ?)
I know something like that isn't very set-based thinking, but then again
the query's structure came from an iterative algorithm. It turned out
to be faster (less query overhead) and easier to scale in PostgreSQL.
I've tried a few more set-like structures, but those were all slower
than this approach, probably because they were a little more complex.
Some of them took more than 10x the amount of time...
Another real-life example would be displaying the number of replies to
each topic in a forum's topic listing, or the name of the author of the
last message. You probably don't want to count all the replies for every
topic if you're only going to display topics 100 - 200.
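To give an idea (just a rough sketch, with a made-up schema of
topics(topicid, title) and messages(msgid, topicid, author, posted_at)),
the same trick as above keeps the expensive subselects limited to the
topics that are actually shown:

SELECT t.topicid, t.title,
       -- number of replies, counted only for this topic
       (SELECT count(*) FROM messages m
         WHERE m.topicid = t.topicid) AS replies,
       -- author of the most recent message in this topic
       (SELECT m.author FROM messages m
         WHERE m.topicid = t.topicid
         ORDER BY m.posted_at DESC
         LIMIT 1) AS last_author
FROM topics t
-- cheap inner query picks just the 100 topic ids to display
WHERE t.topicid IN (SELECT topicid FROM topics
                    ORDER BY topicid
                    LIMIT 100 OFFSET 100)
ORDER BY t.topicid;

The two subselects in the select list only run for the 100 rows that
survive the IN, instead of for every topic in the table.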
And there are a few more examples to think of where a join + group by
isn't going to work, but a subquery in the select list does just what
you want.
Of course, most of the time you won't be using an OFFSET then.
Best regards,
Arjen