I have the following three tables:
DOCUMENT
  id (indexed)
  documenttype
  date_last_updated: timestamp(6) (indexed)
  (other columns, e.g. fk_id_job, omitted)

EXTERNAL_TRANSLATION_UNIT
  id (indexed)
  fk_id_document (indexed)
  (other columns, e.g. fk_id_language, omitted)

EXTERNAL_TRANSLATION
  id (indexed)
  fk_id_translation_unit (indexed)
  (other columns, e.g. fk_id_language, omitted)
Table sizes:
  DOCUMENT: 381,000 rows
  EXTERNAL_TRANSLATION_UNIT: 76,000,000 rows
  EXTERNAL_TRANSLATION: 76,000,000 rows
Now the following query takes about 36 minutes to finish:
SELECT u.id AS id_external_translation_unit,
       r.id AS id_external_translation,
       u.fk_id_language AS fk_id_source_language,
       r.fk_id_language AS fk_id_target_language,
       doc.fk_id_job
FROM "EXTERNAL_TRANSLATION_UNIT" u
JOIN "DOCUMENT" doc ON u.fk_id_document = doc.id
JOIN "EXTERNAL_TRANSLATION" r ON u.id = r.fk_id_translation_unit
WHERE doc.date_last_updated >= date(now() - '171:00:00'::interval)
ORDER BY r.id
LIMIT 1000;
This is the query plan:
<http://postgresql.nabble.com/file/n5864045/qp1.png>
> If I remove the WHERE condition, it returns immediately.

So does "SELECT 1;" - but since that doesn't give the same answer, it is not very relevant.

> Am I doing something obviously wrong?

Not obviously...

> Thank you for any ideas.
Consider updating the translation tables at the same time the document table is updated. That way you can apply the WHERE and ORDER BY clauses against the same table.
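A minimal sketch of that idea, assuming you add a date_last_updated column to EXTERNAL_TRANSLATION_UNIT and keep it in sync with a trigger (the column, function, and trigger names here are all hypothetical):

ALTER TABLE "EXTERNAL_TRANSLATION_UNIT" ADD COLUMN date_last_updated timestamp(6);
CREATE INDEX ON "EXTERNAL_TRANSLATION_UNIT" (date_last_updated);

-- Copy the document's timestamp down to its translation units on every update.
CREATE OR REPLACE FUNCTION touch_translation_units() RETURNS trigger AS $$
BEGIN
  UPDATE "EXTERNAL_TRANSLATION_UNIT"
     SET date_last_updated = NEW.date_last_updated
   WHERE fk_id_document = NEW.id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER document_touch
AFTER UPDATE OF date_last_updated ON "DOCUMENT"
FOR EACH ROW EXECUTE PROCEDURE touch_translation_units();

You pay for it with write amplification on document updates, but the WHERE and ORDER BY can then be satisfied from a single table and index.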
I presume you've run ANALYZE on the data.
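If not, it is cheap insurance - it refreshes the statistics the planner bases its row estimates on:

ANALYZE "DOCUMENT";
ANALYZE "EXTERNAL_TRANSLATION_UNIT";
ANALYZE "EXTERNAL_TRANSLATION";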
I would probably try something like:
WITH docs AS (SELECT id FROM "DOCUMENT"
              WHERE date_last_updated >= date(now() - '171:00:00'::interval))
SELECT u.id, r.id
FROM "EXTERNAL_TRANSLATION_UNIT" u
JOIN "EXTERNAL_TRANSLATION" r ON u.id = r.fk_id_translation_unit
WHERE EXISTS (SELECT 1 FROM docs WHERE u.fk_id_document = docs.id)
ORDER BY r.id LIMIT 1000;
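If you still need doc.fk_id_job in the output, a plain join against the CTE instead of EXISTS should keep the same shape (again, just a sketch):

WITH docs AS (SELECT id, fk_id_job FROM "DOCUMENT"
              WHERE date_last_updated >= date(now() - '171:00:00'::interval))
SELECT u.id, r.id, docs.fk_id_job
FROM "EXTERNAL_TRANSLATION_UNIT" u
JOIN "EXTERNAL_TRANSLATION" r ON u.id = r.fk_id_translation_unit
JOIN docs ON u.fk_id_document = docs.id
ORDER BY r.id LIMIT 1000;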
You are trying to avoid the nested loop, and the above has a decent chance of materializing docs and then building either a bitmap or a hash table for both docs and translations, thus performing a single sequential scan over each instead of 70+ million index lookups.
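One way to test that theory without changing anything permanently is to discourage the nested loop for a single session and compare the plans:

SET enable_nestloop = off;  -- session-local; nudges the planner toward hash/merge joins
EXPLAIN (ANALYZE, BUFFERS)  -- run the original query under this setting and compare
SELECT u.id, r.id
FROM "EXTERNAL_TRANSLATION_UNIT" u
JOIN "DOCUMENT" doc ON u.fk_id_document = doc.id
JOIN "EXTERNAL_TRANSLATION" r ON u.id = r.fk_id_translation_unit
WHERE doc.date_last_updated >= date(now() - '171:00:00'::interval)
ORDER BY r.id LIMIT 1000;
RESET enable_nestloop;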
Take this with a grain of salt, as my fluency in this area is limited - I tend to work by trial and error, and without the data that is difficult.
I'm not sure whether the planner could be smarter here, because you are asking a question it is not particularly suited to estimating - namely, cross-table correlations. Rethinking the model is likely to give you a better outcome long-term, though there does seem to be room for improvement within the stated query and model.
As Tomas said, you will likely benefit from increased working memory (work_mem), so that materializing and hashing/bitmapping look favorable to the planner compared to a nested loop.
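For example (the value is an assumption - size it to the RAM you can afford per sort/hash operation):

SET work_mem = '256MB';  -- per-operation memory for sorts and hashes; hypothetical value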
David J.