Hello,

I'm trying to figure out how to optimise a three-table join (a many-to-many relation) with a predicate, an ORDER BY, and a LIMIT, where one of the tables returns at most one row. This is the query I have right now:

    SELECT entity.id
    FROM (
        SELECT entity_tag.entity_id
        FROM tag
        JOIN entity_tag ON tag.id = entity_tag.tag_id
        WHERE tag.key = 'status'
          AND tag.value = 'SUCCEEDED'
    ) matched
    JOIN entity ON matched.entity_id = entity.id
    WHERE entity.type = 'execution'
    ORDER BY entity.id DESC
    LIMIT 10;

It runs very slowly when many rows in entity are matched (about 90K rows in my case), even though tag returns at most one row:

    Limit  (cost=723.39..723.40 rows=1 width=4) (actual time=6189.015..6189.242 rows=10 loops=1)
      Output: entity.id
      Buffers: shared hit=411886 read=31282
      ->  Sort  (cost=723.39..723.40 rows=1 width=4) (actual time=6188.999..6189.059 rows=10 loops=1)
            Output: entity.id
            Sort Key: entity.id DESC
            Sort Method: top-N heapsort  Memory: 25kB
            Buffers: shared hit=411886 read=31282
            ->  Nested Loop  (cost=1.28..723.38 rows=1 width=4) (actual time=0.153..5590.717 rows=89222 loops=1)
                  Output: entity.id
                  Buffers: shared hit=411886 read=31282
                  ->  Nested Loop  (cost=0.86..721.98 rows=3 width=8) (actual time=0.108..1851.707 rows=89222 loops=1)
                        Output: entity_tag.entity_id
                        Buffers: shared hit=65146 read=20646
                        ->  Index Scan using tag_key_value_key on public.tag  (cost=0.43..8.45 rows=1 width=4) (actual time=0.043..0.061 rows=1 loops=1)
                              Output: tag.id, tag.key, tag.value, tag.created_at
                              Index Cond: (((tag.key)::text = 'status'::text) AND ((tag.value)::text = 'SUCCEEDED'::text))
                              Buffers: shared hit=1 read=3
                        ->  Index Only Scan using entity_tag_tag_id_entity_id_idx on public.entity_tag  (cost=0.43..711.53 rows=201 width=16) (actual time=0.035..756.829 rows=89222 loops=1)
                              Output: entity_tag.tag_id, entity_tag.entity_id
                              Index Cond: (entity_tag.tag_id = tag.id)
                              Heap Fetches: 89222
                              Buffers: shared hit=65145 read=20643
                  ->  Index Scan using entity_pkey on public.entity  (cost=0.42..0.46 rows=1 width=4) (actual time=0.010..0.017 rows=1 loops=89222)
                        Output: entity.id, entity.entity_id, entity.type, entity.created_at
                        Index Cond: (entity.id = entity_tag.entity_id)
                        Filter: ((entity.type)::text = 'execution'::text)
                        Buffers: shared hit=346740 read=10636
    Planning time: 0.817 ms
    Execution time: 6189.419 ms

Both tag_key_value_key and entity_tag_tag_id_entity_id_idx are UNIQUE constraints, on tag(key, value) and entity_tag(tag_id, entity_id) respectively.

It seems to me that PostgreSQL runs the nested loop over all 90K matching rows because it wants to sort the result before applying the limit. It doesn't take the UNIQUE constraint into account: it assumes the join inside the subquery may change the ordering of the entity_id values the subquery returns, which forces the sort. Given how the index is sorted, I believe it should be able to simply scan the index backwards, because at most one tag_id will be returned.

When I change the predicate to filter by id instead, the query is blazing fast:

    -- This runs very fast
    SELECT entity.id
    FROM (
        SELECT entity_tag.entity_id
        FROM tag
        JOIN entity_tag ON tag.id = entity_tag.tag_id
        WHERE tag.id = 24
    ) matched
    JOIN entity ON matched.entity_id = entity.id
    WHERE entity.type = 'execution'
    ORDER BY entity.id DESC
    LIMIT 10;
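(Here 24 stands for the id of the 'status' = 'SUCCEEDED' tag; as an illustration, it can be fetched beforehand with a separate lookup along these lines, which is the two-query workaround I would like to avoid:)

```sql
-- Resolve the tag id once, then the main query can filter
-- entity_tag by that constant directly.
SELECT id
FROM tag
WHERE key = 'status'
  AND value = 'SUCCEEDED';
```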
This time PostgreSQL seems to know that the join inside the subquery won't change the ordering of the entity_id values returned by the subquery, as the following plan shows:

    Limit  (cost=1.28..1025.56 rows=10 width=4) (actual time=0.144..0.276 rows=1 loops=1)
      Output: entity.id
      Buffers: shared hit=12
      ->  Nested Loop  (cost=1.28..1537.70 rows=15 width=4) (actual time=0.125..0.238 rows=1 loops=1)
            Output: entity.id
            Buffers: shared hit=12
            ->  Nested Loop  (cost=0.86..1529.06 rows=15 width=12) (actual time=0.057..0.116 rows=1 loops=1)
                  Output: entity_tag.tag_id, entity.id
                  Buffers: shared hit=8
                  ->  Index Only Scan Backward using entity_tag_tag_id_entity_id_idx on public.entity_tag  (cost=0.43..454.82 rows=128 width=16) (actual time=0.018..0.038 rows=1 loops=1)
                        Output: entity_tag.tag_id, entity_tag.entity_id
                        Index Cond: (entity_tag.tag_id = 24)
                        Heap Fetches: 1
                        Buffers: shared hit=4
                  ->  Index Scan using entity_pkey on public.entity  (cost=0.42..8.38 rows=1 width=4) (actual time=0.011..0.030 rows=1 loops=1)
                        Output: entity.id, entity.entity_id, entity.type, entity.created_at
                        Index Cond: (entity.id = entity_tag.entity_id)
                        Filter: ((entity.type)::text = 'execution'::text)
                        Buffers: shared hit=4
            ->  Materialize  (cost=0.43..8.45 rows=1 width=4) (actual time=0.040..0.078 rows=1 loops=1)
                  Output: tag.id
                  Buffers: shared hit=4
                  ->  Index Only Scan using tag_pkey on public.tag  (cost=0.43..8.45 rows=1 width=4) (actual time=0.021..0.040 rows=1 loops=1)
                        Output: tag.id
                        Index Cond: (tag.id = 24)
                        Heap Fetches: 1
                        Buffers: shared hit=4
    Planning time: 0.362 ms
    Execution time: 0.458 ms

This confirms that it can indeed just scan the index backwards. I can't figure out why PostgreSQL doesn't take the UNIQUE index into account in the first case. Is there any way to do this with one query? Or is the database design itself flawed?

Environment:
- PostgreSQL version: 9.6
- OS: Linux (Amazon RDS) / MacOSX / Docker

Installation:
- MacOSX: PostgreSQL is installed using brew.
- Docker: PostgreSQL image from https://hub.docker.com/_/postgres.

I use the default configuration in all of these environments.

Thanks!

--
Best regards,
Aufar Gilbran