on 5/18/02 4:24 PM, andrew@xxxxxxxxxxxxxxx purportedly said:

> PostgreSQL will certainly take LIMIT into account when planning queries.
>
> EXPLAIN SELECT * FROM mybigtable WHERE x = y;
>
> and
>
> EXPLAIN SELECT * FROM mybigtable WHERE x = y LIMIT n;
>
> Will regularly show different query plans.

LIMIT *is* taken into account, and a different query plan may be used
depending on the LIMIT and OFFSET values. However, as you will see below,
the query plans are all identical (aside from the reported LIMIT cost, and
the use of an index because of the sorting).

Here is a real example with a table of 32k records:

milewiz=> explain select * from activity;
Seq Scan on activity  (cost=0.00..20.00 rows=1000 width=39)

milewiz=> explain select * from activity limit 200;
Limit  (cost=0.00..4.00 rows=200 width=39)
  ->  Seq Scan on activity  (cost=0.00..20.00 rows=1000 width=39)

Here a sequential scan is used, but because we want only the first 200
tuples, fetching the results is much faster. A similar difference would
likely appear if a WHERE clause were added. However, LIMIT is next to
useless unless the results are ordered, which would be necessary for the
application in question:

milewiz=> explain select * from activity order by date;
Index Scan using act_date_idx on activity  (cost=0.00..59.00 rows=1000 width=39)

milewiz=> explain select * from activity order by date limit 200;
Limit  (cost=0.00..11.80 rows=200 width=39)
  ->  Index Scan using act_date_idx on activity  (cost=0.00..59.00 rows=1000 width=39)

Here the cost is much higher, partly because of the ordering and partly
because more rows must be read to satisfy it. But only the first call can
use LIMIT n by itself--subsequent calls must also use OFFSET. Consider:

milewiz=> explain select * from activity order by date limit 200 offset 200;
Limit  (cost=11.80..23.60 rows=200 width=39)
  ->  Index Scan using act_date_idx on activity  (cost=0.00..59.00 rows=1000 width=39)

milewiz=> explain select * from activity order by date limit 200 offset 800;
Limit  (cost=47.20..59.00 rows=200 width=39)
  ->  Index Scan using act_date_idx on activity  (cost=0.00..59.00 rows=1000 width=39)

milewiz=> explain select * from activity order by date limit 200 offset 10000;
Limit  (cost=59.00..59.00 rows=1 width=39)
  ->  Index Scan using act_date_idx on activity  (cost=0.00..59.00 rows=1000 width=39)

The optimizer estimates that a full scan of the table will return only
1000 rows, when it will in fact return 32k, so as we progress through the
calls there comes a point where the plan it executes costs the same as a
full table scan, regardless of the LIMIT. Frequent VACUUM ANALYZE will
help this somewhat.

And as you can see, the query plans are all identical. Performance will
depend mostly on the distribution of field values, which affects how many
records must be scanned to satisfy the LIMIT. At a certain point this
becomes the same as a full scan, and with smaller result sets it happens
sooner.
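
(Incidentally, if the application can keep a transaction open between
pages, a cursor avoids paying the growing OFFSET cost on every call, since
each FETCH continues from where the previous one stopped. A rough
sketch--the cursor name act_cur is only for illustration:

milewiz=> begin;
milewiz=> declare act_cur cursor for select * from activity order by date;
milewiz=> fetch 200 from act_cur;   -- first page
milewiz=> fetch 200 from act_cur;   -- next page, no re-scan from the start
milewiz=> close act_cur;
milewiz=> commit;

This only works if the connection and transaction can stay open for the
life of the paging, which may not suit every application.)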

> For more complex queries, however, the plans are less likely to differ
>

Actually, as you can see from my examples above, the reverse is true: the
simple queries all produced the same plan. The more complex the query, the
more choices the optimizer has for picking the best case, which
unfortunately is not always the truly best case.

> OTOH PostgreSQL does _not_ do result caching, unless you have applied
> some patches that were floating around last year some time.

Yes, I recalled incorrectly. Postgres does not cache results, but it does
cache tables. So subsequent queries against a particular table will speed
up, as long as there is enough shared memory to hold the table; otherwise
performance may vary with the result set. If only part of a table can be
kept in the cache, queries that can be answered from the cached portion
will still benefit.

Keary Suska
Esoteritech, Inc.
"Leveraging Open Source for a better Internet"