I have a fairly large table (~100M rows), let's call it "events", and among other things it has a couple of columns on it, columns that we'll call entity_type_id (an integer) and and published_at (a timestamp). It has, among others, indices on (published_at) and (entity_type_id, published_at). A very common query against this table is of the form... SELECT * FROM events WHERE entity_type_id = XXX ORDER BY published_at DESC LIMIT 25; ... to get the most recent 25 events from the table for a given type of entity, and generally the query planner does the expected thing of using the two-part index on (entity_type_id, published_at). Every now and again, though, I have found the query planner deciding that it ought use the single column (published_at) index. This can, unsurprisingly, result in horrendous performance if events for a given entity type are rare, as we end up with a very long walk of an index. I had this happen again yesterday and I noticed something of particular interest pertaining to the event. Specifically, the query was for an entity type that the system had only seen for the first time one day prior, and furthermore the events table had not been analyzed by the statistics collector for a couple of weeks. My intuition is that the query planner, when working with an enormous table, and furthermore encountering an entity type that the statistics collector had never previously seen, would assume that the number of rows in the events table of that entity type would be very small, and therefore the two-part index on (entity_type_id, published_at) would be the right choice. Nonetheless, an EXPLAIN was showing usage of the (published_at) index, and since there were only ~20 rows in the entire events table for that entity type the queries were getting the worst possible execution imaginable, i.e. reading in the whole table to find the rows that hit, but doing it with the random I/O of an index walk. As an experiment, I ran a VACUUM ANALYZE on the events table, and then re-ran the EXPLAIN of the query, and... Same query plan again... Maybe for whatever issue I am having the random sampling nature of the statistics collector made it unhelpful, i.e. in its sampling of the ~100M rows it never hit a single row that had the new entity type specified? Other possibly relevant pieces of information... The entity type column has a cardinality in the neighborhood of a couple dozen. Meanwhile, for some of the entity types there is a large and ongoing number of events, and for other entity types there is a smaller and more sporadic number of events. Every now and again a new entity type shows up. I can't understand why the query planner would make this choice. Maybe it has gotten ideas into its head about the distribution of data? Or maybe there is a subtle bug that my data set is triggering? Or maybe I need to turn some knobs on statistics collection? Or maybe it's all of these things together? I worry that even if there is a knob turning exercise that helps that we're still going to get burned whenever a new entity type shows up until we re-run ANALYZE, assuming that I can find a fix that involves tweaking statistics collection. I just can't fathom how it would ever be the case that Postgres's choice of index usage in this case would make sense. It doesn't even slot cleanly into the problem space of "why did Postgres do a sequential scan instead of an index scan?". If you're doing a query of the described form and the entity type is specified, wouldn't the two-part index theoretically _always_ yield better performance than the one-part index? Maybe I have a flawed understanding of the cost of using various indexes? Maybe there is something analogous between sequential-versus-index-scan and one-part-versus-two-part-index scan choices? FWIW, we're running on 8.4.X and using the out-of-the-box default_statistics_target setting and haven't dabbled with setting table level statistics configurations. Thoughts? Recommended reading? -- AWG -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance