Re: query against large table not using sensible index to find very small amount of data

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Your understanding of the utility of multi-part indices does not jive
with my own.

While I agree that a partial index might be in order here, that ought
just be a performance optimization that lowers the footprint of the
index from an index size and index maintenance standpoint, not
something that governs when the index is used for an item whose entity
type rarely comes up in the table.  If a couple of the entity types
were to constitute 80% of the events, then using a partial index would
reduce the performance strain of maintaining the index by 80%, but
this ought not govern the query planner's behavior when doing queries
on entity types that were not among those.

My general understanding of the utility of multi-part indices is that
they will come into play when some number of the leading columns
appear in the query as fixed values and furthermore if a subsequent
column appears as part of a ranging operation.

I know that a b-tree structure isn't exactly the same as a
binary-tree, but it is roughly equivalent for the purposes of our
conversation...  I believe you can think of multi-part indices as
(roughly) equivalent either to nested binary trees, or as equivalent
to a binary tree whose keys are the concatenation of the various
columns.  In the former case, doing a range scan would be a matter of
hopping through the nested trees until you got to the terminal range
scan operation, and in the latter case doing a range scan would be a
matter of finding the first node in the tree that fell within the
values for your concatenation and then walking through the tree.  Yes,
that's not exactly what happens with a b-tree, but it's pretty
similar, the main differences being performance operations, I believe.

Given that, I don't understand how having a multi-part index with the
column over which I intend to range comes _earlier_ than the column(s)
that I intend to have be fixed would be helpful.  This is especially
true given that the timestamp columns are are the granularity of
_milliseconds_ and my data set sees a constant stream of inputs with
bursts up to ~100 events per second.  I think what you are describing
could only make sense if the date column were at a large granularity,
e.g hours or days.

Or maybe I have missed something...

  -- AWG

On Tue, Apr 08, 2014 at 01:39:41PM +0000, Shaun Thomas wrote:
> 
> > Other possibly relevant pieces of information...  The entity type
> > column has a cardinality in the neighborhood of a couple dozen.
> > Meanwhile, for some of the entity types there is a large and ongoing
> > number of events, and for other entity types there is a smaller and
> > more sporadic number of events.  Every now and again a new entity
> > type shows up.
> 
> With that as the case, I have two questions for you:
> 
> 1. Why do you have a low cardinality column as the first column in an index?
> 2. Do you have any queries at all that only use the entity type as the only where clause?
> 
> I agree that the planner is probably wrong here, but these choices aren't helping. The low cardinality of the first column causes very large buckets that don't limit results very well at all. Combined with the order-by clause, the planner really wants to walk the date index backwards to find results instead. I would do a couple of things.
> 
> First, remove the type/date index. Next, do a count of each type in the table with something like this:
> 
> SELECT type_id, count(1)
>   FROM my_table
>  GROUP BY 2
> 
> Any type that is more than 20% of the table will probably never be useful in an index. At this point, you have a choice. You can create a new index with date and type *in that order* or create a new partial index on date and type that also ignores the top matches. For instance, if you had a type that was 90% of the values, this would be my suggestion:
> 
> CREATE INDEX idx_foo_table_date_event_type_part ON foo_table (event_date, event_type)
>  WHERE event_type != 14;
> 
> Or whatever. If the IDs are basically evenly distributed, it won't really matter.
> 
> In any case, index order matters. The planner wants to restrict data as quickly as possible. If you provide an order clause, it wants to read the index in that order. Your specified type as the first column disrupts that, so it has to fetch the values first, which is usually more expensive. Even if that's wrong in your particular case, planner stats are not precise enough to know that.
> 
> Either way, try moving the indexes around. I can't think of many indexes in our database where I have the low cardinality value as the first column. Databases have an easier time managing many shallow buckets of values, than a few deep ones.
> 
> --
> Shaun Thomas
> OptionsHouse | 141 W. Jackson Blvd | Suite 400 | Chicago IL, 60604
> 312-676-8870
> sthomas@xxxxxxxxxxxxxxxx
> 
> ______________________________________________
> 
> See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email
> 
> 
> -- 
> Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance


-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance




[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux