Re: Guidelines on best indexing strategy for varying searches on 20+ columns

Thanks for the answers.

> On 30/06/2014 at 20.04, Jeff Janes <jeff.janes@xxxxxxxxx> wrote:
> 
> On Wed, Jun 25, 2014 at 1:48 AM, Niels Kristian Schjødt
> <nielskristian@xxxxxxxxxxxxx> wrote:
>> Hi,
>> I’m running a search engine for cars. It’s backed by a postgresql 9.3 installation.
>> 
>> Now I’m unsure about the best approach/strategy on doing index optimization for the fronted search.
>> 
>> The problem:
>> 
>> The table containing the cars holds around 1.5 million rows. People who search for cars need different criteria to search by. Some search by brand/model, some by year, some by mileage, some by price and some by special equipment etc. - and often they combine a whole bunch of criteria. Of course some, like brand/model and price, are used more frequently than others. In total we offer: 9 category criteria like brand/model or body type, plus 5 numeric criteria like price or mileage, plus 12 boolean criteria like equipment. Lastly, people can order the results by different columns (year, price, mileage and a score we create about the cars). By default we order by our own generated score.
>> 
>> What I’ve done so far:
>> 
>> I have analyzed the usage of the criteria “lightly”, and created a few indexes (10). Among those are e.g. indexes on price and mileage, and a combined index on brand/model. Since we are only interested in showing results for cars that are actually for sale, the indexes are made as partial indexes on a sales state column.
> 
> I'd probably partition the data on whether it is for sale, and then
> search only the for-sale partition.

Hmm okay, I have already made all the indexes partial on the for-sale state. If the queries always filter on for-sale, and all the indexes are partial on that condition, will it still help performance / make sense to partition the tables?
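
To make the question concrete, a simplified sketch of what the current
partial indexes look like (table and column names invented for
illustration, not our real schema):

    -- partial indexes covering only the cars that are for sale
    CREATE INDEX cars_price_for_sale_idx
        ON cars (price)
        WHERE sales_state = 'for_sale';

    CREATE INDEX cars_brand_model_for_sale_idx
        ON cars (brand, model)
        WHERE sales_state = 'for_sale';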
> 
>> 
>> Questions:
>> 
>> 1. How would you go about analyzing and determining what columns should be indexed, and how?
> 
> I'd start out with intuition about which columns are likely to be used
> most often, and in a selective way.  And follow up by logging slow
> queries so they can be dissected at leisure.
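
For anyone reading along, I gather the logging part is mainly one
setting, e.g. (database name invented; it can also go in
postgresql.conf):

    -- log any statement slower than 500 ms for later dissection
    ALTER DATABASE carsdb SET log_min_duration_statement = '500ms';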
> 
>> 2. What is the best strategy when optimizing indexes for searches happening on 20+ columns, where the use and the combinations vary a lot? (To just index everything, to index some of the columns, to do combined indexes, to only do single-column indexes etc.)
> 
> There is no magic index.  Based on your description, you are going to
> be seq scanning your table a lot.  Focus on making it as small as
> possible: partition it horizontally so that the not-for-sale
> entries are hived off to a historical table, and vertically so that
> large columns rarely used in the WHERE clause are in a separate
> table.  (Ideally you would tell PostgreSQL to aggressively TOAST
> those columns, but there is no knob with which to do that.)
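
For my own understanding: on 9.3 there is no declarative partitioning,
so I take it this would be inheritance-based, roughly like the sketch
below (all names invented):

    -- horizontal split: for-sale rows stay in a small, hot table,
    -- everything else is hived off to a historical child table
    CREATE TABLE cars_for_sale (
        CHECK (sales_state = 'for_sale')
    ) INHERITS (cars);

    CREATE TABLE cars_historical (
        CHECK (sales_state <> 'for_sale')
    ) INHERITS (cars);

    -- vertical split: wide, rarely-filtered columns go to a side table
    CREATE TABLE car_details (
        car_id          integer PRIMARY KEY REFERENCES cars (id),
        description     text,
        equipment_notes text
    );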
> 
> 
>> 3. I expect that it does not make sense to index all columns?
> 
> You mean individually, or jointly?  Either way, probably not.
> 
>> 4. I expect it does not make sense to index boolean columns?
> 
> In some cases it can, for example if the data distribution is very
> lopsided and the rarer value is the one that is frequently searched for.
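
That makes sense. So for a lopsided boolean - say a rare piece of
equipment that people do filter on - a partial index on the rare value,
something like this (hypothetical column name):

    -- index only the rare TRUE rows; using score as the indexed column
    -- so the default ordering can piggyback on it
    CREATE INDEX cars_rare_equipment_idx
        ON cars (score)
        WHERE has_rare_equipment AND sales_state = 'for_sale';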
> 
>> 5. Is it better to do a combined index on 5 frequently used columns rather than having individual indexes on each of them?
> 
> How often are the columns specified together?  If they are completely
> independent it probably makes little sense to index them together.
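
As I understand it, Postgres can also combine separate single-column
indexes at run time with a bitmap AND, so columns that are filtered
independently can keep their individual indexes. For example:

    -- with separate partial indexes on price and mileage, the planner
    -- can still combine them when both are filtered (or it may simply
    -- seq scan if the filters are not selective enough)
    EXPLAIN SELECT * FROM cars
     WHERE sales_state = 'for_sale'
       AND price < 20000
       AND mileage < 100000;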
> 
>> 6. Would it be a good idea to have all indexes sorted by my default sorting?
> 
> You don't get to choose.  A btree index is sorted by the columns
> specified in the index, according to the operators specified (or
> defaulted).  Unless you mean that you want to make the default sort
> column the lead column in each index; that actually might make
> sense.
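
So, if I follow, with our default score ordering that would look
something like this (a sketch, names invented):

    -- score first, so "ORDER BY score DESC LIMIT n" can walk the index
    -- and stop early instead of sorting the whole result set
    CREATE INDEX cars_score_idx
        ON cars (score DESC)
        WHERE sales_state = 'for_sale';

    SELECT id, brand, model, price
      FROM cars
     WHERE sales_state = 'for_sale'
     ORDER BY score DESC
     LIMIT 20;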
> 
>> 7. Do you have any experience with other approaches that could greatly improve performance (e.g. forcing indexes to stay in memory etc.)?
> 
> If your queries are as unstructured as you imply, I'd forget about
> indexes for the most part, as you will have a hard time finding ones
> that work.  Concentrate on making seq scans as fast as possible.  If
> most of your queries end in something like "ORDER by price limit 10"
> then concentrate on index scans over price.  You will probably want to
> include heuristics in your UI such that if people configure queries to
> download half your database, you disallow that.  You will probably
> find that 90% of the workload comes from people who are just playing
> around with your website and don't actually intend to do business with
> you.
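
Good point about runaway searches. Besides checks in the UI, I suppose
a server-side guardrail could look like this (my own sketch, not
something from your mail):

    -- cut off queries that would effectively download the database
    SET statement_timeout = '2s';

    -- and always append a hard LIMIT to user-configured searches
    SELECT id, brand, model, price
      FROM cars
     WHERE sales_state = 'for_sale'
       AND price <= 15000
     ORDER BY price
     LIMIT 10;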
> 
> Cheers,
> 
> Jeff


