Re: OT: Data structure design question: How do they count

Richard Huxton <dev@xxxxxxxxxxxx> · Mon, 10 Apr 2006 10:23:48 +0100

Brendan Duddridge wrote:

Now, initially I thought they would just pre-compute these counts, but 
the problem is, when you click on any of the above attribute values, 
they reduce the remaining possible set of matching products (and set of 
possible remaining attributes and attribute values) by the amount 
displayed next to the attribute value selected. You can click on any 
combination of attribute values to filter down the remaining set of 
matching products, so there's a large combination of paths you can take 
to arrive at a set of products you might be interested in.

Do you think they are pre-computed? Or do you think they might use a 
query similar to the following?:

Pre-computed almost certainly, but at what level of granularity? And 
with application-level caching?

select pav.attribute_value_id, count(p.product_id)
from product_attribute_value pav,
     attribute a,
     product p
where a.attribute_id in (some set of attribute ids) and
pav.product_id = p.product_id and
pav.attribute_id = a.attribute_id and p.product_id in
    (select product_id
     from category_product
     where category_id = some category id) and
p.is_active = 'true'
group by pav.attribute_value_id;

It would seem to me that although the above query suggests a normalized 
database structure, that joining with 3 tables plus a 4th table in the 
sub-query with an IN qualifier and grouping to get the product counts 
would take a VERY long time, especially on a possible result set of 
1,260,658 products.

Hmm - I'm not sure I'd say this was necessarily normalised. In the 
example you gave there were three definite types of attribute:
 1. Price range (< 20, 20-50, ...)
 2. Product type (lighting, rugs, ...)
 3. Store (art.com, homeannex, ...)
Your example discards this type information.

I'm also not sure it lets store A sell widgets for 19.99 and B for 25.99

So - let's look at how we might break this down into simple relations:
 product_types (product_id, prod_type, prod_subtype)
 product_availability (product_id, store_id, price_range)
and so on for each set of parameters.

Then, if PG isn't calculating fast enough I'd be tempted to throw in a 
summary table:
 product_counts(store_id, price_range, prod_type, prod_subtype, ..., 
num_products)
Then total over this for the top-level queries.

I'd also cache common top-level queries at the applicaton level anyway.

--
  Richard Huxton
  Archonet Ltd