Re: faster way to calculate top "tags" for a "resource" based on a column

Marc Mamin <M.Mamin@xxxxxxxxxxxx> · Tue, 7 Oct 2014 14:02:05 +0000

>I've been able to fix most of my slow queries into something more acceptable, but I haven't been able to shave any time off this one.  I'm hoping someone has another strategy.
>
>I have 2 tables:
>        resource
>        resource_2_tag
>
>I want to calculate the top 25 "tag_ids" in "resource_2_tag " for resources that match a given attribute on the "resource" table.
>
>both tables have around 1.6million records. 
>
>If the database needs to warm up and read into cache, this can take 60seconds to read the data off disk. 
>If the database doesn't need to warm up, it averages 1.76seconds. 
>
>The 1.76s time is troubling me.
>Searching for the discrete elements of this is pretty lightweight. 
>
>here's an explain --  http://explain.depesz.com/s/PndC
>
>I tried a subquery instead of a join, and the query optimized the plan to the same.
>
>i'm hoping someone will see something that I just don't see.
>
>
>
>      Table "public.resource_2_tag"
>        Column         |  Type   | Modifiers
>-----------------------+---------+-----------
> resource_id           | integer |
> tag_id                | integer |
>Indexes:
>    "_idx_speed_resource_2_tag__resource_id" btree (resource_id)
>    "_idx_speed_resource_2_tag__tag_id" btree (tag_id)
>
>                                                  Table "public.resource"
>               Column                |            Type             |                        Modifiers                        
>-------------------------------------+-----------------------------+----------------------------------------------------------
> id                                  | integer                     | not null default nextval('resource_id_seq'::regclass)
>resource_attribute1_id               | integer                     |
>lots of other columns                |                             |
>Indexes:
>        "resource_attribute1_idx" btree (resource_attribute1_id)
>
>--------------------------------------------------------------------------------
>
>select count(*) from resource;
>-- 1669729
>
>select count(*) from resource_2_tag;
>-- 1676594
>
>select count(*) from resource where resource_attribute1_id = 614;
>-- 5184
>-- 4.386ms
>
>select id from resource where resource_attribute1_id = 614;
>-- 5184
>-- 87.303ms
>
>popping the 5k elements into an "in" clause, will run the query in around 100ms.
>
>
>EXPLAIN ANALYZE
>SELECT
>        resource_2_tag.tag_id AS resource_2_tag_tag_id,
>        count(resource_2_tag.tag_id) AS counted
>FROM
>        resource_2_tag
>JOIN resource ON resource.id = resource_2_tag.resource_id
>WHERE
>        resource.resource_attribute1_id = 614
>GROUP BY resource_2_tag.tag_id
>ORDER BY counted DESC
>LIMIT 25 OFFSET 0;

Hi,
it seems to me that your subquery may deliver duplicate ids.
And with the selectivity of your example, I would expect an index usage 
instead of a table scan. You may check how up to date your statistics are
and try to raise the statistic target on the column resource_2_tag.tag_id.
Also try a CTE form for your query:

WITH A as (SELECT DISTINCT id FROM resource
           WHERE resource_attribute1_id = 614
          )
SELECT
        resource_2_tag.tag_id AS resource_2_tag_tag_id,
        count(resource_2_tag.tag_id) AS counted
FROM
        resource_2_tag
JOIN A ON A.id = resource_2_tag.resource_id
ORDER BY counted DESC
LIMIT 25;

regards,
Marc Mamin

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general