Re: UNION versus SUB SELECT

Robert DiFalco <robert.difalco@xxxxxxxxx> · Thu, 21 Nov 2013 14:04:02 -0800

Hmmmm...I'm not sure why the buffers option didn't work for me, maybe the heroku psql is out of date. No, the query gets slower with a high load of data and runs pretty often.
I just created a small test dataset for this. When I have a larger one I will post new explain plans but the timings seem pretty consistent regardless of the results returns (usually only 2-200) even when there are millions of records in "contacts", "phone_numbers", and "email_addresses".

In this case doesn't the correlated query have to do more work and access more columns than the subselect approach?

On Thu, Nov 21, 2013 at 1:22 PM, Elliot <yields.falsehood@xxxxxxxxx> wrote:

    On 2013-11-21 16:12, Robert DiFalco
      wrote:

      Sorry I couldn't get buffers to work but here is
        the explain analyze verbose:

          dft1fjfv106r48=> explain analyze verbose select c.*  

                                                                 from
            contacts c                                                  

                                                     where c.id IN (  

                                             select p.contact_id        

                               from phone_numbers p                    

                  where (p.national = 5038904993 and p.e164 =
            '+15038904993'))                                            
                                                                      or
            c.id IN (  

                                                select e.contact_id    

                                   from email_addresses e              

                      where e.email = 'robert.xxxxx@xxxxxxxxx')

             ;

                       QUERY PLAN                                      

          ----------------------------------------------------------------------------------------------------------------------------------------------------
           Seq Scan on public.contacts c  (cost=8.12..75.73
            rows=1988 width=95) (actual time=0.410..0.410 rows=0
            loops=1)
             Output: c.id,
            c.owner_id, c.user_id, c.device_id, c.last_call,
            c.record_id, c.dtype, c.blocked, c.details_hash, c.fname,
            c.lname, c.fb_id
             Filter: ((hashed SubPlan 1) OR (hashed SubPlan 2))
             Rows Removed by Filter: 2849
             SubPlan 1
               ->  Index Scan using idx_phone_address on
            public.phone_numbers p  (cost=0.06..4.06 rows=1 width=8)
            (actual time=0.015..0.015 rows=0 loops=1)
                     Output: p.contact_id
                     Index Cond: ((p."national" =
            5038904993::bigint) AND ((p.e164)::text =
            '+15038904993'::text))
             SubPlan 2
               ->  Index Scan using idx_email_address on
            public.email_addresses e  (cost=0.06..4.06 rows=1 width=8)
            (actual time=0.018..0.018 rows=0 loops=1)
                     Output: e.contact_id
                     Index Cond: ((e.email)::text = 'robert.xxxxx@xxxxxxxxx'::text)
           Total runtime: 0.489 ms
          (13 rows)

          dft1fjfv106r48=> explain analyze verbose select c.*  

                                                                 from
            contacts c                                                  

                                                     where exists(      

                                           select 1                    

                              from phone_numbers p                      

                where (p.national = 5038904993 and p.e164 =
            '+15038904993') and p.contact_id = c.id)      

                        or EXISTS(                                      

              select 1                                                  

                                                            from
            email_addresses e                                          

                                                      where e.email = 'robert.xxxxx@xxxxxxxxx'
            and e.contact_id = c.id)                              

                   ;

                        QUERY PLAN                                      

          ------------------------------------------------------------------------------------------------------------------------------------------------------
           Seq Scan on public.contacts c  (cost=0.00..21596.38
            rows=1988 width=95) (actual time=0.479..0.479 rows=0
            loops=1)
             Output: c.id,
            c.owner_id, c.user_id, c.device_id, c.last_call,
            c.record_id, c.dtype, c.blocked, c.details_hash, c.fname,
            c.lname, c.fb_id
             Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2)
            OR (alternatives: SubPlan 3 or hashed SubPlan 4))
             Rows Removed by Filter: 2849
             SubPlan 1
               ->  Index Scan using idx_phone_address on
            public.phone_numbers p  (cost=0.06..4.06 rows=1 width=0)
            (never executed)
                     Index Cond: ((p."national" =
            5038904993::bigint) AND ((p.e164)::text =
            '+15038904993'::text))
                     Filter: (p.contact_id = c.id)
             SubPlan 2
               ->  Index Scan using idx_phone_address on
            public.phone_numbers p_1  (cost=0.06..4.06 rows=1 width=8)
            (actual time=0.010..0.010 rows=0 loops=1)
                     Output: p_1.contact_id
                     Index Cond: ((p_1."national" =
            5038904993::bigint) AND ((p_1.e164)::text =
            '+15038904993'::text))
             SubPlan 3
               ->  Index Scan using idx_email_address on
            public.email_addresses e  (cost=0.06..4.06 rows=1 width=0)
            (never executed)
                     Index Cond: ((e.email)::text = 'robert.xxxxx@xxxxxxxxx'::text)
                     Filter: (e.contact_id = c.id)
             SubPlan 4
               ->  Index Scan using idx_email_address on
            public.email_addresses e_1  (cost=0.06..4.06 rows=1 width=8)
            (actual time=0.016..0.016 rows=0 loops=1)
                     Output: e_1.contact_id
                     Index Cond: ((e_1.email)::text = 'robert.xxxxx@xxxxxxxxx'::text)
           Total runtime: 0.559 ms

            (21 rows)

          dft1fjfv106r48=> explain analyze verbose select * from
            contacts where id IN (                                      
                                                               (select c.id        

                                      from contacts c                  

                         join phone_numbers p on c.id =
            p.contact_id and p.national = 5038904993 and p.e164 =
            '+15038904993')                                            
                                  union (select c.id        

                                from contacts c                        

                   join email_addresses e on c.id = e.contact_id and e.email = 'robert.xxxxx@xxxxxxxxx'));

                     QUERY PLAN                                        

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

           Nested Loop  (cost=16.31..24.39 rows=2 width=95) (actual
            time=0.060..0.060 rows=0 loops=1)
             Output: contacts.id,
            contacts.owner_id, contacts.user_id, contacts.device_id,
            contacts.last_call, contacts.record_id, contacts.dtype,
            contacts.blocked, contacts.details_hash, contacts.fname,
            contacts.lname, contacts.fb_id
             ->  Unique  (cost=16.26..16.26 rows=2 width=8)
            (actual time=0.057..0.057 rows=0 loops=1)
                   Output: c.id
                   ->  Sort  (cost=16.26..16.26 rows=2 width=8)
            (actual time=0.055..0.055 rows=0 loops=1)
                         Output: c.id
                         Sort Key: c.id
                         Sort Method: quicksort  Memory: 25kB
                         ->  Append  (cost=0.11..16.25 rows=2
            width=8) (actual time=0.034..0.034 rows=0 loops=1)
                               ->  Nested Loop  (cost=0.11..8.12
            rows=1 width=8) (actual time=0.013..0.013 rows=0 loops=1)
                                     Output: c.id
                                     ->  Index Scan using
            idx_phone_address on public.phone_numbers p
             (cost=0.06..4.06 rows=1 width=8) (actual time=0.011..0.011
            rows=0 loops=1)
                                           Output: p.id,
            p.contact_id, p."national", p.e164, p.raw_number
                                           Index Cond:
            ((p."national" = 5038904993::bigint) AND ((p.e164)::text =
            '+15038904993'::text))
                                     ->  Index Only Scan using
            idx_contacts_pkey_owner on public.contacts c
             (cost=0.06..4.06 rows=1 width=8) (never executed)
                                           Output: c.id,
            c.owner_id, c.user_id
                                           Index Cond: (c.id =
            p.contact_id)
                                           Heap Fetches: 0
                               ->  Nested Loop  (cost=0.11..8.12
            rows=1 width=8) (actual time=0.018..0.018 rows=0 loops=1)
                                     Output: c_1.id
                                     ->  Index Scan using
            idx_email_address on public.email_addresses e
             (cost=0.06..4.06 rows=1 width=8) (actual time=0.016..0.016
            rows=0 loops=1)
                                           Output: e.id,
            e.contact_id, e.email
                                           Index Cond:
            ((e.email)::text = 'robert.xxxxx@xxxxxxxxx'::text)
                                     ->  Index Only Scan using
            idx_contacts_pkey_owner on public.contacts c_1
             (cost=0.06..4.06 rows=1 width=8) (never executed)
                                           Output: c_1.id,
            c_1.owner_id, c_1.user_id
                                           Index Cond: (c_1.id =
            e.contact_id)
                                           Heap Fetches: 0
             ->  Index Scan using idx_contacts_pkey_owner on
            public.contacts  (cost=0.06..4.06 rows=1 width=95) (never
            executed)
                   Output: contacts.id,
            contacts.owner_id, contacts.user_id, contacts.device_id,
            contacts.last_call, contacts.record_id, contacts.dtype,
            contacts.blocked, contacts.details_hash, contacts.fname,
            contacts.lname, contacts.fb_id
                   Index Cond: (contacts.id = c.id)
           Total runtime: 0.332 ms
          (31 rows)

    The buffers option is 9.0+ and is used like "explain (analyze,
    verbose, buffers) select 1".

    To your original question, the union output there runs slightly
    faster than the "in" approach, although this may not be a good
    example - your inputs don't return any data, so this might not be
    realistic - and those numbers are so low that the difference might
    just be noise.

    Are you tuning a <0.5ms-runtime query? Or is this just curiosity?
    FWIW I tend to write queries like this using an exists check first,
    then if that's still not good enough (all things like proper
    indexing taken in to account) I'll try an in check, then finally a
    union if that's still not good enough.