Search Postgresql Archives

Optimising "full outer join where" for muti-row to multi-column view

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Dear PostgreSQL experts,

I have a database that records the EXIF data for a collection of 
photos.  (EXIF is a method for embedding arbitary name-value data 
in a JPEG, and digital cameras typically use it to record things 
like exposure information.)  My exif table looks something like 
this:

 photo_id |           tag            |          value                     
----------+--------------------------+--------------------------
     1001 | DateTime                 | 2006:10:26 11:19:29
     1001 | Orientation              | top - left
     1001 | PixelXDimension          | 3888
     1001 | PixelYDimension          | 2592
     1002 | DateTimeOriginal         | 2006:10:26 13:34:06
     1002 | Orientation              | left - bottom
     1002 | PixelXDimension          | 3888
     1002 | PixelYDimension          | 2592

photo_id and tag together form the primary key.  The data also 
includes many tags that I'm not currently interested in.  From 
this I create a view containing only the tags of interest:

 photo_id |  orientation  |      datetime       | xsize | ysize 
----------+---------------+---------------------+-------+-------
     1001 | top - left    | 2006:10:30 11:19:29 | 3888  | 2592
     1002 | left - bottom | 2006:10:26 13:34:06 | 3888  | 2592

My first attempt did this in the obvious way by joining the 
exif table with itself once per column:

create view photo_info_v as
select photo_id,
       e1.value as orientation, 
       e2.value as datetime, 
       e3.value as xsize,
       e4.value as ysize
  from exif e1
  join exif e2 using(photo_id)
  join exif e3 using(photo_id)
  join exif e4 using(photo_id)
  where e1.tag='Orientation' 
    and e2.tag='DateTime' 
    and e3.tag='PixelXDimension' 
    and e4.tag='PixelYDimension';

This works well for one important query, where I find one photo's 
information from the view:

photos=> explain analyse select * from photo_info_v where photo_id=1201;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..12.09 rows=1 width=60) (actual time=1.198..1.290 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..9.07 rows=1 width=46) (actual time=0.953..1.009 rows=1 loops=1)
         ->  Nested Loop  (cost=0.00..6.04 rows=1 width=32) (actual time=0.693..0.731 rows=1 loops=1)
               ->  Index Scan using exif_pkey on exif e1  (cost=0.00..3.02 rows=1 width=18) (actual time=0.384..0.394 rows=1 loops=1)
                     Index Cond: ((photo_id = 1201) AND (tag = 'Orientation'::text))
               ->  Index Scan using exif_pkey on exif e4  (cost=0.00..3.02 rows=1 width=18) (actual time=0.189..0.205 rows=1 loops=1)
                     Index Cond: ((1201 = photo_id) AND (tag = 'PixelYDimension'::text))
         ->  Index Scan using exif_pkey on exif e3  (cost=0.00..3.02 rows=1 width=18) (actual time=0.186..0.194 rows=1 loops=1)
               Index Cond: ((1201 = photo_id) AND (tag = 'PixelXDimension'::text))
   ->  Index Scan using exif_pkey on exif e2  (cost=0.00..3.02 rows=1 width=18) (actual time=0.171..0.195 rows=1 loops=1)
         Index Cond: ((1201 = photo_id) AND (tag = 'DateTime'::text))
 Total runtime: 3.064 ms

However, I might just want one column from the view:

photos=> explain analyse select orientation from photo_info_v where photo_id=1201;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..12.09 rows=1 width=14) (actual time=1.266..1.353 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..9.07 rows=1 width=18) (actual time=1.009..1.062 rows=1 loops=1)
         ->  Nested Loop  (cost=0.00..6.04 rows=1 width=18) (actual time=0.752..0.787 rows=1 loops=1)
               ->  Index Scan using exif_pkey on exif e1  (cost=0.00..3.02 rows=1 width=18) (actual time=0.410..0.418 rows=1 loops=1)
                     Index Cond: ((photo_id = 1201) AND (tag = 'Orientation'::text))
               ->  Index Scan using exif_pkey on exif e4  (cost=0.00..3.02 rows=1 width=4) (actual time=0.183..0.199 rows=1 loops=1)
                     Index Cond: ((1201 = photo_id) AND (tag = 'PixelYDimension'::text))
         ->  Index Scan using exif_pkey on exif e3  (cost=0.00..3.02 rows=1 width=4) (actual time=0.168..0.176 rows=1 loops=1)
               Index Cond: ((1201 = photo_id) AND (tag = 'PixelXDimension'::text))
   ->  Index Scan using exif_pkey on exif e2  (cost=0.00..3.02 rows=1 width=4) (actual time=0.168..0.191 rows=1 loops=1)
         Index Cond: ((1201 = photo_id) AND (tag = 'DateTime'::text))
 Total runtime: 3.123 ms

I only wanted the orientation information, which could be found 
using this faster query:

photos=> explain analyse select value from exif where photo_id=1201 and tag='Orientation';
                                                   QUERY PLAN                                                    
-----------------------------------------------------------------------------------------------------------------
 Index Scan using exif_pkey on exif  (cost=0.00..3.02 rows=1 width=14) (actual time=0.504..0.536 rows=1 loops=1)
   Index Cond: ((photo_id = 1201) AND (tag = 'Orientation'::text))
 Total runtime: 1.298 ms

Instead it builds up all four columns of the view and discards three 
of them.  It has to do this, because if data for any one of the four 
columns did not exist then the whole row would not exist in the view.  
I know that all of the data is always present, but PostgreSQL doesn't 
know that, and I don't think there is any way that I can tell it that
it does.

To try and avoid this, I tried using a full outer join in the view 
definition.  In this case the row would always be present in the view, 
even if the data for the other columns were not present.  I hoped that 
the query would then be optimised to look up only the orientation 
information:

create view photo_info_v as
select photo_id,
       e1.value as orientation, 
       e2.value as datetime, 
       e3.value as xsize,
       e4.value as ysize
  from exif e1
  full outer join exif e2 using(photo_id)
  full outer join exif e3 using(photo_id)
  full outer join exif e4 using(photo_id)
  where e1.tag='Orientation' 
    and e2.tag='DateTime' 
    and e3.tag='PixelXDimension' 
    and e4.tag='PixelYDimension';

photos=> explain analyse select orientation from photo_info_v where photo_id=1201;
                                                                         QUERY PLAN                                                                         
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=2431.73..3198.49 rows=7 width=14) (actual time=2256.194..2345.563 rows=1 loops=1)
   Merge Cond: ("outer".photo_id = "inner"."?column5?")
   Join Filter: (COALESCE("inner"."?column5?", "outer".photo_id) = 1201)
   ->  Index Scan using exif_pkey on exif e4  (cost=0.00..740.62 rows=1229 width=4) (actual time=1.937..462.014 rows=1229 loops=1)
         Index Cond: (tag = 'PixelYDimension'::text)
   ->  Sort  (cost=2431.73..2434.99 rows=1302 width=26) (actual time=1766.523..1802.629 rows=1229 loops=1)
         Sort Key: COALESCE(COALESCE(e1.photo_id, e2.photo_id), e3.photo_id)
         ->  Merge Join  (cost=1601.15..2364.38 rows=1302 width=26) (actual time=931.551..1604.342 rows=1229 loops=1)
               Merge Cond: ("outer".photo_id = "inner"."?column4?")
               ->  Index Scan using exif_pkey on exif e3  (cost=0.00..740.62 rows=1229 width=4) (actual time=0.240..519.953 rows=1229 loops=1)
                     Index Cond: (tag = 'PixelXDimension'::text)
               ->  Sort  (cost=1601.15..1604.37 rows=1286 width=22) (actual time=931.092..971.617 rows=1229 loops=1)
                     Sort Key: COALESCE(e1.photo_id, e2.photo_id)
                     ->  Merge Join  (cost=0.00..1534.74 rows=1286 width=22) (actual time=0.467..797.332 rows=1229 loops=1)
                           Merge Cond: ("outer".photo_id = "inner".photo_id)
                           ->  Index Scan using exif_pkey on exif e1  (cost=0.00..740.62 rows=1229 width=18) (actual time=0.193..271.369 rows=1229 loops=1)
                                 Index Cond: (tag = 'Orientation'::text)
                           ->  Index Scan using exif_pkey on exif e2  (cost=0.00..740.62 rows=1229 width=4) (actual time=0.118..395.892 rows=1229 loops=1)
                                 Index Cond: (tag = 'DateTime'::text)
 Total runtime: 2350.601 ms

Ooops!  It has got worse not better.  The runtime has increased by 
three orders of magnitude, because it seems to compute most of the 
view before filtering.  It boils down to this query:

photos=> select e1.value as orientation 
photos->   from exif e1
photos->   full outer join exif e2 using(photo_id)
photos->   full outer join exif e3 using(photo_id)
photos->   full outer join exif e4 using(photo_id)
photos->   where e1.tag='Orientation' 
photos->     and e2.tag='DateTime' 
photos->     and e3.tag='PixelXDimension' 
photos->     and e4.tag='PixelYDimension'
photos->     and photo_id=1201;

I would like it to optimise away the unused joins, leaving:

photos=> explain analyse select e1.value as orientation 
photos->   from exif e1
photos->   where e1.tag='Orientation' 
photos->   and photo_id=1201;

Is there a good reason why it cannot do this?  Remeber that photo_id and 
tag are the primary key for exif, so the view cannot have more than one
row per photo_id.


This is not the first problem I have had with non-trivial views, where 
PostgreSQL has failed to simplify a query on the view as I had hoped it
would, either because the semantics of SQL mean that it is unable to
(with no way of describing the additional constraints that apply to
that data and could make the simplification possible), or because the
query optimiser doesn't detect a possible optimisation.  Back in 7.4
days I resorted to materialised views updated using triggers, which
involved enormously more work than a normal view.  Has anything been 
done, or is anything planned, to make this easier?

I imagine that the basic problem here, projecting data from multiple 
rows into a single row with multiple columns, is a fairly common one.  
Are there any tricks to doing this that anyone can share?

I'm using PostgreSQL 8.1.

Many thanks,

Phil.

(You are welcome to CC: me in any replies.)





[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux