Re: directory tree query with big planner variation

Mark Lewis <mark.lewis@xxxxxxxx> · Mon, 31 Jul 2006 06:53:21 -0700

It seems like you might be able to avoid the expensive directory lookups
entirely without changing the schema by defining an immutable function
dir_depth(path), which would just count the number of forward slashes.
Then create a functional index on dir_depth(path) and in the query do a
check for directories with a given prefix and the expected dir_depth.

In 8.1 and later, this kind of query might be able to use a bitmap index
combining thingamajigger (the actual name escapes me right now).

This is just a hunch and I haven't looked too carefully at the
schema/query requirements to see if its feasible, but seems like a
reasonable approach.

-- Mark Lewis

On Mon, 2006-07-31 at 09:30 -0400, Michael Stone wrote:
> On Mon, Jul 31, 2006 at 01:54:24PM +0200, Axel Rau wrote:
> >Am 31.07.2006 um 13:15 schrieb Michael Stone:
> >>On Mon, Jul 31, 2006 at 12:48:11PM +0200, Axel Rau wrote:
> >>>               WHERE P.path ~ '^%@/[^/]*/$' ) AS NLPC
> >>
> >>This can't be indexed. You might try something like WHERE P.path  
> >>LIKE '%@%' AND P.path ~ '^%@/[^/]*/$'
> 
> Ignore that, I wasn't awake yet.
> 
> >Why does it quite well in this case:
> >---------------------------------------
> >->  Index Scan using path_name_idx on path p  (cost=0.00..3.02 rows=1  
> >width=97) (actual time=15.480..56.935 rows=27 loops=1)
> >       Index Cond: ((path >= '/Users/axel/Library/ 
> >Preferences/'::text) AND (path < '/Users/axel/Library/ 
> >Preferences0'::text))
> >       Filter: ((path ~ '^/Users/axel/Library/Preferences/[^/]*/ 
> >$'::text) AND (rtrim("replace"(path, '/Users/axel/Library/ 
> >Preferences/'::text, ''::text), '/'::text) <> ''::text))
> >---------------------------------------
> >as compared to this case(ignoring the index on path):
> >---------------------------------------
> >->  Index Scan using path_pkey on path p  (cost=0.00..2567.57  
> >rows=1941 width=97) (actual time=527.805..1521.911 rows=69 loops=1)
> >       Filter: ((path ~ '^/Users/axel/[^/]*/$'::text) AND (rtrim 
> >("replace"(path, '/Users/axel/'::text, ''::text), '/'::text) <>  
> >''::text))
> >---------------------------------------
> >? With all longer path names, I get the above (good)result.
> >Should I put the rtrim/replace on the client side?
> 
> That's not the slow part. The slow part is retrieving every single file 
> for each of the subdirectories in order to determine whether there are 
> any files in the subdirectories. 
> 
> >>The schema could be a lot more intelligent here. (E.g., store path  
> >>seperately from file/directory name, store type (file or directory)  
> >>seperately, etc.) Without improving the schema I don't think this  
> >>will ever be a speed demon.
> 
> >PATH holds complete pathnames of directories, FILENAME holds  
> >filenames and pathname components.
> >Currently the schema is the lowest common denominator between SQLite,  
> >MySQL and pg and the bacula people will stay with that (-;).
> 
> Nothing I suggested raises the bar for the "lowest common denominator". 
> If I understand the intend of this SQL, you're pulling all the entries
> in a directory in two parts. The first part (files) is fairly 
> straightforward. The second part (directories) consists of pulling any 
> file whose parent is a subdirectory of the directory you're looking for 
> (this is *all* children of the directory, since you have to retrieve 
> every element that begins with the directory, then discard those that 
> have an additional / in their name), counting how many of these there 
> are for each subdirectory, and discarding those results except for a 
> binary (yes there are children or no there aren't). This is a lot of 
> useless work to go through, and is going to be slow if you've got a lot 
> of stuff in a subdirectory. An alternative approach would be, for each 
> directory, to store all its children (files and subdirectories) along 
> with a flag indicating which it is. This would allow you to create the 
> collapsed tree view without walking all the children of a subdirectory.
> 
> Assuming you can't make changes to the schema, what about the query?
> You've got this:
> 
> explain analyze  SELECT X.name AS name, COUNT(CH) > 1 AS children
>   FROM
>     ( SELECT  RTRIM( REPLACE( NLPC.path, '%@/', ''),'/') AS name,
>                                                  FN.name AS CH
>         FROM
>           ( SELECT P.path,P.pathid
>               FROM bacula.path P
>               WHERE P.path ~ '^%@/[^/]*/$' ) AS NLPC
>           LEFT OUTER JOIN bacula.file F
>             ON
>               NLPC.pathid = F.pathid
>           LEFT OUTER JOIN bacula.filename FN
>             ON
>               F.filenameid = FN.filenameid
>         GROUP BY NLPC.path, FN.name
>       UNION
>       SELECT  FN.name AS name, FN.name AS CH
>         FROM
>           bacula.path P, bacula.file F, bacula.filename FN
>         WHERE
>           P.path = '%@/'   AND
>           P.pathid = F.pathid                           AND
>           F.filenameid = FN.filenameid
>     ) AS X
>   WHERE X.name <> ''
>   GROUP BY X.name
> 
> I'm only looking at the first part, which reduces to 
> SELECT X.name AS name, COUNT(CH) > 1 AS children FROM
>   SELECT NLPC.path AS name, FN.name as CH
>     FROM ( SELECT P.path,P.pathid FROM bacula.path AS NLPC
>            LEFT OUTER JOIN bacula.file F ON NLPC.pathid=F.pathid
>            LEFT OUTER JOIN bacula.filename FN ON F.filenameid=FN.filenameid
>            GROUP BY NLPC.path,FN.name
> 
> Why is the filename table even accessed? Would the results be the 
> same if you did
>   SELECT NLPC.path AS name, F.fileid AS CH
> and drop the LEFT OUTER JOIN bacula.filename altogether?
> 
> And then what happens if you try something like 
> SELECT X.name,X.children
>  FROM 
>        (SELECT [rtrim]P.path,(SELECT count(*) FROM bacula.file F 
>                               WHERE F.pathid = P.pathid
>                               LIMIT 2) > 1
>           FROM bacula.path P
>           WHERE P.path ~ '^%@/[^/]*/$'
>         UNION
>         SELECT FN.name,0
>           FROM bacula.path P, bacula.file F, bacula.filename FN
>           WHERE
>             P.path = '%@/'   AND
>             P.pathid = F.pathid                           AND
>             F.filenameid = FN.filenameid
>         ) AS X
>  WHERE X.name <> ''
>  GROUP BY X.name
> 
> It's hard to say without knowing what's actually *in* the tables, but 
> the existing query definately doesn't scale well for what I think it's 
> trying to do.
> 
> Mike Stone
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster