Re: directory tree query with big planner variation

Michael Stone <mstone+postgres@xxxxxxxxx> · Mon, 31 Jul 2006 09:30:37 -0400

On Mon, Jul 31, 2006 at 01:54:24PM +0200, Axel Rau wrote:
Am 31.07.2006 um 13:15 schrieb Michael Stone:
On Mon, Jul 31, 2006 at 12:48:11PM +0200, Axel Rau wrote:
              WHERE P.path ~ '^%@/[^/]*/$' ) AS NLPC

This can't be indexed. You might try something like WHERE P.path  
LIKE '%@%' AND P.path ~ '^%@/[^/]*/$'

Ignore that, I wasn't awake yet.

Why does it quite well in this case:
---------------------------------------
->  Index Scan using path_name_idx on path p  (cost=0.00..3.02 rows=1  
width=97) (actual time=15.480..56.935 rows=27 loops=1)
      Index Cond: ((path >= '/Users/axel/Library/ 
Preferences/'::text) AND (path < '/Users/axel/Library/ 
Preferences0'::text))
      Filter: ((path ~ '^/Users/axel/Library/Preferences/[^/]*/ 
$'::text) AND (rtrim("replace"(path, '/Users/axel/Library/ 
Preferences/'::text, ''::text), '/'::text) <> ''::text))
---------------------------------------
as compared to this case(ignoring the index on path):
---------------------------------------
->  Index Scan using path_pkey on path p  (cost=0.00..2567.57  
rows=1941 width=97) (actual time=527.805..1521.911 rows=69 loops=1)
      Filter: ((path ~ '^/Users/axel/[^/]*/$'::text) AND (rtrim 
("replace"(path, '/Users/axel/'::text, ''::text), '/'::text) <>  
''::text))
---------------------------------------
? With all longer path names, I get the above (good)result.
Should I put the rtrim/replace on the client side?

That's not the slow part. The slow part is retrieving every single file 
for each of the subdirectories in order to determine whether there are 
any files in the subdirectories. 

The schema could be a lot more intelligent here. (E.g., store path  
seperately from file/directory name, store type (file or directory)  
seperately, etc.) Without improving the schema I don't think this  
will ever be a speed demon.

PATH holds complete pathnames of directories, FILENAME holds  
filenames and pathname components.
Currently the schema is the lowest common denominator between SQLite,  
MySQL and pg and the bacula people will stay with that (-;).

Nothing I suggested raises the bar for the "lowest common denominator". 
If I understand the intend of this SQL, you're pulling all the entries
in a directory in two parts. The first part (files) is fairly 
straightforward. The second part (directories) consists of pulling any 
file whose parent is a subdirectory of the directory you're looking for 
(this is *all* children of the directory, since you have to retrieve 
every element that begins with the directory, then discard those that 
have an additional / in their name), counting how many of these there 
are for each subdirectory, and discarding those results except for a 
binary (yes there are children or no there aren't). This is a lot of 
useless work to go through, and is going to be slow if you've got a lot 
of stuff in a subdirectory. An alternative approach would be, for each 
directory, to store all its children (files and subdirectories) along 
with a flag indicating which it is. This would allow you to create the 
collapsed tree view without walking all the children of a subdirectory.

Assuming you can't make changes to the schema, what about the query?
You've got this:

explain analyze  SELECT X.name AS name, COUNT(CH) > 1 AS children
 FROM
   ( SELECT  RTRIM( REPLACE( NLPC.path, '%@/', ''),'/') AS name,
                                                FN.name AS CH
       FROM
         ( SELECT P.path,P.pathid
             FROM bacula.path P
             WHERE P.path ~ '^%@/[^/]*/$' ) AS NLPC
         LEFT OUTER JOIN bacula.file F
           ON
             NLPC.pathid = F.pathid
         LEFT OUTER JOIN bacula.filename FN
           ON
             F.filenameid = FN.filenameid
       GROUP BY NLPC.path, FN.name
     UNION
     SELECT  FN.name AS name, FN.name AS CH
       FROM
         bacula.path P, bacula.file F, bacula.filename FN
       WHERE
         P.path = '%@/'   AND
         P.pathid = F.pathid                           AND
         F.filenameid = FN.filenameid
   ) AS X
 WHERE X.name <> ''
 GROUP BY X.name

I'm only looking at the first part, which reduces to 
SELECT X.name AS name, COUNT(CH) > 1 AS children FROM
 SELECT NLPC.path AS name, FN.name as CH
   FROM ( SELECT P.path,P.pathid FROM bacula.path AS NLPC
          LEFT OUTER JOIN bacula.file F ON NLPC.pathid=F.pathid
          LEFT OUTER JOIN bacula.filename FN ON F.filenameid=FN.filenameid
          GROUP BY NLPC.path,FN.name

Why is the filename table even accessed? Would the results be the 
same if you did
 SELECT NLPC.path AS name, F.fileid AS CH
and drop the LEFT OUTER JOIN bacula.filename altogether?

And then what happens if you try something like 
SELECT X.name,X.children
FROM 
      (SELECT [rtrim]P.path,(SELECT count(*) FROM bacula.file F 
                             WHERE F.pathid = P.pathid
                             LIMIT 2) > 1
         FROM bacula.path P
         WHERE P.path ~ '^%@/[^/]*/$'
       UNION
       SELECT FN.name,0
         FROM bacula.path P, bacula.file F, bacula.filename FN
         WHERE
           P.path = '%@/'   AND
           P.pathid = F.pathid                           AND
           F.filenameid = FN.filenameid
       ) AS X
WHERE X.name <> ''
GROUP BY X.name

It's hard to say without knowing what's actually *in* the tables, but 
the existing query definately doesn't scale well for what I think it's 
trying to do.

Mike Stone