Re: directory tree query with big planner variation

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Jul 31, 2006 at 01:54:24PM +0200, Axel Rau wrote:
Am 31.07.2006 um 13:15 schrieb Michael Stone:
On Mon, Jul 31, 2006 at 12:48:11PM +0200, Axel Rau wrote:
              WHERE P.path ~ '^%@/[^/]*/$' ) AS NLPC

This can't be indexed. You might try something like WHERE P.path LIKE '%@%' AND P.path ~ '^%@/[^/]*/$'

Ignore that, I wasn't awake yet.

Why does it quite well in this case:
---------------------------------------
-> Index Scan using path_name_idx on path p (cost=0.00..3.02 rows=1 width=97) (actual time=15.480..56.935 rows=27 loops=1) Index Cond: ((path >= '/Users/axel/Library/ Preferences/'::text) AND (path < '/Users/axel/Library/ Preferences0'::text)) Filter: ((path ~ '^/Users/axel/Library/Preferences/[^/]*/ $'::text) AND (rtrim("replace"(path, '/Users/axel/Library/ Preferences/'::text, ''::text), '/'::text) <> ''::text))
---------------------------------------
as compared to this case(ignoring the index on path):
---------------------------------------
-> Index Scan using path_pkey on path p (cost=0.00..2567.57 rows=1941 width=97) (actual time=527.805..1521.911 rows=69 loops=1) Filter: ((path ~ '^/Users/axel/[^/]*/$'::text) AND (rtrim ("replace"(path, '/Users/axel/'::text, ''::text), '/'::text) <> ''::text))
---------------------------------------
? With all longer path names, I get the above (good)result.
Should I put the rtrim/replace on the client side?

That's not the slow part. The slow part is retrieving every single file for each of the subdirectories in order to determine whether there are any files in the subdirectories.
The schema could be a lot more intelligent here. (E.g., store path seperately from file/directory name, store type (file or directory) seperately, etc.) Without improving the schema I don't think this will ever be a speed demon.

PATH holds complete pathnames of directories, FILENAME holds filenames and pathname components. Currently the schema is the lowest common denominator between SQLite, MySQL and pg and the bacula people will stay with that (-;).

Nothing I suggested raises the bar for the "lowest common denominator". If I understand the intend of this SQL, you're pulling all the entries in a directory in two parts. The first part (files) is fairly straightforward. The second part (directories) consists of pulling any file whose parent is a subdirectory of the directory you're looking for (this is *all* children of the directory, since you have to retrieve every element that begins with the directory, then discard those that have an additional / in their name), counting how many of these there are for each subdirectory, and discarding those results except for a binary (yes there are children or no there aren't). This is a lot of useless work to go through, and is going to be slow if you've got a lot of stuff in a subdirectory. An alternative approach would be, for each directory, to store all its children (files and subdirectories) along with a flag indicating which it is. This would allow you to create the collapsed tree view without walking all the children of a subdirectory.

Assuming you can't make changes to the schema, what about the query?
You've got this:

explain analyze  SELECT X.name AS name, COUNT(CH) > 1 AS children
 FROM
   ( SELECT  RTRIM( REPLACE( NLPC.path, '%@/', ''),'/') AS name,
                                                FN.name AS CH
       FROM
         ( SELECT P.path,P.pathid
             FROM bacula.path P
             WHERE P.path ~ '^%@/[^/]*/$' ) AS NLPC
         LEFT OUTER JOIN bacula.file F
           ON
             NLPC.pathid = F.pathid
         LEFT OUTER JOIN bacula.filename FN
           ON
             F.filenameid = FN.filenameid
       GROUP BY NLPC.path, FN.name
     UNION
     SELECT  FN.name AS name, FN.name AS CH
       FROM
         bacula.path P, bacula.file F, bacula.filename FN
       WHERE
         P.path = '%@/'   AND
         P.pathid = F.pathid                           AND
         F.filenameid = FN.filenameid
   ) AS X
 WHERE X.name <> ''
 GROUP BY X.name

I'm only looking at the first part, which reduces to SELECT X.name AS name, COUNT(CH) > 1 AS children FROM
 SELECT NLPC.path AS name, FN.name as CH
   FROM ( SELECT P.path,P.pathid FROM bacula.path AS NLPC
          LEFT OUTER JOIN bacula.file F ON NLPC.pathid=F.pathid
          LEFT OUTER JOIN bacula.filename FN ON F.filenameid=FN.filenameid
          GROUP BY NLPC.path,FN.name

Why is the filename table even accessed? Would the results be the same if you did
 SELECT NLPC.path AS name, F.fileid AS CH
and drop the LEFT OUTER JOIN bacula.filename altogether?

And then what happens if you try something like SELECT X.name,X.children FROM (SELECT [rtrim]P.path,(SELECT count(*) FROM bacula.file F WHERE F.pathid = P.pathid
                             LIMIT 2) > 1
         FROM bacula.path P
         WHERE P.path ~ '^%@/[^/]*/$'
       UNION
       SELECT FN.name,0
         FROM bacula.path P, bacula.file F, bacula.filename FN
         WHERE
           P.path = '%@/'   AND
           P.pathid = F.pathid                           AND
           F.filenameid = FN.filenameid
       ) AS X
WHERE X.name <> ''
GROUP BY X.name

It's hard to say without knowing what's actually *in* the tables, but the existing query definately doesn't scale well for what I think it's trying to do.

Mike Stone


[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux