Re: Using database to find file doublettes in my computer

Craig Ringer <craig@xxxxxxxxxxxxxxxxxxxxx> · Tue, 18 Nov 2008 15:40:23 +0900

Lothar Behrens wrote:

> But how to query for files to display a 'left / right view' for each
> file that is on multible places ?

One approach is to use a query to extract the names of all files with
duplicates, and store the results in a TEMPORARY table with a UNIQUE
index (or PRIMARY KEY) on the filename.

You then self-inner-join your paths table to its self, with the join
condition being that the filename part of the path EXISTS in the table
of files having duplicates. You should also filter out rows where the
first filename is <= the second filename to avoid outputting (a,b) and
(b,a) for each duplicate, and to avoid outputting (a,a) and (b,b) rows.

You can do much the same thing in a single query without the temp table,
but I think you'll need a correlated subquery to check for duplicates
(or a join on a subquery that'll really expand the results to be
processed), so the temp table approach is probably going to be lots faster.

Quick example code (untested but should give you the idea), assumes you
have a function get_filename(...) that extracts just the filename part
of the path:

CREATE TABLE paths (
   path PRIMARY KEY,
   --- other fields
);

-- Populate paths

-- Once paths is populated, extract duplicates:

SELECT get_filename(path) AS fn, count(path) AS n
FROM paths HAVING count(path) > 1
INTO TEMPORARY TABLE dup_files;

-- Creates UNIQUE index on PATH as well
ALTER TABLE dup_files ADD CONSTRAINT PRIMARY KEY (path);

-- Now build your side-by-side table of duplicates:

SELECT p1.path, p2.path
FROM paths p1 INNER JOIN paths p2
     ON (get_filename(p1.path) = get_filename(p2.path))
WHERE EXISTS(SELECT 1 FROM dup_files WHERE fn = get_filename(p1.path))
  AND p1.path > p2.path;

You can replace get_filename(fn) with appropriate code, but I'd write a
quick SQL function marked IMMUTABLE to neatly wrap up the pathname
extraction instead.

--
Craig Ringer

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general