Lothar Behrens wrote:
I have expected a smaller amount of records due to the fact that for 4
files each available 2 times (sum = 8) I
have 8 records in ECADFiles, but must have 4 in the above result.
So for an average of 2 doubles I expected half the files from
ECADFiles, because one is exactly right and the other
is on the left.
It's a combinatorial problem. If you have 3 duplicate files, eg:
INSERT INTO paths (path) VALUES
(E'C:\\path\\file1.txt'),
(E'C:\\path2\\file1.txt'),
(E'/path/file1.txt');
then the query process I described above will output the matches:
C:\path\file1.txt | C:\path2\file1.txt
/path/file1.txt | C:\path2\file1.txt
/path/file1.txt | C:\path\file1.txt
because while it avoids showing both (A,B) and (B,A) pairs, for any A, B
and C it'll show:
(A,B)
(A,C)
(B,C)
I've attached test SQL that does the above.
Presumably, you want to only show, say:
(A,B)
(A,C)
or maybe:
(filename, A)
(filename, B)
(filename, C)
If that's what you want, you need to work a little differently. The
attached SQL in dups_test2.sql shows one way you might do it, by
generating a list of files with duplicates then listing all the
locations each appears in. Again, you can do it without the temp table,
it'll probably just be slower. None of what I've written is particularly
fast anyway - it evaluates those regular expressions many more times
than should be required, for example.
--
Craig Ringer
BEGIN;
CREATE TABLE paths (
path TEXT PRIMARY KEY
);
CREATE OR REPLACE FUNCTION get_filename(text, text) RETURNS VARCHAR AS $$
SELECT (regexp_split_to_array($1, $2))[array_upper(regexp_split_to_array($1, $2),1)];
$$ LANGUAGE SQL IMMUTABLE;
COMMENT ON FUNCTION get_filename(text, text) IS 'Extract filename part from path $1 using path separator $2';
CREATE OR REPLACE FUNCTION get_filename(text) RETURNS VARCHAR AS $$
SELECT get_filename($1, E'[/\\\\]');
$$ LANGUAGE SQL IMMUTABLE;
COMMENT ON FUNCTION get_filename(text, text) IS E'Extract filename part from path $1 using path separator / or \\';
INSERT INTO paths (path) VALUES
(E'C:\\path\\file1.txt'),
(E'C:\\path2\\file1.txt'),
(E'/path/file1.txt'),
(E'C:\\somewhere\\file2.txt'),
(E'/random/place/file2.txt'),
(E'/orphans/file3.blah');
COMMIT;
BEGIN;
SELECT
get_filename(path) AS fn, count(path) AS n
INTO TEMPORARY TABLE dup_files
FROM paths
GROUP BY get_filename(path)
HAVING count(path) > 1;
SELECT * FROM dup_files;
-- Creates UNIQUE index on PATH as well
ALTER TABLE dup_files ADD CONSTRAINT dup_files_path_pkey PRIMARY KEY (fn);
-- Now build your side-by-side table of duplicates:
SELECT p1.path, p2.path
FROM paths p1 INNER JOIN paths p2
ON (get_filename(p1.path) = get_filename(p2.path))
WHERE EXISTS(SELECT 1 FROM dup_files WHERE fn = get_filename(p1.path))
AND p1.path > p2.path
ORDER BY get_filename(p1.path), p1.path, p2.path;
ROLLBACK;
SELECT get_filename(path) AS fn, min(path) AS path, count(path) AS dupcount
INTO dups
FROM paths
GROUP BY get_filename(path)
HAVING count(path) > 1;
SELECT * FROM dups;
SELECT dups.fn, paths.path
FROM dups INNER JOIN paths ON (dups.fn = get_filename(paths.path));
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general