Re: [PATCH 2/4] hardlink: add --list-duplicates and --zero

On Thu, Oct 31, 2024 at 09:58:19AM +0100, Karel Zak wrote:
> On Mon, Oct 28, 2024 at 07:19:30PM GMT, наб wrote:
> > --list-duplicates codifies what everyone keeps re-implementing with
> > find -exec b2sum or src:perforate's finddup or whatever.
> > 
> > hardlink already knows this, so make the data available thusly,
> > in a format well-suited for pipeline processing
> > (fixed-width key for uniq/cut/&c.,
> >  tab delimiter for cut &a.,
> >  -z for correct filename handling).
> 
> Why do we need a 16-byte discriminator? The list consists of absolute
> paths, so it should be unique enough. This seems like an unusual
> thing,
Well, the point is to have a list of lists of files, right.
hardlink(1) finds, within the given domain,
a set of sets of "these files are identical"
(or, the logical set of "these are the link names of this file"
 for all eligible files).
The only way to flatten this to a single-layer list is by having a
list of filenames discriminated by the set in which they belong, so
  [[a, b], [c, d, e]]
discriminated as
  0 a
  0 b
  1 c
  1 d
  1 e
which allows you to reconstruct the sets live while stream-processing
(the implementation uses a unique ASLR-randomised discriminator
 because the order isn't stable anyway I think? but same difference).

A list of just filenames is useless.
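
To illustrate the "reconstruct the sets live" part, here is a minimal sketch,
assuming newline-terminated "KEY<TAB>name" records in which all members of a
set are adjacent. The function name group_sets is made up for this example,
and the record shape follows the description above rather than any confirmed
hardlink output.

```shell
# group_sets (hypothetical helper): read "KEY<TAB>name" records on stdin,
# where records of the same set are adjacent, and print one set per line
# with the members joined by " | ".
group_sets() {
    awk '
        {
            tab  = index($0, "\t")        # split at the first tab only,
            key  = substr($0, 1, tab - 1) # so names may contain tabs
            name = substr($0, tab + 1)
            if (NR > 1 && key == prev)    # same key: still in the same set
                line = line " | " name
            else {                        # key changed: flush the old set
                if (NR > 1) print line
                line = name
            }
            prev = key
        }
        END { if (NR > 0) print line }    # flush the last set
    '
}

printf '0\ta\n0\tb\n1\tc\n1\td\n1\te\n' | group_sets
# a | b
# c | d | e
```

Only one set is held at a time, so this works on a stream. The keys are
compared as strings, so it doesn't matter what the discriminator looks like
as long as it differs between adjacent sets.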

> as I cannot recall any other tool that uses something like
> this.
This is what the b2sum/sha1sum/&c. family does.
(And, in a worse and less structured manner, sum/cksum.)
If you were to implement this with one of those,
you'd do something like
  find -type f -exec b2sum {} + | sort | uniq -Dw128
which works but has other issues
(not tab-delimited, slow, harder than necessary to configure,
 actually you want to sprinkle -z everywhere, &c.).
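
For reference, a sketch of the same pipeline with -z sprinkled everywhere,
assuming GNU find, sort, uniq and a coreutils b2sum that supports -z. The
find_dups wrapper is a name made up for this example.

```shell
# find_dups DIR (hypothetical wrapper): print NUL-terminated
# "HASH  NAME" records for every regular file under DIR whose
# BLAKE2b hash occurs more than once.
find_dups() {
    find "$1" -type f -exec b2sum -z {} + |  # -z: NUL-terminate, no name escaping
        LC_ALL=C sort -z |                   # bring identical hashes together
        uniq -z -D -w128                     # keep records sharing the 128-char hash
}

# Usage, turning NULs into newlines for display only:
#   find_dups . | tr '\0' '\n'
```

This still hashes every file and is still space-delimited, so it fixes the
filename handling and none of the other issues.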

There's no other commonly-accepted program that does this,
I want to say it's because (a) hardlink is The Util-Linux Implementation
which doesn't necessarily exclude others, but certainly discourages them,
(b) hardlink doesn't tell you, so (c) if you're querying something
in a way that hardlink doesn't support,
you're doing it ad-hoc with whatever you think of,
and you're wondering why hardlink won't just tell you.

Debian has, in src:perforate, finddup, which implements this.
It's very much 1996 (it reads the whole file into memory, in Perl,
before uniquifying by MD5(!)), and the output format is
  84 './build-output/dsh-0.25.10.obsolete.1730308699.8166876/debian/watch' './build-output/dsh-0.25.10.obsolete.1730308753.583969/debian/watch'
  84 './build-output/dsh-0.25.10.obsolete.1730306971.4697168/debian/watch' './build-output/dsh-0.25.10.obsolete.1730306296.9378986/debian/watch' './build-output/dsh-0.25.10.obsolete.1730306808.9797611/debian/watch'
which is not in any way useful (the prefix is the size).

The discriminated list, by contrast, lets you process the equivalence sets separately
(I hope to replace this monstrosity I run commonly:
   find -exec b2sum {} + | sort | mawk '{h = substr($0, 1, 128); fn = substr($0, 1 + 128 + 2);  if(h == hash) {tgt = "." fname; split(fn, curs, "/"); if(curs[2] == fnames[2]) tgt = fnames[3];  print "[ -s \"" fn "\" ] && ln -sf -- \"" tgt "\" \"" fn "\""} else {hash = h; fname = fn; split(fname, fnames, "/")}}'  | sh
 with something hardlink -l-based.
 Actually this would want hardlink -L ideally;
 would you accept a patch that adds -L?).

On Thu, Oct 31, 2024 at 09:51:00AM +0100, Karel Zak wrote:
> The new option should also be added to the "bash-completion/hardlink"
> file. However, I can fix this after merging locally.
I missed this. I'll include it in v2 if we get to v2 but if we don't,
please do, thanks.
