On Thu, Oct 31, 2024 at 03:56:40PM GMT, наб wrote:
> On Thu, Oct 31, 2024 at 09:58:19AM +0100, Karel Zak wrote:
> > On Mon, Oct 28, 2024 at 07:19:30PM GMT, наб wrote:
> > > --list-duplicates codifies what everyone keeps re-implementing with
> > > find -exec b2sum or src:perforate's finddup or whatever.
> > >
> > > hardlink already knows this, so make the data available thusly,
> > > in a format well-suited for pipeline processing
> > > (fixed-width key for uniq/cut/&c.,
> > > tab delimiter for cut &c.,
> > > -z for correct filename handling).
> >
> > Why do we need a 16-byte discriminator? The list consists of absolute
> > paths, so it should be unique enough. This seems like an unusual
> > thing,
>
> Well, the point is to have a list of lists of files, right.
> hardlink(1) finds, within the given domain,
> a set of sets of "these files are identical"
> (or, the logical set of "these are the link names of this file"
> for all eligible files).
> The only way to flatten this to a single-layer list is by having a
> list of filenames discriminated by the set to which they belong, so
>   [[a, b], [c, d, e]]
> discriminated as
>   0 a
>   0 b
>   1 c
>   1 d
>   1 e
> which allows you to reconstruct the sets live while stream-processing
> (the implementation uses a unique ASLR-randomised discriminator
> because the order isn't stable anyway, I think? but same difference).
>
> A list of just filenames is useless.

I see, thanks.

> On Thu, Oct 31, 2024 at 09:51:00AM +0100, Karel Zak wrote:
> > The new option should also be added to the "bash-completion/hardlink"
> > file. However, I can fix this after merging locally.
>
> I missed this. I'll include it in v2 if we get to v2, but if we don't,
> please do, thanks.

Merged and bash-completion updated.

    Karel

-- 
Karel Zak <kzak@xxxxxxxxxx>
http://karelzak.blogspot.com
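[Editor's note: the set reconstruction described above can be sketched in a few lines. This is not the hardlink implementation; it only assumes the record shape discussed in the thread, i.e. one "<fixed-width key>\t<filename>" record per line, with records of the same set adjacent and the key treated as opaque.]

```python
# Sketch: rebuild duplicate sets from discriminator-prefixed records
# in a single streaming pass. Assumes "<key>\t<filename>" lines as
# described in the mail; keys are opaque, only adjacent equality matters.
from itertools import groupby

def parse_duplicate_sets(lines):
    """Group consecutive records by their discriminator key."""
    records = (line.rstrip("\n").split("\t", 1) for line in lines)
    return [[path for _, path in group]
            for _, group in groupby(records, key=lambda rec: rec[0])]

# The [[a, b], [c, d, e]] example from the thread, flattened:
sample = [
    "0\ta\n",
    "0\tb\n",
    "1\tc\n",
    "1\td\n",
    "1\te\n",
]
print(parse_duplicate_sets(sample))  # → [['a', 'b'], ['c', 'd', 'e']]
```

Because grouping only compares neighbouring keys, it works even though the key values themselves (ASLR-randomised per the thread) carry no stable meaning across runs.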