Re: RFE: hardlink: support specifying max_size too?

 Hi Mikko,

On Tue, Apr 23, 2024 at 04:58:10PM +0300, Mikko Rantalainen wrote:
> I have huge directory hierarchies that I would like to run hardlink
> against, but comparing a lot of files against each other results in high
> RAM usage because so much of the file metadata is kept in memory.

Good point. I have tried to optimize the content comparison (using the
kernel crypto API), but the binary tree is still the original
implementation and there is probably room for further optimization.

Perhaps storing all 'struct stat' information for every file is
excessive, as there is information that we do not need (such as atime,
ctime, st_blksize, st_blocks). Some information is only necessary if
respect_{mode,owner,time,xattrs} are enabled.
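
For example, a trimmed-down per-file record could look roughly like this
(just a sketch; the names are invented and not the current hardlink.c
structures):

    /* Sketch: a reduced per-file record instead of a full 'struct stat'. */
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <time.h>

    struct file_meta {
            dev_t           dev;
            ino_t           ino;
            off_t           size;
            nlink_t         nlinks;
            /* the rest is only needed when the respect_* options are set */
            mode_t          mode;           /* respect_mode */
            uid_t           uid;            /* respect_owner */
            gid_t           gid;            /* respect_owner */
            struct timespec mtime;          /* respect_time */
    };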

The tree also contains paths for all the files. If you have many
subdirectories or long directory names, there is a lot of duplicate
data in the binary tree. One possible solution could be to keep
directory paths in a separate hash table and only store pointers to
the names table in the metadata tree.
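
A naive version of such an interning table (a linked list instead of a
real hash table, only to show the idea; all names are invented):

    #include <stdlib.h>
    #include <string.h>

    struct dir_entry {
            struct dir_entry *next;
            char name[];                    /* shared directory path */
    };

    static struct dir_entry *dirs;

    /* Return a pointer to a single shared copy of 'path'. */
    static const char *intern_dir(const char *path)
    {
            struct dir_entry *d;

            for (d = dirs; d; d = d->next)
                    if (strcmp(d->name, path) == 0)
                            return d->name;

            d = malloc(sizeof(*d) + strlen(path) + 1);
            if (!d)
                    return NULL;
            strcpy(d->name, path);
            d->next = dirs;
            dirs = d;
            return d->name;
    }

    /* The per-file entry then stores only a pointer plus its basename. */
    struct file_name {
            const char *dir;                /* points into the shared table */
            char basename[];
    };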

Another problem I see is that hardlink keeps the entire binary tree in
memory during the second stage, when it compares file contents in the
visitor() function. However, at this point we no longer need the tree
entries that are already unique and will never be used to compare file
contents.
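
One option could be a pruning pass between the two stages; a very rough
sketch with an invented 'bucket' type for a group of candidate files:

    #include <stdlib.h>

    struct bucket {
            struct bucket *next;
            size_t nfiles;          /* files sharing the same metadata class */
            /* ... file list, checksum cache, ... */
    };

    /* Drop buckets that contain a single file -- they can never produce
     * a new hardlink, so keeping them through the compare stage only
     * wastes memory. */
    static struct bucket *prune_unique(struct bucket *head)
    {
            struct bucket **pp = &head;

            while (*pp) {
                    struct bucket *b = *pp;

                    if (b->nfiles < 2) {
                            *pp = b->next;
                            free(b);
                    } else
                            pp = &b->next;
            }
            return head;
    }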

> Could you add a max_size (--maximum-size) option in addition to min_size
> (--minimum-size)? This would allow splitting the work into small
> fragments where hardlink only needs to process files in a given size
> range and can immediately ignore all other files. It could also be used
> to run the full linking in multiple parallel tasks, each with sensible
> RAM requirements, instead of one hardlink run without any size limits
> (e.g. one task for files up to 1MB, another for 1MB–10MB files and a
> third task for files bigger than 10MB).

This is not a trivial task. It would be better to begin by optimizing
memory usage before implementing more invasive changes.

I am unsure how you plan to compare all files if the metadata is
stored in multiple independent trees.
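
Just to make the idea concrete, the filter itself would presumably be one
more check next to the existing minimum-size test; an untested sketch
with invented names:

    #include <stdbool.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    struct size_limits {
            off_t min_size;
            off_t max_size;                 /* 0 = no upper limit */
    };

    /* Accept a file only when its size falls inside the requested range. */
    static bool size_in_range(const struct stat *st,
                              const struct size_limits *lim)
    {
            if (st->st_size < lim->min_size)
                    return false;
            if (lim->max_size && st->st_size > lim->max_size)
                    return false;
            return true;
    }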

> It might also make sense to reorder the file size test and the regex
> processing in inserter(), because checking the size is probably faster
> since the stat() has already been done. Currently, stats.files is also
> incremented for files that get ignored by the size filter, which may not
> be intentional.

Good point, send patch :-)
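
Roughly, I would expect the reordered check to look like this (untested,
names invented, not a patch):

    #include <regex.h>
    #include <stdbool.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    /* Cheap size test first, the regex match last; the caller would
     * increment stats.files only when this returns true. */
    static bool want_file(const char *path, const struct stat *st,
                          off_t min_size, const regex_t *exclude)
    {
            if (st->st_size < min_size)
                    return false;
            if (exclude && regexec(exclude, path, 0, NULL, 0) == 0)
                    return false;           /* matches an exclude pattern */
            return true;
    }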

> I think I could provide patches if I just know which Git repo I should
> use as the basis. Is https://github.com/util-linux/util-linux the
> correct one?

Yes, GitHub is the best repository. You can also use it for pull
requests and reviews.

My suggestion is to add debug messages to see where the problem is:
calculate the size of the metadata, the size of the paths, and the size
of the calculated data checksums. Please share the results.
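
For example, a few global byte counters updated while the tree is built
and printed at the end would already tell a lot (invented names, only an
illustration):

    #include <stdio.h>

    static size_t bytes_meta, bytes_paths, bytes_digests;

    /* Dump the accumulated sizes after the scan stage. */
    static void print_memory_stats(void)
    {
            fprintf(stderr, "metadata:  %zu bytes\n", bytes_meta);
            fprintf(stderr, "paths:     %zu bytes\n", bytes_paths);
            fprintf(stderr, "checksums: %zu bytes\n", bytes_digests);
    }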

    Karel

-- 
 Karel Zak  <kzak@xxxxxxxxxx>
 http://karelzak.blogspot.com




