Re: [PATCH v3 3/9] finalize_object_file(): implement collision check

Jeff King <peff@xxxxxxxx> · Tue, 10 Sep 2024 02:53:33 -0400

On Fri, Sep 06, 2024 at 02:44:12PM -0700, Junio C Hamano wrote:

> > +static int check_collision(const char *filename_a, const char *filename_b)
> [...]
> 
> Two and two half comments on this function.
> 
>  * We compare 4k at a time here, while copy.c copies 8k at a time,
>    and bulk-checkin.c uses 16k at a time.  Outside the scope of this
>    topic, we probably should pick one number and stick to it, unless
>    we have measured to pick perfect number for each case (and I know
>    I picked 8k for copy.c and 16k for bulk-checkin.c both out of
>    thin air).

I can confirm that 4k was picked out of thin air by me while writing
this function. ;) It's my usual size for I/O just because it often
matches the page size. I wouldn't be surprised if other values are
faster, but I hope that this code would run pretty infrequently.

So I wouldn't worry too much about it, but if there's an obviously
better I/O size and we have a #define for it, it would make sense to use
it. I do wonder if we'd run into stack limitations at some point.

As an alternative, we could also mmap the files and then compare the
full arrays directly. That might be faster?

>  * I would have expected at least we would fstat() them to declare
>    difference immediately after we find their sizes differ, for
>    example.  As we assume that calling into this function should be
>    rare, we prefer not to pay in complexity for performance here?

I actually had the opposite thought about fstat() performance while
writing it.  In a world where you expect most name collisions to
actually have different content, then checking their sizes lets you skip
the more expensive byte-wise check. But in a world where you mostly
expect them _not_ to differ, then the size check usually does not help,
and you waste time doing it. I.e., it is a cache-miss problem with
weighted costs.

Now thinking on it more, that view is probably dumb. fstat() is really
cheap, and byte-wise comparisons are really expensive, so if it has even
a tiny chance of helping, it might be worth doing.

Though again, I'd hope this will trigger pretty rarely in practice,
because it's probably a sign that we could have skipped work earlier
(i.e., by realizing we were just going to generate an identical file,
and not generated it in the first place).

So I'd be content with just about any implementation, and waiting to see
if it ever becomes a performance pain point.

>  * We use read_in_full() and assume that a short-read return from
>    the function happens only at the end of file due to EOF, which is
>    another reason why we can do away without fstat() on these files.
> 
>  * An error causes the caller to assume collision (because we assign
>    the return value of error() to ret), which should do the same
>    action as an actual collision to abort and keep the problematic
>    file for forensics.

Yup, both of those were very intentional. :)

-Peff