Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration

On Wed, Jul 14 2021, Taylor Blau wrote:

> On Wed, Jul 14, 2021 at 08:19:15PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> >> The reason we want to avoid freshening the mtime of the ".pack" file
>> >> is to improve the read speed of Git servers.
>> >
>> > That's surprising behavior to me. Are you saying that calling utime(2)
>> > causes the *page* cache to be invalidated and that most reads are
>> > cache-misses lowering overall IOPS?
>>
>> I think you may be right narrowly, but wrong in this context :)
>>
>> I.e. my understanding of this problem is that they have some incremental
>> backup job, e.g. rsync without --checksum (not that using --checksum
>> would help; chicken & egg issue)...
>
> Ah, thanks for explaining. That's helpful, and changes my thinking.
>
> Ideally, Sun would be able to use --checksum (if they are using rsync)
> or some equivalent (if they are not). In other words, this seems like a
> problem that Git shouldn't be bending over backwards for.
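
(Quick aside on mechanics, since "freshening" sounds grander than it is:
it's nothing more than an mtime bump. Modulo error handling and naming,
git's freshen_file() boils down to a sketch like this:)

    #include <utime.h>

    /*
     * Roughly what git does to "freshen" a file: utime(2) with a
     * NULL times argument sets both atime and mtime to "now",
     * without touching the file's contents.
     */
    static int freshen_file(const char *fn)
    {
            return !utime(fn, NULL);
    }

The pack data itself never changes; only the inode timestamps do. Which
is why I think the page cache is a red herring here, and the backup job
is the real problem.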

Even with my strong opinions about rsync being bad for this use-case, in
practice it does work for a lot of people, e.g. with nightly backup jobs
and the like. Not everyone's repository is insanely busy, and the insanely
busy ones are where I've mainly seen this sort of corruption.

For those users, making people use --checksum is borderline inhumane :)
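
(To spell out why: without --checksum rsync's default "quick check" only
skips a file when its size and mtime both match, so a bare mtime bump on
a multi-gigabyte pack means re-transferring the whole thing. In
caricature, assuming plain stat(2) semantics:)

    #include <stdbool.h>
    #include <sys/stat.h>

    /*
     * Caricature of rsync's default "quick check": a file is assumed
     * unchanged only if size and mtime both match. A freshened pack
     * has identical contents but a new mtime, so it fails this test
     * and gets re-sent in full.
     */
    static bool quick_check_unchanged(const struct stat *src,
                                      const struct stat *dst)
    {
            return src->st_size == dst->st_size &&
                   src->st_mtime == dst->st_mtime;
    }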

> But if that isn't possible, then I find introducing a new file to
> redefine the pack's mtime just to accommodate a backup system that
> doesn't know better to be a poor justification for adding this
> complexity. Especially since we agree that rsync-ing live Git
> repositories is a bad idea in the first place ;).
>
> If it were me, I would probably stop here and avoid pursuing this
> further. But an OK middle ground might be core.freshenPackfiles=<bool>
> to indicate whether or not packs can be freshened, or the objects
> contained within them should just be rewritten loose.
>
> Sun could then set this configuration to "false", implying:
>
>   - That they would have more random loose objects, leading to some
>     redundant work by their backup system.
>   - But they wouldn't have to resync their huge packfiles.
>
> ...and we wouldn't have to introduce any new formats/file types to do
> it. To me, that seems like a net-positive outcome.
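
To make that concrete, I'd imagine the knob simply gating the freshen
path so the write falls through to a loose object. A sketch, with the
caveat that core.freshenPackfiles is just Taylor's proposed name above
and none of these helpers exist in git as written:

    #include <utime.h>

    static int core_freshen_packfiles = 1; /* today's behavior */

    /* stub standing in for git actually writing the object loose */
    static void write_loose_object(void) { }

    static int freshen_pack(const char *pack_path)
    {
            if (!core_freshen_packfiles)
                    return 0; /* report "couldn't freshen" ... */
            return !utime(pack_path, NULL); /* the mtime bump */
    }

    static void write_object_unless_freshened(const char *pack_path)
    {
            /* ... so we write a (possibly duplicate) loose object */
            if (!freshen_pack(pack_path))
                    write_loose_object();
    }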

This approach is getting quite close to my core.checkCollisions patch,
to the point of perhaps being indistinguishable in practice:
https://lore.kernel.org/git/20181028225023.26427-5-avarab@xxxxxxxxx/

I.e. if you're happy to write out duplicate objects then you're
effectively ignoring the collision check and don't need to do it. The two
aren't identical: with freshening you might skip writing objects you know
are reachable, and with the collision check off and not-so-thin packs you
might get more redundancy than you asked for.
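
In caricature the convergence is that both knobs hang off the same "do we
already have this object?" lookup; all names here are made up, but this
is the gist as I understand it:

    #include <stdbool.h>

    /* Stubs for the real lookups/writes in git's object write path. */
    static bool have_object_already(const char *oid) { (void)oid; return true; }
    static void compare_with_existing(const char *oid) { (void)oid; }
    static void freshen_existing_copy(const char *oid) { (void)oid; }
    static void write_loose_object(const char *oid) { (void)oid; }

    static bool check_existing = true;

    static void write_object(const char *oid)
    {
            if (check_existing && have_object_already(oid)) {
                    compare_with_existing(oid); /* the collision check */
                    freshen_existing_copy(oid); /* the mtime bump */
                    return;
            }
            /* no lookup: no collision check, and duplicates get written */
            write_loose_object(oid);
    }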

But in practice, with modern clients mostly or entirely sending you just
the objects you need in the common case, it might be close enough.

I mean, it addresses the expiry race in the sense that an
unreachable-becomes-reachable-again race would be "solved" by just
re-writing that data. Hrm, but there are probably aspects of that race I'm
not considering.

Anyway, in the sense that for a lot of backup systems syncing new files
is a lot cheaper than re-syncing rewritten ones, it might get you what you
want...



