Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration

On Thu, Jul 15 2021, Son Luong Ngoc wrote:

> Hi folks,
>
> On Wed, Jul 14, 2021 at 10:03 PM Ævar Arnfjörð Bjarmason
> <avarab@xxxxxxxxx> wrote:
>>
>> *nod*
>>
>> FWIW, at an ex-job I helped systems administrators who'd produced such
>> a broken backup-via-rsync create a hybrid version as an interim
>> solution. I.e. it would sync the objects via git transport, and do an
>> rsync on a whitelist (or blacklist), so it would pick up config, but
>> exclude objects.
>>
>> "Hybrid" because it was in a state of needing to deal with manual
>> tweaking of config.
>>
>> But usually someone who's needing to thoroughly solve this backup
>> problem will inevitably end up with wanting to drive everything that's
>> not in the object or refstore from some external system, i.e. have
>> config be generated from puppet, a database etc., ditto for alternates
>> etc.
>>
>> But even if you can't get to that point (or don't want to) I'd say aim
>> for the hybrid system.
>
> FWIW, we are running our repo on top of a somewhat flaky DRBD setup,
> and we decided to use both
>
>   `git clone --upload-pack 'git -c transfer.hiderefs="!refs" upload-pack' --mirror`
>
> and
>
>   `tar`
>
> to create 2 separate snapshots for backup in parallel (full backup,
> not incremental).
>
> In case of (manual) recovery, we first rely on the git snapshot, and
> if there are any missing objects/refs, we try to get them from the
> tarball.

That sounds good, and similar to what I described with that "hybrid"
setup.
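
For illustration, a minimal sketch of what such a hybrid sync could
look like (the paths, remote setup and exclude list here are
hypothetical and would need adjusting):

  # Objects and refs via git's pack transport (assumes the backup repo
  # was created with "git clone --mirror" of the source repository):
  git --git-dir=/backup/repo.git fetch --prune origin '+refs/*:refs/*'

  # Everything else (config, hooks etc.) via rsync, excluding what the
  # fetch above already keeps consistent:
  rsync -a --delete \
      --exclude=objects/ --exclude=refs/ --exclude=packed-refs \
      /srv/git/repo.git/ /backup/repo.git/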

>>
>> This isn't some purely theoretical concern, BTW: the system using
>> rsync like this was producing repos that wouldn't fsck all the time,
>> and it wasn't such a busy site.
>>
>> I suspect (but haven't tried) that for someone who can't easily change
>> their backup solution they'd get most of the benefits of git-native
>> transport by having their "rsync" sync refs, then objects, not the other
>> way around. Glob order dictates that most backup systems will do
>> objects, then refs (which will of course, at that point, refer to
>> nonexistent objects).
>>
>> It's still not safe, you'll still be subject to races, but probably a
>> lot better in practice.
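
Concretely (paths hypothetical, and again: still racy, e.g. against a
concurrent gc), that ordering just means something like:

  # Refs first, so they can't end up newer than the objects:
  rsync -a /srv/git/repo.git/packed-refs /backup/repo.git/packed-refs
  rsync -a /srv/git/repo.git/refs/ /backup/repo.git/refs/
  # Objects last:
  rsync -a /srv/git/repo.git/objects/ /backup/repo.git/objects/
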
>
> I would love to get some guidance in the official documentation on
> best practices for handling git data on the server side.
>
> Is git-clone + git-bundle the go-to solution?
> Should tar/rsync be avoided entirely, or is there a trade-off?

I should have tempered some of those comments; it's perfectly fine in
general to use tar+rsync for "backing up" git repositories in certain
contexts. E.g. when I switch laptops or whatever, it's what I do to
grab the data.

The problem is when the data isn't at rest, i.e. in the context of an
active server.

There you start moving along a spectrum that goes from "sure, it's
fine" to "this is such a bad idea that nobody should pursue it".

If you're running a setup where you're starting to submit patches to
git.git, you're probably at the far end of that spectrum.

Whether it's clone, push, fetch, bundle etc. doesn't really matter; the
important part is that you're using git's pack transport mechanism to
ferry updates around, which gives you guarantees rsync+tar can't,
particularly in the face of concurrently updated data.
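
As one illustrative example (names and paths hypothetical), a
point-in-time snapshot via git-bundle gives you a self-consistent
backup, since the refs and the pack are produced together by one
traversal:

  bundle=/backup/repo-$(date +%F).bundle
  # Snapshot all refs and their objects into a single file:
  git -C /srv/git/repo.git bundle create "$bundle" --all
  # Later: sanity-check the file and restore from it:
  git bundle verify "$bundle"
  git clone --mirror "$bundle" /restore/repo.git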




