Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Tue, 20 Jul 2021 08:32:35 +0200

On Wed, Jul 14 2021, Martin Fick wrote:

> On Wednesday, July 14, 2021 9:41:42 PM MDT you wrote:
>> On Wed, Jul 14 2021, Martin Fick wrote:
>> > On Wednesday, July 14, 2021 8:19:15 PM MDT Ævar Arnfjörð Bjarmason wrote:
>> >> The best way to get backups of git repositories you know are correct
>> >> is to use git's own transport mechanisms, i.e. fetch/pull the data, or
>> >> create bundles from it.
>> > 
>> > I don't think this is a fair recommendation since unfortunately, this
>> > cannot be used to create a full backup. This can be used to back up the
>> > version controlled data, but not the repository's meta-data, i.e.
>> > configs, reflogs, alternate setups...
>> 
>> *nod*
>> 
>> FWIW at an ex-job I helped systems administrators who'd produced such a
>> broken backup-via-rsync create a hybrid version as an interim
>> solution. I.e. it would sync the objects via git transport, and do an
>> rsync on a whitelist (or blacklist), so pick up config, but exclude
>> objects.
>> 
>> "Hybrid" because it was in a state of needing to deal with manual
>> tweaking of config.
>> 
>> But usually someone who's needing to thoroughly solve this backup
>> problem will inevitably end up with wanting to drive everything that's
>> not in the object or refstore from some external system, i.e. have
>> config be generated from puppet, a database etc., ditto for alternates
>> etc.
>> 
>> But even if you can't get to that point (or don't want to) I'd say aim
>> for the hybrid system.
>> 
>> This isn't some purely theoretical concern b.t.w., the system using
>> rsync like this was producing repos that wouldn't fsck all the time, and
>> it wasn't such a busy site.
>> 
>> I suspect (but haven't tried) that for someone who can't easily change
>> their backup solution they'd get most of the benefits of git-native
>> transport by having their "rsync" sync refs, then objects, not the other
>> way around. Glob order dictates that most backup systems will do
>> objects, then refs (which will of course, at that point, refer to
>> nonexisting objects).
>> 
>> It's still not safe, you'll still be subject to races, but probably a
>> lot better in practice.
>
> It would be great if git provided a command to do a reliable incremental 
> backup, maybe it could copy things in the order that you mention?

I don't think we can or want to support this sort of thing ever, for the
same reason that you probably won't convince MySQL, PostgreSQL etc. that
they should support "cp -r" as a mode for backing up their live database
services.

I mean, there is the topic of git being lazy about fsync() etc, but even
if all of that were 100% solved you'd still get bad things if you picked
an arbitrary time to snapshot a running git directory, e.g. your
"master" branch might have a "master.lock" because it was in the middle
of an update.

If you used "fetch/clone/bundle" etc. to get the data that's no problem,
but if your snapshot happens at that moment you'd need to clean up the
leftover lock manually. On a live repository that state wouldn't
persist, but a snapshot would preserve it indefinitely.
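
To make the lock-file problem concrete, here is a minimal sketch of a
check one could run against a snapshot copy of a git directory. The
function name find_lock_files is hypothetical, and the only assumption
about git is that in-progress updates leave "*.lock" files behind (such
as the "master.lock" above). It only reports the files; deciding what
to do with them is left to the operator.

```python
import os


def find_lock_files(git_dir):
    """Return sorted paths (relative to git_dir) of all *.lock files.

    In a snapshot no git process is running, so any lock file found
    here was captured mid-update and will not go away on its own.
    """
    found = []
    for root, _dirs, files in os.walk(git_dir):
        for name in files:
            if name.endswith(".lock"):
                full = os.path.join(root, name)
                found.append(os.path.relpath(full, git_dir))
    return sorted(found)
```

An empty result says nothing about whether the snapshot is otherwise
consistent; it only rules out this one symptom.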

> However, most people will want to use the backup system they have and not a 
> special git tool. Maybe git fsck should gain a switch that would rewind any 
> refs to an older point that is not broken (using reflogs)? That way, most 
> backups would just work and be rewound to the point at which the backup 
> started?

I think the main problem in the wild is not the inability to use a
special tool, but one of education. Most people wouldn't think of "cp
-r" as a first approach to, say, backing up a live mysql server; they'd
use mysqldump and the like.

But for some reason git is considered "not a database" enough that those
same people would just use rsync/tar/whatever, and are then surprised
when their data is corrupt or in some weird or inconsistent state...
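
As a rough analogue of the mysqldump approach, a backup taken through
git itself could be sketched like this. "git bundle create" and "git
bundle verify" are real git commands; the function names and the paths
passed in are placeholders. Pass an absolute bundle path, since "-C"
makes git resolve relative paths against the repository.

```python
import subprocess


def bundle_commands(repo, bundle_path):
    """Build the git invocations for a bundle-based backup of `repo`."""
    return [
        # Write every ref and the objects reachable from them to one file.
        ["git", "-C", repo, "bundle", "create", bundle_path, "--all"],
        # Check that the resulting bundle is valid.
        ["git", "-C", repo, "bundle", "verify", bundle_path],
    ]


def backup_with_bundle(repo, bundle_path):
    """Run the commands, raising CalledProcessError if one fails."""
    for argv in bundle_commands(repo, bundle_path):
        subprocess.run(argv, check=True)
```

As noted earlier in the thread, this covers the version controlled data
only, not config, reflogs or alternates.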

Anyway, see also my just-posted:
https://lore.kernel.org/git/878s21wl4z.fsf@xxxxxxxxxxxxxxxxxxx/

I.e. I'm not saying "never use rsync", there are cases where that's
fine, but for a live "real" server I'd say solutions in that class
shouldn't be considered, and should be actively migrated away from.
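
For reference, the "hybrid" setup quoted above (objects via git
transport, everything else copied from an allowlist) might look roughly
like the sketch below. This is not the system described in the thread:
the allowlist contents, the function names and the directory layout are
all assumptions for illustration. It needs Python 3.8 or later for
dirs_exist_ok.

```python
import os
import shutil
import subprocess

# Hypothetical allowlist of non-object files worth keeping.
ALLOWLIST = ("config", "description", "hooks")


def copy_allowlisted(src_git_dir, dst_dir, allowlist=ALLOWLIST):
    """Copy allowlisted paths, skipping missing ones; return what was copied."""
    copied = []
    for rel in allowlist:
        src = os.path.join(src_git_dir, rel)
        dst = os.path.join(dst_dir, rel)
        if os.path.isdir(src):
            shutil.copytree(src, dst, dirs_exist_ok=True)
        elif os.path.isfile(src):
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)
        else:
            continue
        copied.append(rel)
    return copied


def hybrid_backup(src_git_dir, mirror_dir, meta_dir):
    """Sync objects and refs with git, then copy the allowlisted files."""
    if os.path.isdir(mirror_dir):
        # Existing mirror: update it over git transport.
        subprocess.run(
            ["git", "-C", mirror_dir, "fetch", "--prune", "origin"],
            check=True,
        )
    else:
        # First run: create the mirror.
        subprocess.run(
            ["git", "clone", "--mirror", src_git_dir, mirror_dir],
            check=True,
        )
    return copy_allowlisted(src_git_dir, meta_dir)
```

The allowlisted files are still copied from a live directory, so a file
caught mid-write remains possible; the point is only that objects and
refs no longer travel that way.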