Re: [PATCH v4 2/4] core.fsync: introduce granular fsync control

Patrick Steinhardt <ps@xxxxxx> · Mon, 14 Feb 2022 08:04:58 +0100

On Fri, Feb 11, 2022 at 03:15:20PM -0800, Junio C Hamano wrote:
> Neeraj Singh <nksingh85@xxxxxxxxx> writes:
> 
> > In practice, almost all users have core.fsyncObjectFiles set to the
> > platform default, which is 'false' everywhere besides Windows.  So at
> > minimum, we have to take default to mean that we maintain behavior no
> > weaker than the current version of Git, otherwise users will start
> > losing their data.
> 
> Would they?   If they think platform default is performant and safe
> enough for their use, as long as our adjustment is out outrageously
> more dangerous or less performant, I do not think "no weaker than"
> is a strict requirement.  If we were overly conservative in some
> areas than the "platform default", making it less conservative in
> those areas to match the looseness of other areas should be OK and
> vice versa.
> 
> > One path to get to your suggestion from the current patch series would
> > be to remove the component-specific options and only provide aggregate
> > options.  Alternatively, we could just not document the
> > component-specific options and leave them available to be people who
> > read source code. So if I rename the aggregate options in terms of
> > 'levels of durability', and only document those, would that be
> > acceptable?
> 
> In any case, if others who reviewed the series in the past are happy
> with the "two knobs" approach and are willing to jump in to help new
> users who will be confused with one knob too many, I actually am OK
> with the series that I called "overly complex".  Let me let them
> weigh in before I can answer that question.
> 
> Thanks.

I wonder whether it makes sense to distinguish client- and server-side
requirements. While we probably want to "do the right thing" on the
client-side by default so that we don't have to teach users that "We may
use data in some cases on some systems, but not in other cases on other
systems by default." On the server-side folks who implement the Git
hosting are typically a lot more knowledgeable with regards to how Git
behaves, and it can be expected of them to dig a lot deeper than we can
and should reasonably expect from a user.

One point I really care about and that I think is extremely important in
the context of both client and server is data consistency. While it is
bad to lose data, the reality is that it can simply happen when systems
hit exceptional cases like a hard-reset. But what we really must not let
happen is that as a result, a repository gets corrupted because of this
hard reset. In the context of client-side repositories a user will be
left to wonder how to fix the repository, while on the server-side we
need to go in and somehow repair the repository fast or otherwise users
will come to us and complain their repository stopped working.

Our current defaults can end up with repository corruption though. We
don't sync object files before renaming them into place by default, and
furthermore we don't even give a knob to fsync loose references before
renaming them. On gitlab.com we enable "core.fsyncObjectFiles" and thus
to the best of my knowledge never hit a corrupted ODB until now. But
what we do regularly hit is corrupted references because there is no
knob to fsync them.

Furthermore, I really think that the advice in git-config(1) with
regards to loose syncing loose objects that "This is a total waste of
time and effort" is wrong. Filesystem developers have repeatedly stated
that for proper atomicity guarantees, a file must be synced to disk
before renaming it into place [1][2][3]. So from the perspective of the
people who write Linux-based filesystems our application is "broken"
right now, even though there are mechanisms in place which try to work
around our brokenness by using heuristics to detect what we're doing.

To summarize my take: while the degree of durability may be something
that's up for discussions, I think that the current defaults for
atomicity are bad for users because they can and do lead to repository
corruption.

Patrick

[1]: https://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/
[2]: https://btrfs.wiki.kernel.org/index.php/FAQ (What are the crash guarantees of overwrite-by-rename)
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/admin-guide/ext4.rst (see auto_da_alloc)
Attachment:
signature.asc

Description: PGP signature