Re: [PATCH v2 2/3] core.fsync: introduce granular fsync control

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Wed, 08 Dec 2021 11:05:30 +0100

On Tue, Dec 07 2021, Neeraj Singh wrote:

> On Tue, Dec 7, 2021 at 5:01 AM Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote:
>>
>>
>> On Tue, Dec 07 2021, Neeraj Singh via GitGitGadget wrote:
>>
>> > From: Neeraj Singh <neerajsi@xxxxxxxxxxxxx>
>> >
>> > This commit introduces the `core.fsync` configuration
>> > knob which can be used to control how components of the
>> > repository are made durable on disk.
>> >
>> > This setting allows future extensibility of the list of
>> > syncable components:
>> > * We issue a warning rather than an error for unrecognized
>> >   components, so new configs can be used with old Git versions.
>>
>> Looks good!
>>
>> > * We support negation, so users can choose one of the default
>> >   aggregate options and then remove components that they don't
>> >   want. The user would then harden any new components added in
>> >   a Git version update.
>>
>> I think this config schema makes sense, but just a (I think important)
>> comment on the "how" not "what" of it. It's really much better to define
>> config as:
>>
>>     [some]
>>     key = value
>>     key = value2
>>
>> Than:
>>
>>     [some]
>>     key = value,value2
>>
>> The reason is that "git config" has good support for working with
>> multi-valued stuff, so you can do e.g.:
>>
>>     git config --get-all -z some.key
>>
>> And you can easily (re)set such config e.g. with --replace-all etc., but
>> for comma-delimited you (and users) need to do all that work themselves.
>>
>> Similarly instead of:
>>
>>     some.key = want-this
>>     some.key = -not-this
>>     some.key = but-want-this
>>
>> I think it's better to just have two lists, one inclusive another
>> exclusive. E.g. see "log.decorate" and "log.excludeDecoration",
>> "transfer.hideRefs"
>>
>> Which would mean:
>>
>>     core.fsync = want-this
>>     core.fsyncExcludes = -not-this
>>
>> For some value of "fsyncExcludes", maybe "noFsync"? Anyway, just a
>> suggestion on making this easier for users & the implementation.
>>
>
> Maybe there's some way to handle this I'm unaware of, but a
> disadvantage of your multi-valued config proposal is that it's harder,
> for example, for a per-repo config store to reasonably override a
> per-user config store.  With the configuration scheme as-is, I can
> have a per-user setting like `core.fsync=all` which covers my typical
> repos, but then have a maintainer repo with a private setting of
> `core.fsync=none` to speed up cases where I'm mostly working with
> other people's changes that are backed up in email or server-side
> repos.  The latter setting conveniently overrides the former setting
> in all aspects.

Even if you turn just your comma-delimited proposal into a list proposal
can't we just say that the last one wins? Then it won't matter what cmae
before, you'd specify "core.fsync=none" in your local .git/config.

But this is also a general issue with a bunch of things in git's config
space. I'd rather see us use the list-based values and just come up with
some general way to reset them that works for all keys, rather than
regretting having comma-delimited values that'll be harder to work with
& parse, which will be a legacy wart if/when we come up with a way to
say "reset all previous settings".

> Also, with the core.fsync and core.fsyncExcludes, how would you spell
> "don't sync anything"? Would you still have the aggregate options.?

    core.fsyncExcludes = *

I.e. the same as the core.fsync=none above, anyway I still like the
wildcard thing below a bit more...

>> > This also supports the common request of doing absolutely no
>> > fysncing with the `core.fsync=none` value, which is expected
>> > to make the test suite faster.
>>
>> Let's just use the git_parse_maybe_bool() or git_parse_maybe_bool_text()
>> so we'll accept "false", "off", "no" like most other such config?
>
> Junio's previous feedback when discussing batch mode [1] was to offer
> less flexibility when parsing new values of these configuration
> options. I agree with his statement that "making machine-readable
> tokens be spelled in different ways is a 'disease'."  I'd like to
> leave this as-is so that the documentation can clearly state the exact
> set of allowable values.
>
> [1] https://lore.kernel.org/git/xmqqr1dqzyl7.fsf@gitster.g/

I think he's talking about batch, Batch, BATCH, bAtCh etc. there. But
the "maybe bool" is a stanard pattern we use.

I don't think we'd call one of these 0, off, no or false etc. to avoid
confusion, so then you can use git_parse_maybe_...()

>>
>> > Signed-off-by: Neeraj Singh <neerajsi@xxxxxxxxxxxxx>
>> > ---
>> >  Documentation/config/core.txt | 27 +++++++++----
>> >  builtin/fast-import.c         |  2 +-
>> >  builtin/index-pack.c          |  4 +-
>> >  builtin/pack-objects.c        |  8 ++--
>> >  bulk-checkin.c                |  5 ++-
>> >  cache.h                       | 39 +++++++++++++++++-
>> >  commit-graph.c                |  3 +-
>> >  config.c                      | 76 ++++++++++++++++++++++++++++++++++-
>> >  csum-file.c                   |  5 ++-
>> >  csum-file.h                   |  3 +-
>> >  environment.c                 |  1 +
>> >  midx.c                        |  3 +-
>> >  object-file.c                 |  3 +-
>> >  pack-bitmap-write.c           |  3 +-
>> >  pack-write.c                  | 13 +++---
>> >  read-cache.c                  |  2 +-
>> >  16 files changed, 164 insertions(+), 33 deletions(-)
>> >
>> > diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
>> > index dbb134f7136..4f1747ec871 100644
>> > --- a/Documentation/config/core.txt
>> > +++ b/Documentation/config/core.txt
>> > @@ -547,6 +547,25 @@ core.whitespace::
>> >    is relevant for `indent-with-non-tab` and when Git fixes `tab-in-indent`
>> >    errors. The default tab width is 8. Allowed values are 1 to 63.
>> >
>> > +core.fsync::
>> > +     A comma-separated list of parts of the repository which should be
>> > +     hardened via the core.fsyncMethod when created or modified. You can
>> > +     disable hardening of any component by prefixing it with a '-'. Later
>> > +     items take precedence over earlier ones in the list. For example,
>> > +     `core.fsync=all,-pack-metadata` means "harden everything except pack
>> > +     metadata." Items that are not hardened may be lost in the event of an
>> > +     unclean system shutdown.
>> > ++
>> > +* `none` disables fsync completely. This must be specified alone.
>> > +* `loose-object` hardens objects added to the repo in loose-object form.
>> > +* `pack` hardens objects added to the repo in packfile form.
>> > +* `pack-metadata` hardens packfile bitmaps and indexes.
>> > +* `commit-graph` hardens the commit graph file.
>> > +* `objects` is an aggregate option that includes `loose-objects`, `pack`,
>> > +  `pack-metadata`, and `commit-graph`.
>> > +* `default` is an aggregate option that is equivalent to `objects,-loose-object`
>> > +* `all` is an aggregate option that syncs all individual components above.
>> > +
>>
>> It's probably a *bit* more work to set up, but I wonder if this wouldn't
>> be simpler if we just said (and this is partially going against what I
>> noted above):
>>
>> == BEGIN DOC
>>
>> core.fsync is a multi-value config variable where each item is a
>> pathspec that'll get matched the same way as 'git-ls-files' et al.
>>
>> When we sync pretend that a path like .git/objects/de/adbeef... is
>> relative to the top-level of the git
>> directory. E.g. "objects/de/adbeaf.." or "objects/pack/...".
>>
>> You can then supply a list of wildcards and exclusions to configure
>> syncing.  or "false", "off" etc. to turn it off. These are synonymous
>> with:
>>
>>     ; same as "false"
>>     core.fsync = ":!*"
>>
>> Or:
>>
>>     ; same as "true"
>>     core.fsync = "*"
>>
>> Or, to selectively sync some things and not others:
>>
>>     ;; Sync objects, but not "info"
>>     core.fsync = ":!objects/info/**"
>>     core.fsync = "objects/**"
>>
>> See gitrepository-layout(5) for details about what sort of paths you
>> might be expected to match. Not all paths listed there will go through
>> this mechanism (e.g. currently objects do, but nothing to do with config
>> does).
>>
>> We can and will match this against "fake paths", e.g. when writing out
>> packs we may match against just the string "objects/pack", we're not
>> going to re-check if every packfile we're writing matches your globs,
>> ditto for loose objects. Be reasonable!
>>
>> This metharism is intended as a shorthand that provides some flexibility
>> when fsyncing, while not forcing git to come up with labels for all
>> paths the git dir, or to support crazyness like "objects/de/adbeef*"
>>
>> More paths may be added or removed in the future, and we make no
>> promises that we won't move things around, so if in doubt use
>> e.g. "true" or a wide pattern match like "objects/**". When in doubt
>> stick to the golden path of examples provided in this documentation.
>>
>> == END DOC
>>
>>
>> It's a tad more complex to set up, but I wonder if that isn't worth
>> it. It nicely gets around any current and future issues of deciding what
>> labels such as "loose-object" etc. to pick, as well as slotting into an
>> existing method of doing exclude/include lists.
>>
>
> I think this proposal is a lot of complexity to avoid coming up with a
> new name for syncable things as they are added to Git.  A path based
> mechanism makes it hard to document for the (advanced) user what the
> full set of things is and how it might change from release to release.
> I think the current core.fsync scheme is a bit easier to understand,
> query, and extend.

We document it in gitrepository-layout(5). Yeah it has some
disadvantages, but one advantage is that you could make the
composability easy. I.e. if last exclude wins then a setting of:

    core.fsync = ":!*"
    core.fsync = "objects/**"

Would reset all previous matches & only match objects/**.

>> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
>> > index 857be7826f3..916c55d6ce9 100644
>> > --- a/builtin/pack-objects.c
>> > +++ b/builtin/pack-objects.c
>> > @@ -1204,11 +1204,13 @@ static void write_pack_file(void)
>> >                * If so, rewrite it like in fast-import
>> >                */
>> >               if (pack_to_stdout) {
>> > -                     finalize_hashfile(f, hash, CSUM_HASH_IN_STREAM | CSUM_CLOSE);
>> > +                     finalize_hashfile(f, hash, FSYNC_COMPONENT_NONE,
>> > +                                       CSUM_HASH_IN_STREAM | CSUM_CLOSE);
>>
>> Not really related to this per-se, but since you're touching the API
>> everything goes through I wonder if callers should just always try to
>> fsync, and we can just catch EROFS and EINVAL in the wrapper if someone
>> tries to flush stdout, or catch the fd at that lower level.
>>
>> Or maybe there's a good reason for this...
>
> It's platform dependent, but I'd expect fsync would do something for
> pipes or stdout redirected to a file.  In these cases we really don't
> want to fsync since we have no idea what we're talking to and we're
> potentially worsening performance for probably no benefit.

Yeah maybe we should just leave it be.

I'd think the C library returning EINVAL would be a trivial performance
cost though.

It just seemed odd to hardcode assumptions about what can and can't be
synced when the POSIX defined function will also tell us that.

Anyway...

>> > [...]
>> > +/*
>> > + * These values are used to help identify parts of a repository to fsync.
>> > + * FSYNC_COMPONENT_NONE identifies data that will not be a persistent part of the
>> > + * repository and so shouldn't be fsynced.
>> > + */
>> > +enum fsync_component {
>> > +     FSYNC_COMPONENT_NONE                    = 0,
>>
>> I haven't read ahead much but in most other such cases we don't define
>> the "= 0", just start at 1<<0, then check the flags elsewhere...
>>
>> > +static const struct fsync_component_entry {
>> > +     const char *name;
>> > +     enum fsync_component component_bits;
>> > +} fsync_component_table[] = {
>> > +     { "loose-object", FSYNC_COMPONENT_LOOSE_OBJECT },
>> > +     { "pack", FSYNC_COMPONENT_PACK },
>> > +     { "pack-metadata", FSYNC_COMPONENT_PACK_METADATA },
>> > +     { "commit-graph", FSYNC_COMPONENT_COMMIT_GRAPH },
>> > +     { "objects", FSYNC_COMPONENTS_OBJECTS },
>> > +     { "default", FSYNC_COMPONENTS_DEFAULT },
>> > +     { "all", FSYNC_COMPONENTS_ALL },
>> > +};
>> > +
>> > +static enum fsync_component parse_fsync_components(const char *var, const char *string)
>> > +{
>> > +     enum fsync_component output = 0;
>> > +
>> > +     if (!strcmp(string, "none"))
>> > +             return output;
>> > +
>> > +     while (string) {
>> > +             int i;
>> > +             size_t len;
>> > +             const char *ep;
>> > +             int negated = 0;
>> > +             int found = 0;
>> > +
>> > +             string = string + strspn(string, ", \t\n\r");
>>
>> Aside from the "use a list" isn't this hardcoding some windows-specific
>> assumptions with \n\r? Maybe not...
>
> I shamelessly stole this code from parse_whitespace_rule. I thought
> about making a helper to be called by both functions, but the amount
> of state going into and out of the wrapper via arguments was
> substantial and seemed to negate the benefit of deduplication.

FWIW string_list_split() is easier to work with in those cases, or at
least I think so...