On Sat, Mar 26, 2022 at 8:34 AM Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote: > > > On Fri, Mar 25 2022, Neeraj Singh wrote: > > > On Fri, Mar 25, 2022 at 5:33 PM Ævar Arnfjörð Bjarmason > > <avarab@xxxxxxxxx> wrote: > >> > >> > >> On Fri, Mar 25 2022, Neeraj Singh wrote: > >> > >> > On Wed, Mar 23, 2022 at 7:46 AM Ævar Arnfjörð Bjarmason > >> > <avarab@xxxxxxxxx> wrote: > >> >> > >> >> > >> >> On Tue, Mar 15 2022, Neeraj Singh wrote: > >> >> > >> >> I know this is probably 80% my fault by egging you on about initially > >> >> adding the wildmatch() based thing you didn't go for. > >> >> > >> >> But having looked at this with fresh eyes quite deeply I really think > >> >> we're severely over-configuring things here: > >> >> > >> >> > +core.fsync:: > >> >> > + A comma-separated list of components of the repository that > >> >> > + should be hardened via the core.fsyncMethod when created or > >> >> > + modified. You can disable hardening of any component by > >> >> > + prefixing it with a '-'. Items that are not hardened may be > >> >> > + lost in the event of an unclean system shutdown. Unless you > >> >> > + have special requirements, it is recommended that you leave > >> >> > + this option empty or pick one of `committed`, `added`, > >> >> > + or `all`. > >> >> > ++ > >> >> > +When this configuration is encountered, the set of components starts with > >> >> > +the platform default value, disabled components are removed, and additional > >> >> > +components are added. `none` resets the state so that the platform default > >> >> > +is ignored. > >> >> > ++ > >> >> > +The empty string resets the fsync configuration to the platform > >> >> > +default. The default on most platforms is equivalent to > >> >> > +`core.fsync=committed,-loose-object`, which has good performance, > >> >> > +but risks losing recent work in the event of an unclean system shutdown. > >> >> > ++ > >> >> > +* `none` clears the set of fsynced components. > >> >> > +* `loose-object` hardens objects added to the repo in loose-object form. > >> >> > +* `pack` hardens objects added to the repo in packfile form. > >> >> > +* `pack-metadata` hardens packfile bitmaps and indexes. > >> >> > +* `commit-graph` hardens the commit graph file. > >> >> > +* `index` hardens the index when it is modified. > >> >> > +* `objects` is an aggregate option that is equivalent to > >> >> > + `loose-object,pack`. > >> >> > +* `derived-metadata` is an aggregate option that is equivalent to > >> >> > + `pack-metadata,commit-graph`. > >> >> > +* `committed` is an aggregate option that is currently equivalent to > >> >> > + `objects`. This mode sacrifices some performance to ensure that work > >> >> > + that is committed to the repository with `git commit` or similar commands > >> >> > + is hardened. > >> >> > +* `added` is an aggregate option that is currently equivalent to > >> >> > + `committed,index`. This mode sacrifices additional performance to > >> >> > + ensure that the results of commands like `git add` and similar operations > >> >> > + are hardened. > >> >> > +* `all` is an aggregate option that syncs all individual components above. > >> >> > + > >> >> > core.fsyncMethod:: > >> >> > A value indicating the strategy Git will use to harden repository data > >> >> > using fsync and related primitives. > >> >> > >> >> On top of my > >> >> https://lore.kernel.org/git/RFC-patch-v2-7.7-a5951366c6e-20220323T140753Z-avarab@xxxxxxxxx/ > >> >> which makes the tmp-objdir part of your not-in-next-just-seen follow-up > >> >> series configurable via "fsyncMethod.batch.quarantine" I really think we > >> >> should just go for something like the belwo patch (note that > >> >> misspelled/mistook "bulk" for "batch" in that linked-t patch, fixed > >> >> below. > >> >> > >> >> I.e. I think we should just do our default fsync() of everything, and > >> >> probably SOON make the fsync-ing of loose objects the default. Those who > >> >> care about performance will have "batch" (or "writeout-only"), which we > >> >> can have OS-specific detections for. > >> >> > >> >> But really, all of the rest of this is unduly boxing us into > >> >> overconfiguration that I think nobody really needs. > >> >> > >> > > >> > We've gone over this a few times already, but just wanted to state it > >> > again. The really detailed settings are really there for Git hosters > >> > like GitLab or GitHub. I'd be happy to remove the per-component > >> > core.fsync values from the documentation and leave just the ones we > >> > point the user to. > >> > >> I'm prettty sure (but Patrick knows more) that GitLab's plan for this is > >> to keep it at whatever the safest setting is, presumably GitHub's as > >> well (but I don't know at all on that front). > >> > >> >> If someone really needs this level of detail they can LD_PRELOAD > >> >> something to have fsync intercept fd's and paths, and act appropriately. > >> >> > >> >> Worse, as the RFC series I sent > >> >> (https://lore.kernel.org/git/RFC-cover-v2-0.7-00000000000-20220323T140753Z-avarab@xxxxxxxxx/) > >> >> shows we can and should "batch" up fsync() operations across these > >> >> configuration boundaries, which this level of configuration would seem > >> >> to preclude. > >> >> > >> >> Or, we'd need to explain why "core.fsync=loose-object" won't *actually* > >> >> call fsync() on a single loose object's fd under "batch" as I had to do > >> >> on top of this in > >> >> https://lore.kernel.org/git/RFC-patch-v2-6.7-c20301d7967-20220323T140753Z-avarab@xxxxxxxxx/ > >> >> > >> > > >> > 99.9% of users don't care and won't look. The ones who do look deeper > >> > and understand the issues have source code and access to this ML > >> > discussion to understand why this works this way. > >> > >> Exactly, so we can hopefully have a simpler interface. > >> > >> >> The same is going to apply for almost all of the rest of these > >> >> configuration categories. > >> >> > >> >> I.e. a natural follow-up to e.g. batching across objects & index as I'm > >> >> doing in > >> >> https://lore.kernel.org/git/RFC-patch-v2-4.7-61f4f3d7ef4-20220323T140753Z-avarab@xxxxxxxxx/ > >> >> is to do likewise for all the PACK-related stuff before we rename it > >> >> in-place. Or even have "git gc" issue only a single fsync() for all of > >> >> PACKs, their metadata files, commit-graph etc., and then rename() things > >> >> in-place as appropriate afterwards. > >> >> > >> >> diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt > >> >> index 365a12dc7ae..536238e209b 100644 > >> >> --- a/Documentation/config/core.txt > >> >> +++ b/Documentation/config/core.txt > >> >> @@ -548,49 +548,35 @@ core.whitespace:: > >> >> errors. The default tab width is 8. Allowed values are 1 to 63. > >> >> > >> >> core.fsync:: > >> >> - A comma-separated list of components of the repository that > >> >> - should be hardened via the core.fsyncMethod when created or > >> >> - modified. You can disable hardening of any component by > >> >> - prefixing it with a '-'. Items that are not hardened may be > >> >> - lost in the event of an unclean system shutdown. Unless you > >> >> - have special requirements, it is recommended that you leave > >> >> - this option empty or pick one of `committed`, `added`, > >> >> - or `all`. > >> >> -+ > >> >> -When this configuration is encountered, the set of components starts with > >> >> -the platform default value, disabled components are removed, and additional > >> >> -components are added. `none` resets the state so that the platform default > >> >> -is ignored. > >> >> -+ > >> >> -The empty string resets the fsync configuration to the platform > >> >> -default. The default on most platforms is equivalent to > >> >> -`core.fsync=committed,-loose-object`, which has good performance, > >> >> -but risks losing recent work in the event of an unclean system shutdown. > >> >> -+ > >> >> -* `none` clears the set of fsynced components. > >> >> -* `loose-object` hardens objects added to the repo in loose-object form. > >> >> -* `pack` hardens objects added to the repo in packfile form. > >> >> -* `pack-metadata` hardens packfile bitmaps and indexes. > >> >> -* `commit-graph` hardens the commit graph file. > >> >> -* `index` hardens the index when it is modified. > >> >> -* `objects` is an aggregate option that is equivalent to > >> >> - `loose-object,pack`. > >> >> -* `derived-metadata` is an aggregate option that is equivalent to > >> >> - `pack-metadata,commit-graph`. > >> >> -* `committed` is an aggregate option that is currently equivalent to > >> >> - `objects`. This mode sacrifices some performance to ensure that work > >> >> - that is committed to the repository with `git commit` or similar commands > >> >> - is hardened. > >> >> -* `added` is an aggregate option that is currently equivalent to > >> >> - `committed,index`. This mode sacrifices additional performance to > >> >> - ensure that the results of commands like `git add` and similar operations > >> >> - are hardened. > >> >> -* `all` is an aggregate option that syncs all individual components above. > >> >> + A boolen defaulting to `true`. To ensure data integrity git > >> >> + will fsync() its objects, index and refu updates etc. This can > >> >> + be set to `false` to disable `fsync()`-ing. > >> >> ++ > >> >> +Only set this to `false` if you know what you're doing, and are > >> >> +prepared to deal with data corruption. Valid use-cases include > >> >> +throwaway uses of repositories on ramdisks, one-off mass-imports > >> >> +followed by calling `sync(1)` etc. > >> >> ++ > >> >> +Note that the syncing of loose objects is currently excluded from > >> >> +`core.fsync=true`. To turn on all fsync-ing you'll need > >> >> +`core.fsync=true` and `core.fsyncObjectFiles=true`, but see > >> >> +`core.fsyncMethod=batch` below for a much faster alternative that's > >> >> +just as safe on various modern OS's. > >> >> ++ > >> >> +The default is in flux and may change in the future, in particular the > >> >> +equivalent of the already-deprecated `core.fsyncObjectFiles` setting > >> >> +might soon default to `true`, and `core.fsyncMethod`'s default of > >> >> +`fsync` might default to a setting deemed to be safe on the local OS, > >> >> +suc has `batch` or `writeout-only` > >> >> > >> >> core.fsyncMethod:: > >> >> A value indicating the strategy Git will use to harden repository data > >> >> using fsync and related primitives. > >> >> + > >> >> +Defaults to `fsync`, but as discussed for `core.fsync` above might > >> >> +change to use one of the values below taking advantage of > >> >> +platform-specific "faster `fsync()`". > >> >> ++ > >> >> * `fsync` uses the fsync() system call or platform equivalents. > >> >> * `writeout-only` issues pagecache writeback requests, but depending on the > >> >> filesystem and storage hardware, data added to the repository may not be > >> >> @@ -680,8 +666,8 @@ backed up by any standard (e.g. POSIX), but worked in practice on some > >> >> Linux setups. > >> >> + > >> >> Nowadays you should almost certainly want to use > >> >> -`core.fsync=loose-object` instead in combination with > >> >> -`core.fsyncMethod=bulk`, and possibly with > >> >> +`core.fsync=true` instead in combination with > >> >> +`core.fsyncMethod=batch`, and possibly with > >> >> `fsyncMethod.batch.quarantine=true`, see above. On modern OS's (Linux, > >> >> OSX, Windows) that gives you most of the performance benefit of > >> >> `core.fsyncObjectFiles=false` with all of the safety of the old > >> > > >> > I'm at the point where I don't want to endlessly revisit this discussion. > >> > >> Sorry, my intention isn't to frustrate you, but I do think it's > >> important to get this right. > >> > >> Particularly since this is now in "next", and we're getting closer to a > >> release. We can either talk about this now and decide on something, or > >> it'll be in a release, and then publicly documented promises will be > >> harder to back out of. > >> > >> I think your suggestion of just hiding the relevant documentation would > >> be a good band-aid solution to that. > >> > >> But I also think that given how I was altering this in my RFC series > >> that the premise of how this could be structured has been called into > >> question in a way that we didn't (or I don't recall) us having discussed > >> before. > >> > >> I.e. that we can say "sync loose, but not index", or "sync index, but > >> not loose" with this config schema. When with "bulk" we it really isn't > >> any more expensive to do both if one is true (even cheaper, actually). > >> > > > > I want to make a comment about the Index here. Syncing the index is > > strictly required for the "added" level of consistency, so that we > > don't lose stuff that leaves the work tree but is staged. But my > > Windows enlistment has an index that's 266MB, which would be painful > > to sync even with all the optimizations. Maybe with split-index, this > > wouldn't be so bad, but I just wanted to call out that some advanced > > users may really care about the configurability. > > So for that use-case you'd like to fsync the loose objects (if any), but > not the index? So the FS will "flush" up to the index, and then queue > the index for later syncing to platter? > > > But even in that case don't the settings need to be tied to one another, > because in the method=bulk sync=index && sync=!loose case wouldn't we be > syncing "loose" in any case? > > > As Git's various database implementations improve, the fsync stuff > > will hopefully be more optimal and self-tuning. But as that happens, > > Git could just start ignoring settings that lose meaning without tying > > anyones hands. > > Yeah that would alleviate most of my concerns here, but the docs aren't > saying anything like that. Since you added them & they just landed, do > you mind doing a small follow-up where we e.g. say that these new > settings are "EXPERIMENTAL" or whatever, and subject to drastic change? The doc is already pretty prescriptive. It has this line at the end of the first paragraph: "Unless you have special requirements, it is recommended that you leave this option empty or pick one of `committed`, `added`, or `all`." Those values are already designed to change as Git changes.