On Tue, Nov 16, 2021 at 12:10 AM Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote:
>
>
> On Mon, Nov 15 2021, Neeraj K. Singh via GitGitGadget wrote:
>
> > * Per [2], I'm leaving the fsyncObjectFiles configuration as is with
> >   'true', 'false', and 'batch'. This makes using old and new versions of
> >   git with 'batch' mode a little trickier, but hopefully people will
> >   generally be moving forward in versions.
> >
> > [1] See
> > https://lore.kernel.org/git/pull.1067.git.1635287730.gitgitgadget@xxxxxxxxx/
> > [2] https://lore.kernel.org/git/xmqqh7cimuxt.fsf@gitster.g/
>
> I really think leaving that in-place is just being unnecessarily
> cavalier. There's a lot of mixed-version environments where git is
> deployed in, and we almost never break the configuration in this way (I
> think in the past always by mistake).
>
> In this case it's easy to avoid it, and coming up with a less narrow
> config model[1] seems like a good idea in any case to unify the various
> outstanding work in this area.
>
> More generally on this series, per the thread ending in [2] I really

My primary goal in all of these changes is to move git-for-windows over
to a default of batch fsync, so that it gets closer to other platforms
in the performance of 'git add' while retaining the same level of data
integrity. I'm hoping that most end-users are just sticking to the
defaults here. I'm happy to change the configuration schema again if
there's a consensus from the Git community that backwards compatibility
of the configuration is actually important to someone.

Also, if we're doing a deeper rethink of the fsync configuration (as
prompted by this work and Eric Wong's and Patrick Steinhardt's work), do
we want to retain a mode where we fsync some parts of the persistent
repo data but not others? If we add fsyncing of the index in addition to
the refs, I believe we would have covered all of the critical data
structures needed to find the data a user has added to the repo if they
complete a series of git commands and then experience a system crash.

> don't get why we have code like this:
>
> @@ -503,10 +504,12 @@ static void unpack_all(void)
>  	if (!quiet)
>  		progress = start_progress(_("Unpacking objects"), nr_objects);
>  	CALLOC_ARRAY(obj_list, nr_objects);
> +	plug_bulk_checkin();
>  	for (i = 0; i < nr_objects; i++) {
>  		unpack_one(i);
>  		display_progress(progress, i + 1);
>  	}
> +	unplug_bulk_checkin();
>  	stop_progress(&progress);
>
>  	if (delta_list)
>
> As opposed to doing an fsync on the last object we're
> processing. I.e. why do we need the step of intentionally making the
> objects unavailable in the tmp-objdir, and creating a "cookie" file to
> sync at the start/end, as opposed to fsyncing on the last file (which
> we're writing out anyway).
>
> 1. https://lore.kernel.org/git/211110.86r1bogg27.gmgdl@xxxxxxxxxxxxxxxxxxx/
> 2. https://lore.kernel.org/git/20211111000349.GA703@neerajsi-x1.localdomain/

It's important not to expose an object's final name until its contents
have been fsynced to disk. We want to ensure that wherever we crash, we
won't have a loose object that Git may later try to open where the
filename doesn't match the content hash. I believe it's okay for a given
OID to be missing, since a later command could recreate it, but an
object with the wrong hash looks like it would persist until we run
git-fsck.
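
To make the ordering concrete, here is a rough, self-contained POSIX
sketch of the invariant I mean. This is not the actual bulk-checkin
code: the "objdir" paths, the two-object batch, and the Linux-specific
sync_file_range() call are just illustrative assumptions, and it leans
on the 'batch' premise that one fsync of a cookie file also makes the
earlier, already-submitted writes durable on common filesystems.

/*
 * Illustrative sketch only (not Git code): write a batch of "objects"
 * under temporary names, do one flush for the whole batch via a dummy
 * cookie file, and only then rename them to their final names.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_tmp(const char *path, const char *data)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (write(fd, data, strlen(data)) < 0)	/* other checks omitted */
		perror("write");
	/* Start writeback to the device, but don't wait for a flush. */
	sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
	close(fd);
}

int main(void)
{
	const char *tmp[]  = { "objdir/tmp_obj_1", "objdir/tmp_obj_2" };
	const char *dest[] = { "objdir/final_1",   "objdir/final_2" };
	int i, fd;

	mkdir("objdir", 0755);

	/* 1. Write every object in the batch, with no fsync per file. */
	for (i = 0; i < 2; i++)
		write_tmp(tmp[i], "loose object contents");

	/*
	 * 2. One hardware flush for the whole batch, via a cookie file.
	 *    Assumption of 'batch' mode: this flush also makes the
	 *    writes from step 1 durable.
	 */
	fd = open("objdir/sync_cookie", O_WRONLY | O_CREAT, 0600);
	fsync(fd);
	close(fd);

	/* 3. Only now expose the final (hash-derived) names. */
	for (i = 0; i < 2; i++)
		rename(tmp[i], dest[i]);

	/* 4. Make the renames themselves durable. */
	fd = open("objdir", O_RDONLY);
	fsync(fd);
	close(fd);
	return 0;
}

If we crash before step 3, the worst case is a few leftover temporary
files and a cookie in the object directory; we never end up with a
final-named object whose contents might not match its name.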
I thought about syncing the last object instead of a separate "cookie"
file, but it wasn't clear how library code could know which object is
actually last without burdening each command with tracking its final
object and communicating that. The 'cookie' approach leads to a cleaner
interface for callers.

Thanks,
Neeraj