[PATCH v2 0/7] core.fsyncmethod: add 'batch' mode for faster fsyncing of multiple objects

"Neeraj K. Singh via GitGitGadget" <gitgitgadget@xxxxxxxxx> · Sun, 20 Mar 2022 07:15:53 +0000

V2 changes:

 * Change doc to indicate that only some repo updates are batched
 * Null and zero out control variables in do_batch_fsync under
   unplug_bulk_checkin
 * Make batch mode default on Windows.
 * Update the description for the initial patch that cleans up the
   bulk-checkin infrastructure.
 * Rebase onto 'seen' at 0cac37f38f9.

--Original definition-- When core.fsync includes loose-object, we issue an
fsync after every written object. For a 'git-add' or similar command that
adds a lot of files to the repo, the costs of these fsyncs adds up. One
major factor in this cost is the time it takes for the physical storage
controller to flush its caches to durable media.

This series takes advantage of the writeout-only mode of git_fsync to issue
OS cache writebacks for all of the objects being added to the repository
followed by a single fsync to a dummy file, which should trigger a
filesystem log flush and storage controller cache flush. This mechanism is
known to be safe on common Windows filesystems and expected to be safe on
macOS. Some linux filesystems, such as XFS, will probably do the right thing
as well. See [1] for previous discussion on the predecessor of this patch
series.

This series is important on Windows, where loose-objects are included in the
fsync set by default in Git-For-Windows. In this series, I'm also setting
the default mode for Windows to turn on loose object fsyncing with batch
mode, so that we can get CI coverage of the actual git-for-windows
configuration upstream. We still don't actually issue fsyncs for the test
suite since GIT_TEST_FSYNC is set to 0, but we exercise all of the
surrounding batch mode code.

This work is based on 'seen' at . It's dependent on ns/core-fsyncmethod.

[1]
https://lore.kernel.org/git/2c1ddef6057157d85da74a7274e03eacf0374e45.1629856293.git.gitgitgadget@xxxxxxxxx/

Neeraj Singh (7):
  bulk-checkin: rename 'state' variable and separate 'plugged' boolean
  core.fsyncmethod: batched disk flushes for loose-objects
  update-index: use the bulk-checkin infrastructure
  unpack-objects: use the bulk-checkin infrastructure
  core.fsync: use batch mode and sync loose objects by default on
    Windows
  core.fsyncmethod: tests for batch mode
  core.fsyncmethod: performance tests for add and stash

 Documentation/config/core.txt |  7 +++
 builtin/unpack-objects.c      |  3 ++
 builtin/update-index.c        |  6 +++
 bulk-checkin.c                | 92 +++++++++++++++++++++++++++++++----
 bulk-checkin.h                |  2 +
 cache.h                       | 12 ++++-
 compat/mingw.h                |  3 ++
 config.c                      |  4 +-
 git-compat-util.h             |  2 +
 object-file.c                 |  2 +
 t/lib-unique-files.sh         | 36 ++++++++++++++
 t/perf/p3700-add.sh           | 59 ++++++++++++++++++++++
 t/perf/p3900-stash.sh         | 62 +++++++++++++++++++++++
 t/perf/perf-lib.sh            |  4 +-
 t/t3700-add.sh                | 22 +++++++++
 t/t3903-stash.sh              | 17 +++++++
 t/t5300-pack-object.sh        | 32 +++++++-----
 17 files changed, 340 insertions(+), 25 deletions(-)
 create mode 100644 t/lib-unique-files.sh
 create mode 100755 t/perf/p3700-add.sh
 create mode 100755 t/perf/p3900-stash.sh

base-commit: 0cac37f38f94bb93550eb164b5d574cd96e23785
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1134%2Fneerajsi-msft%2Fns%2Fbatched-fsync-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1134/neerajsi-msft/ns/batched-fsync-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1134

Range-diff vs v1:

 1:  a77d02df626 ! 1:  9c2abd12bbb bulk-checkin: rename 'state' variable and separate 'plugged' boolean
     @@ Metadata
       ## Commit message ##
          bulk-checkin: rename 'state' variable and separate 'plugged' boolean

     -    Preparation for adding bulk-fsync to the bulk-checkin.c infrastructure.
     +    This commit prepares for adding batch-fsync to the bulk-checkin
     +    infrastructure.
     +
     +    The bulk-checkin infrastructure is currently used to batch up addition
     +    of large blobs to a packfile. When a blob is larger than
     +    big_file_threshold, we unconditionally add it to a pack. If bulk
     +    checkins are 'plugged', we allow multiple large blobs to be added to a
     +    single pack until we reach the packfile size limit; otherwise, we simply
     +    make a new packfile for each large blob. The 'unplug' call tells us when
     +    the series of blob additions is done so that we can finish the packfiles
     +    and make their objects available to subsequent operations.
     +
     +    Stated another way, bulk-checkin allows callers to define a transaction
     +    that adds multiple objects to the object database, where the object
     +    database can optimize its internal operations within the transaction
     +    boundary.
     +
     +    Batched fsync will fit into bulk-checkin by taking advantage of the
     +    plug/unplug functionality to determine the appropriate time to fsync
     +    and make newly-added objects available in the primary object database.

          * Rename 'state' variable to 'bulk_checkin_state', since we will later
            be adding 'bulk_fsync_objdir'.  This also makes the variable easier to
 2:  d38f20b4430 ! 2:  3ed1dcd9b9b core.fsyncmethod: batched disk flushes for loose-objects
     @@ Documentation/config/core.txt: core.fsyncMethod::
         filesystem and storage hardware, data added to the repository may not be
         durable in the event of a system crash. This is the default mode on macOS.
      +* `batch` enables a mode that uses writeout-only flushes to stage multiple
     -+  updates in the disk writeback cache and then a single full fsync to trigger
     -+  the disk cache flush at the end of the operation. This mode is expected to
     ++  updates in the disk writeback cache and then does a single full fsync of
     ++  a dummy file to trigger the disk cache flush at the end of the operation.
     ++  Currently `batch` mode only applies to loose-object files. Other repository
     ++  data is made durable as if `fsync` was specified. This mode is expected to
      +  be as safe as `fsync` on macOS for repos stored on HFS+ or APFS filesystems
      +  and on Windows for repos stored on NTFS or ReFS filesystems.

     @@ bulk-checkin.c: clear_exit:
      +		fsync_or_die(get_tempfile_fd(temp), get_tempfile_path(temp));
      +		delete_tempfile(&temp);
      +		strbuf_release(&temp_path);
     ++		needs_batch_fsync = 0;
      +	}
      +
     -+	if (bulk_fsync_objdir)
     ++	if (bulk_fsync_objdir) {
      +		tmp_objdir_migrate(bulk_fsync_objdir);
     ++		bulk_fsync_objdir = NULL;
     ++	}
      +}
      +
       static int already_written(struct bulk_checkin_state *state, struct object_id *oid)
 3:  b0480f0c814 = 3:  54797dbc520 update-index: use the bulk-checkin infrastructure
 4:  99e3a61b919 = 4:  6662e2dae0f unpack-objects: use the bulk-checkin infrastructure
 5:  4e56c58c8cb ! 5:  03bf591742a core.fsync: use batch mode and sync loose objects by default on Windows
     @@ Commit message

          Signed-off-by: Neeraj Singh <neerajsi@xxxxxxxxxxxxx>

     -    change fsyncmethod to batch as well
     -
       ## cache.h ##
      @@ cache.h: enum fsync_component {
       			      FSYNC_COMPONENT_INDEX | \
 6:  88e47047d79 = 6:  1937746df47 core.fsyncmethod: tests for batch mode
 7:  876741f1ef9 = 7:  624244078c7 core.fsyncmethod: performance tests for add and stash

-- 
gitgitgadget