V2 changes: * Change doc to indicate that only some repo updates are batched * Null and zero out control variables in do_batch_fsync under unplug_bulk_checkin * Make batch mode default on Windows. * Update the description for the initial patch that cleans up the bulk-checkin infrastructure. * Rebase onto 'seen' at 0cac37f38f9. --Original definition-- When core.fsync includes loose-object, we issue an fsync after every written object. For a 'git-add' or similar command that adds a lot of files to the repo, the costs of these fsyncs adds up. One major factor in this cost is the time it takes for the physical storage controller to flush its caches to durable media. This series takes advantage of the writeout-only mode of git_fsync to issue OS cache writebacks for all of the objects being added to the repository followed by a single fsync to a dummy file, which should trigger a filesystem log flush and storage controller cache flush. This mechanism is known to be safe on common Windows filesystems and expected to be safe on macOS. Some linux filesystems, such as XFS, will probably do the right thing as well. See [1] for previous discussion on the predecessor of this patch series. This series is important on Windows, where loose-objects are included in the fsync set by default in Git-For-Windows. In this series, I'm also setting the default mode for Windows to turn on loose object fsyncing with batch mode, so that we can get CI coverage of the actual git-for-windows configuration upstream. We still don't actually issue fsyncs for the test suite since GIT_TEST_FSYNC is set to 0, but we exercise all of the surrounding batch mode code. This work is based on 'seen' at . It's dependent on ns/core-fsyncmethod. [1] https://lore.kernel.org/git/2c1ddef6057157d85da74a7274e03eacf0374e45.1629856293.git.gitgitgadget@xxxxxxxxx/ Neeraj Singh (7): bulk-checkin: rename 'state' variable and separate 'plugged' boolean core.fsyncmethod: batched disk flushes for loose-objects update-index: use the bulk-checkin infrastructure unpack-objects: use the bulk-checkin infrastructure core.fsync: use batch mode and sync loose objects by default on Windows core.fsyncmethod: tests for batch mode core.fsyncmethod: performance tests for add and stash Documentation/config/core.txt | 7 +++ builtin/unpack-objects.c | 3 ++ builtin/update-index.c | 6 +++ bulk-checkin.c | 92 +++++++++++++++++++++++++++++++---- bulk-checkin.h | 2 + cache.h | 12 ++++- compat/mingw.h | 3 ++ config.c | 4 +- git-compat-util.h | 2 + object-file.c | 2 + t/lib-unique-files.sh | 36 ++++++++++++++ t/perf/p3700-add.sh | 59 ++++++++++++++++++++++ t/perf/p3900-stash.sh | 62 +++++++++++++++++++++++ t/perf/perf-lib.sh | 4 +- t/t3700-add.sh | 22 +++++++++ t/t3903-stash.sh | 17 +++++++ t/t5300-pack-object.sh | 32 +++++++----- 17 files changed, 340 insertions(+), 25 deletions(-) create mode 100644 t/lib-unique-files.sh create mode 100755 t/perf/p3700-add.sh create mode 100755 t/perf/p3900-stash.sh base-commit: 0cac37f38f94bb93550eb164b5d574cd96e23785 Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1134%2Fneerajsi-msft%2Fns%2Fbatched-fsync-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1134/neerajsi-msft/ns/batched-fsync-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/1134 Range-diff vs v1: 1: a77d02df626 ! 1: 9c2abd12bbb bulk-checkin: rename 'state' variable and separate 'plugged' boolean @@ Metadata ## Commit message ## bulk-checkin: rename 'state' variable and separate 'plugged' boolean - Preparation for adding bulk-fsync to the bulk-checkin.c infrastructure. + This commit prepares for adding batch-fsync to the bulk-checkin + infrastructure. + + The bulk-checkin infrastructure is currently used to batch up addition + of large blobs to a packfile. When a blob is larger than + big_file_threshold, we unconditionally add it to a pack. If bulk + checkins are 'plugged', we allow multiple large blobs to be added to a + single pack until we reach the packfile size limit; otherwise, we simply + make a new packfile for each large blob. The 'unplug' call tells us when + the series of blob additions is done so that we can finish the packfiles + and make their objects available to subsequent operations. + + Stated another way, bulk-checkin allows callers to define a transaction + that adds multiple objects to the object database, where the object + database can optimize its internal operations within the transaction + boundary. + + Batched fsync will fit into bulk-checkin by taking advantage of the + plug/unplug functionality to determine the appropriate time to fsync + and make newly-added objects available in the primary object database. * Rename 'state' variable to 'bulk_checkin_state', since we will later be adding 'bulk_fsync_objdir'. This also makes the variable easier to 2: d38f20b4430 ! 2: 3ed1dcd9b9b core.fsyncmethod: batched disk flushes for loose-objects @@ Documentation/config/core.txt: core.fsyncMethod:: filesystem and storage hardware, data added to the repository may not be durable in the event of a system crash. This is the default mode on macOS. +* `batch` enables a mode that uses writeout-only flushes to stage multiple -+ updates in the disk writeback cache and then a single full fsync to trigger -+ the disk cache flush at the end of the operation. This mode is expected to ++ updates in the disk writeback cache and then does a single full fsync of ++ a dummy file to trigger the disk cache flush at the end of the operation. ++ Currently `batch` mode only applies to loose-object files. Other repository ++ data is made durable as if `fsync` was specified. This mode is expected to + be as safe as `fsync` on macOS for repos stored on HFS+ or APFS filesystems + and on Windows for repos stored on NTFS or ReFS filesystems. @@ bulk-checkin.c: clear_exit: + fsync_or_die(get_tempfile_fd(temp), get_tempfile_path(temp)); + delete_tempfile(&temp); + strbuf_release(&temp_path); ++ needs_batch_fsync = 0; + } + -+ if (bulk_fsync_objdir) ++ if (bulk_fsync_objdir) { + tmp_objdir_migrate(bulk_fsync_objdir); ++ bulk_fsync_objdir = NULL; ++ } +} + static int already_written(struct bulk_checkin_state *state, struct object_id *oid) 3: b0480f0c814 = 3: 54797dbc520 update-index: use the bulk-checkin infrastructure 4: 99e3a61b919 = 4: 6662e2dae0f unpack-objects: use the bulk-checkin infrastructure 5: 4e56c58c8cb ! 5: 03bf591742a core.fsync: use batch mode and sync loose objects by default on Windows @@ Commit message Signed-off-by: Neeraj Singh <neerajsi@xxxxxxxxxxxxx> - change fsyncmethod to batch as well - ## cache.h ## @@ cache.h: enum fsync_component { FSYNC_COMPONENT_INDEX | \ 6: 88e47047d79 = 6: 1937746df47 core.fsyncmethod: tests for batch mode 7: 876741f1ef9 = 7: 624244078c7 core.fsyncmethod: performance tests for add and stash -- gitgitgadget