do we have too much fsync() configuration in 'next'? (was: [PATCH v7] core.fsync: documentation and user-friendly aggregate options)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Mar 15 2022, Neeraj Singh wrote:

I know this is probably 80% my fault by egging you on about initially
adding the wildmatch() based thing you didn't go for.

But having looked at this with fresh eyes quite deeply I really think
we're severely over-configuring things here:

> +core.fsync::
> +	A comma-separated list of components of the repository that
> +	should be hardened via the core.fsyncMethod when created or
> +	modified.  You can disable hardening of any component by
> +	prefixing it with a '-'.  Items that are not hardened may be
> +	lost in the event of an unclean	system shutdown. Unless you
> +	have special requirements, it is recommended that you leave
> +	this option empty or pick one of `committed`, `added`,
> +	or `all`.
> ++
> +When this configuration is encountered, the set of components starts with
> +the platform default value, disabled components are removed, and additional
> +components are added. `none` resets the state so that the platform default
> +is ignored.
> ++
> +The empty string resets the fsync configuration to the platform
> +default. The default on most platforms is equivalent to
> +`core.fsync=committed,-loose-object`, which has good performance,
> +but risks losing recent work in the event of an unclean system shutdown.
> ++
> +* `none` clears the set of fsynced components.
> +* `loose-object` hardens objects added to the repo in loose-object form.
> +* `pack` hardens objects added to the repo in packfile form.
> +* `pack-metadata` hardens packfile bitmaps and indexes.
> +* `commit-graph` hardens the commit graph file.
> +* `index` hardens the index when it is modified.
> +* `objects` is an aggregate option that is equivalent to
> +  `loose-object,pack`.
> +* `derived-metadata` is an aggregate option that is equivalent to
> +  `pack-metadata,commit-graph`.
> +* `committed` is an aggregate option that is currently equivalent to
> +  `objects`. This mode sacrifices some performance to ensure that work
> +  that is committed to the repository with `git commit` or similar commands
> +  is hardened.
> +* `added` is an aggregate option that is currently equivalent to
> +  `committed,index`. This mode sacrifices additional performance to
> +  ensure that the results of commands like `git add` and similar operations
> +  are hardened.
> +* `all` is an aggregate option that syncs all individual components above.
> +
>  core.fsyncMethod::
>  	A value indicating the strategy Git will use to harden repository data
>  	using fsync and related primitives.

On top of my
https://lore.kernel.org/git/RFC-patch-v2-7.7-a5951366c6e-20220323T140753Z-avarab@xxxxxxxxx/
which makes the tmp-objdir part of your not-in-next-just-seen follow-up
series configurable via "fsyncMethod.batch.quarantine" I really think we
should just go for something like the belwo patch (note that
misspelled/mistook "bulk" for "batch" in that linked-t patch, fixed
below.

I.e. I think we should just do our default fsync() of everything, and
probably SOON make the fsync-ing of loose objects the default. Those who
care about performance will have "batch" (or "writeout-only"), which we
can have OS-specific detections for.

But really, all of the rest of this is unduly boxing us into
overconfiguration that I think nobody really needs.

If someone really needs this level of detail they can LD_PRELOAD
something to have fsync intercept fd's and paths, and act appropriately.

Worse, as the RFC series I sent
(https://lore.kernel.org/git/RFC-cover-v2-0.7-00000000000-20220323T140753Z-avarab@xxxxxxxxx/)
shows we can and should "batch" up fsync() operations across these
configuration boundaries, which this level of configuration would seem
to preclude.

Or, we'd need to explain why "core.fsync=loose-object" won't *actually*
call fsync() on a single loose object's fd under "batch" as I had to do
on top of this in
https://lore.kernel.org/git/RFC-patch-v2-6.7-c20301d7967-20220323T140753Z-avarab@xxxxxxxxx/

The same is going to apply for almost all of the rest of these
configuration categories.

I.e. a natural follow-up to e.g. batching across objects & index as I'm
doing in
https://lore.kernel.org/git/RFC-patch-v2-4.7-61f4f3d7ef4-20220323T140753Z-avarab@xxxxxxxxx/
is to do likewise for all the PACK-related stuff before we rename it
in-place. Or even have "git gc" issue only a single fsync() for all of
PACKs, their metadata files, commit-graph etc., and then rename() things
in-place as appropriate afterwards.

diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index 365a12dc7ae..536238e209b 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -548,49 +548,35 @@ core.whitespace::
   errors. The default tab width is 8. Allowed values are 1 to 63.
 
 core.fsync::
-	A comma-separated list of components of the repository that
-	should be hardened via the core.fsyncMethod when created or
-	modified.  You can disable hardening of any component by
-	prefixing it with a '-'.  Items that are not hardened may be
-	lost in the event of an unclean	system shutdown. Unless you
-	have special requirements, it is recommended that you leave
-	this option empty or pick one of `committed`, `added`,
-	or `all`.
-+
-When this configuration is encountered, the set of components starts with
-the platform default value, disabled components are removed, and additional
-components are added. `none` resets the state so that the platform default
-is ignored.
-+
-The empty string resets the fsync configuration to the platform
-default. The default on most platforms is equivalent to
-`core.fsync=committed,-loose-object`, which has good performance,
-but risks losing recent work in the event of an unclean system shutdown.
-+
-* `none` clears the set of fsynced components.
-* `loose-object` hardens objects added to the repo in loose-object form.
-* `pack` hardens objects added to the repo in packfile form.
-* `pack-metadata` hardens packfile bitmaps and indexes.
-* `commit-graph` hardens the commit graph file.
-* `index` hardens the index when it is modified.
-* `objects` is an aggregate option that is equivalent to
-  `loose-object,pack`.
-* `derived-metadata` is an aggregate option that is equivalent to
-  `pack-metadata,commit-graph`.
-* `committed` is an aggregate option that is currently equivalent to
-  `objects`. This mode sacrifices some performance to ensure that work
-  that is committed to the repository with `git commit` or similar commands
-  is hardened.
-* `added` is an aggregate option that is currently equivalent to
-  `committed,index`. This mode sacrifices additional performance to
-  ensure that the results of commands like `git add` and similar operations
-  are hardened.
-* `all` is an aggregate option that syncs all individual components above.
+	A boolen defaulting to `true`. To ensure data integrity git
+	will fsync() its objects, index and refu updates etc. This can
+	be set to `false` to disable `fsync()`-ing.
++
+Only set this to `false` if you know what you're doing, and are
+prepared to deal with data corruption. Valid use-cases include
+throwaway uses of repositories on ramdisks, one-off mass-imports
+followed by calling `sync(1)` etc.
++
+Note that the syncing of loose objects is currently excluded from
+`core.fsync=true`. To turn on all fsync-ing you'll need
+`core.fsync=true` and `core.fsyncObjectFiles=true`, but see
+`core.fsyncMethod=batch` below for a much faster alternative that's
+just as safe on various modern OS's.
++
+The default is in flux and may change in the future, in particular the
+equivalent of the already-deprecated `core.fsyncObjectFiles` setting
+might soon default to `true`, and `core.fsyncMethod`'s default of
+`fsync` might default to a setting deemed to be safe on the local OS,
+suc has `batch` or `writeout-only`
 
 core.fsyncMethod::
 	A value indicating the strategy Git will use to harden repository data
 	using fsync and related primitives.
 +
+Defaults to `fsync`, but as discussed for `core.fsync` above might
+change to use one of the values below taking advantage of
+platform-specific "faster `fsync()`".
++
 * `fsync` uses the fsync() system call or platform equivalents.
 * `writeout-only` issues pagecache writeback requests, but depending on the
   filesystem and storage hardware, data added to the repository may not be
@@ -680,8 +666,8 @@ backed up by any standard (e.g. POSIX), but worked in practice on some
 Linux setups.
 +
 Nowadays you should almost certainly want to use
-`core.fsync=loose-object` instead in combination with
-`core.fsyncMethod=bulk`, and possibly with
+`core.fsync=true` instead in combination with
+`core.fsyncMethod=batch`, and possibly with
 `fsyncMethod.batch.quarantine=true`, see above. On modern OS's (Linux,
 OSX, Windows) that gives you most of the performance benefit of
 `core.fsyncObjectFiles=false` with all of the safety of the old



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux