Re: [PATCH v3 01/11] bulk-checkin: rebrand plug/unplug APIs as 'odb transactions'

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Thu, 24 Mar 2022 17:10:24 +0100

On Thu, Mar 24 2022, Neeraj Singh via GitGitGadget wrote:

> From: Neeraj Singh <neerajsi@xxxxxxxxxxxxx>
>
> Make it clearer in the naming and documentation of the plug_bulk_checkin
> and unplug_bulk_checkin APIs that they can be thought of as
> a "transaction" to optimize operations on the object database.
>
> Signed-off-by: Neeraj Singh <neerajsi@xxxxxxxxxxxxx>
> ---
>  builtin/add.c  |  4 ++--
>  bulk-checkin.c |  4 ++--
>  bulk-checkin.h | 14 ++++++++++++--
>  3 files changed, 16 insertions(+), 6 deletions(-)
>
> diff --git a/builtin/add.c b/builtin/add.c
> index 3ffb86a4338..9bf37ceae8e 100644
> --- a/builtin/add.c
> +++ b/builtin/add.c
> @@ -670,7 +670,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
>  		string_list_clear(&only_match_skip_worktree, 0);
>  	}
>  
> -	plug_bulk_checkin();
> +	begin_odb_transaction();
>  
>  	if (add_renormalize)
>  		exit_status |= renormalize_tracked_files(&pathspec, flags);
> @@ -682,7 +682,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
>  
>  	if (chmod_arg && pathspec.nr)
>  		exit_status |= chmod_pathspec(&pathspec, chmod_arg[0], show_only);
> -	unplug_bulk_checkin();
> +	end_odb_transaction();

Aside from anything else we've (dis)agreed on, I found this part really
odd when hacking on my RFC-on-top, i.e. originally I (wrongly) thought
the plug_bulk_checkin() was something that originated with this series
which adds the "bulk" mode.

But no, on second inspection it's a thing Junio added a long time ago so
that in this case we "stream to N pack" where we'd otherwise add N loose
objects.

Which, and I think Junio brought this up in an earlier round, but I
didn't fully understand that at the time makes this whole thing quite
odd to me.

So first, shouldn't we add this begin_odb_transaction() as a new thing?
I.e. surely wanting to do that object target redirection within a given
begin/end "scope" should be orthagonal to how fsync() happens within
that "scope", though in this case that happens to correspond.

And secondly, per the commit message and comment when it was added in
(568508e7657 (bulk-checkin: replace fast-import based implementation,
2011-10-28)) is it something we need *for that purpose* with the series
to unpack-objects without malloc()ing the size of the blob[1].

And, if so and orthagonal to that: If we know how to either stream N
objects to a PACK (as fast-import does), *and* we now (or SOON) know how
to stream loose objects without using size(blob) amounts of memory,
doesn't the "optimize fsync()" rather want to make use of the
stream-to-pack approach?

I.e. have you tried for the caseses where we create say 1k objects for
"git stash" tried to stream those to a pack? How does that compare (both
with/without the fsync changes).

I.e. I do worry (also per [2]) that while the whole "bulk fsync" is neat
(and I think can use it in either case, to defer object syncs until the
"index" or "ref" sync, as my RFC does) I worry that we're adding a bunch
of configuration and complexity for something that:

 1. Ultimately isn't all that important, as already for part of it we
    can mostly configure it away. I.e. "git-unpack-objects" v.s. writing
    a pack, cf. transfer.unpackLimit)
 2. We don't have #1 for "add" and "update-index", but if we stream to
    packs there is there any remaining benefit in practice?

1. https://lore.kernel.org/git/cover-v11-0.8-00000000000-20220319T001411Z-avarab@xxxxxxxxx/
2. https://lore.kernel.org/git/220323.86fsn8ohg8.gmgdl@xxxxxxxxxxxxxxxxxxx/