On Thu, Mar 24, 2022 at 9:24 AM Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote:
>
>
> On Thu, Mar 24 2022, Neeraj Singh via GitGitGadget wrote:
>
> > From: Neeraj Singh <neerajsi@xxxxxxxxxxxxx>
> >
> > Make it clearer in the naming and documentation of the plug_bulk_checkin
> > and unplug_bulk_checkin APIs that they can be thought of as
> > a "transaction" to optimize operations on the object database.
> >
> > Signed-off-by: Neeraj Singh <neerajsi@xxxxxxxxxxxxx>
> > ---
> >  builtin/add.c  |  4 ++--
> >  bulk-checkin.c |  4 ++--
> >  bulk-checkin.h | 14 ++++++++++++--
> >  3 files changed, 16 insertions(+), 6 deletions(-)
> >
> > diff --git a/builtin/add.c b/builtin/add.c
> > index 3ffb86a4338..9bf37ceae8e 100644
> > --- a/builtin/add.c
> > +++ b/builtin/add.c
> > @@ -670,7 +670,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
> >  		string_list_clear(&only_match_skip_worktree, 0);
> >  	}
> >
> > -	plug_bulk_checkin();
> > +	begin_odb_transaction();
> >
> >  	if (add_renormalize)
> >  		exit_status |= renormalize_tracked_files(&pathspec, flags);
> > @@ -682,7 +682,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
> >
> >  	if (chmod_arg && pathspec.nr)
> >  		exit_status |= chmod_pathspec(&pathspec, chmod_arg[0], show_only);
> > -	unplug_bulk_checkin();
> > +	end_odb_transaction();
>
> Aside from anything else we've (dis)agreed on, I found this part really
> odd when hacking on my RFC-on-top, i.e. originally I (wrongly) thought
> the plug_bulk_checkin() was something that originated with this series
> which adds the "bulk" mode.
>
> But no, on second inspection it's a thing Junio added a long time ago so
> that in this case we "stream to N pack" where we'd otherwise add N loose
> objects.
>
> Which (and I think Junio brought this up in an earlier round, but I
> didn't fully understand it at the time) makes this whole thing quite
> odd to me.
>
> So first, shouldn't we add this begin_odb_transaction() as a new thing?
> I.e. surely wanting to do that object target redirection within a given
> begin/end "scope" should be orthogonal to how fsync() happens within
> that "scope", though in this case that happens to correspond.
>
> And secondly, per the commit message and comment when it was added in
> 568508e7657 (bulk-checkin: replace fast-import based implementation,
> 2011-10-28), is it something we need *for that purpose* with the series
> to unpack-objects without malloc()ing the size of the blob[1]?

The original change seems to be about optimizing the addition of
successive large blobs to the ODB when we know we have a large batch.
It's a batch-mode optimization for the ODB, similar to my patch series,
just targeting large blobs rather than small blobs/trees. It also has
the same property that the added data is "invisible" until the
transaction ends.
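To make the begin/end semantics concrete, here is a toy standalone C
model of the pattern. Only the begin_odb_transaction() and
end_odb_transaction() names come from the patch; write_object() and the
function bodies are illustrative stand-ins, not git's actual
bulk-checkin code:

#include <stdio.h>

/*
 * Toy model: object writes inside the transaction are deferred
 * (queued) and only become visible/durable at end_odb_transaction(),
 * which is where a single batched sync would happen.
 */
static int in_transaction;
static int pending;

static void begin_odb_transaction(void)
{
	in_transaction = 1;
}

static void write_object(const char *name)
{
	if (in_transaction) {
		pending++;	/* defer visibility/fsync until "end" */
		printf("queued %s\n", name);
	} else {
		printf("wrote and synced %s\n", name);
	}
}

static void end_odb_transaction(void)
{
	printf("flushing %d object(s) with one sync\n", pending);
	pending = 0;
	in_transaction = 0;
}

int main(void)
{
	begin_odb_transaction();
	write_object("blob-1");
	write_object("blob-2");
	end_odb_transaction();	/* the batch becomes visible here */
	return 0;
}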
> And, if so and orthogonal to that: If we know how to either stream N
> objects to a PACK (as fast-import does), *and* we now (or SOON) know how
> to stream loose objects without using size(blob) amounts of memory,
> doesn't the "optimize fsync()" rather want to make use of the
> stream-to-pack approach?
>
> I.e. have you tried, for the cases where we create say 1k objects for
> "git stash", to stream those to a pack? How does that compare (both
> with/without the fsync changes)?
>
> I.e. I do worry (also per [2]) that while the whole "bulk fsync" is neat
> (and I think we can use it in either case, to defer object syncs until
> the "index" or "ref" sync, as my RFC does) we're adding a bunch of
> configuration and complexity for something that:
>
>  1. Ultimately isn't all that important, as for part of it we can
>     already mostly configure it away, i.e. "git-unpack-objects" vs.
>     writing a pack (cf. transfer.unpackLimit).
>
>  2. We don't have #1 for "add" and "update-index", but if we stream to
>     packs is there any remaining benefit in practice?
>
> 1. https://lore.kernel.org/git/cover-v11-0.8-00000000000-20220319T001411Z-avarab@xxxxxxxxx/
> 2. https://lore.kernel.org/git/220323.86fsn8ohg8.gmgdl@xxxxxxxxxxxxxxxxxxx/

Stream to pack is a good idea, but I think we'd want a way to append to
the most recent pack so that we don't explode the number of packs. A
large pack count seems to impose a linear cost on ODB operations, if
only to load up all the pack indexes. I think this is orthogonal,
though, and we can always change the meaning of batch mode to use a
pack mechanism when such a mechanism is ready.

Thanks,
Neeraj
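P.S. A toy sketch of the linear cost I mean above. This is a
hypothetical, simplified model in standalone C (not git's actual pack
lookup code): with N packs, a lookup (and especially a miss) has to
probe every pack's sorted index in turn, so it degrades from O(log M)
to O(N log M).

#include <stdio.h>
#include <string.h>

/* Hypothetical model: each "pack index" is a sorted list of object
 * names; a lookup that misses must binary-search every one of them. */
#define NR_PACKS 3

static const char *pack_index[NR_PACKS][2] = {
	{ "aaaa", "bbbb" },
	{ "cccc", "dddd" },
	{ "eeee", "ffff" },
};

static int in_pack(const char *idx[], int n, const char *oid)
{
	int lo = 0, hi = n;

	while (lo < hi) {	/* binary search: O(log M) per pack */
		int mid = (lo + hi) / 2;
		int cmp = strcmp(oid, idx[mid]);

		if (!cmp)
			return 1;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return 0;
}

static int find_object(const char *oid)
{
	int i;

	for (i = 0; i < NR_PACKS; i++)	/* linear in the pack count */
		if (in_pack(pack_index[i], 2, oid))
			return 1;
	return 0;
}

int main(void)
{
	printf("cccc: %d\n", find_object("cccc"));	/* hit in pack 2 */
	printf("zzzz: %d\n", find_object("zzzz"));	/* miss: probes all */
	return 0;
}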