"Neeraj Singh via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes: > Batched fsync will fit into bulk-checkin by taking advantage of the > plug/unplug functionality to determine the appropriate time to fsync > and make newly-added objects available in the primary object database. > > * Rename 'state' variable to 'bulk_checkin_state', since we will later > be adding 'bulk_fsync_objdir'. This also makes the variable easier to > find in the debugger, since the name is more unique. > > * Move the 'plugged' data member of 'bulk_checkin_state' into a separate > static variable. Doing this avoids resetting the variable in > finish_bulk_checkin when zeroing the 'bulk_checkin_state'. As-is, we > seem to unintentionally disable the plugging functionality the first > time a new packfile must be created due to packfile size limits. While > disabling the plugging state only results in suboptimal behavior for > the current code, it would be fatal for the bulk-fsync functionality > later in this patch series. Paraphrasing to make sure I understand your reasoning here... In the "plug and then perform as many changes to the repository and finally unplug" flow, before or after this series, the "perform" step in the middle is unaware of which "bulk_checkin_state" instance is being used to keep track of what is done to optimize by deferring some operations until the "unplug" time. So bulk_checkin_state is not there to allow us to create multiple instances of it, pass them around to different sequences of "plug, perform, unplug". Each of its members is inherently a singleton, so in the extreme, we could turn these members into separate file-scope global variables if we wanted to. The "plugged" bit happens to be the only one getting ejected by this patch, because it is inconvenient to "clear" other members otherwise. Is that what is going on? If it is, I am mildly opposed to the flow of thought, from at least two reasons. 
It makes it hard for the next developer to decide whether the new
members they are adding should be in or out of the struct.

More importantly, I think the call of finish_bulk_checkin() we make in
deflate_to_pack() that you found (and there may possibly be other places
where we do so; I didn't check) may not appear to be a bug in the
original context, but it already is a bug. And when we change the
semantics of plug-unplug to be more "transaction-like", it becomes a
more serious bug, as you said.

There is NO reason to end the ongoing transaction there inside the
while() loop that tries to limit the size of the packfile being used. We
may want to flush the "packfile part", which may have been almost
synonymous with the entirety of bulk_checkin_state, but as you found
out, the "plugged" bit is *outside* the "packfile part", and that makes
it a bug to call finish_bulk_checkin() from there.

We should add a new function, flush_bulk_checkin_packfile(), to flush
only the packfile part of the bulk_checkin_state without affecting other
things. The "plugged" bit is the only such thing in the current code
before this series, but it does not have to stay that way. When you
start plugging the loose ref transactions, you may find it handy (this
is me handwaving) to keep in the struct a list of refs that you may have
to do something about at "unplug" time, and you do not want
deflate_to_pack() affecting the ongoing "plugged" ref operations by
calling finish_bulk_checkin() and reinitializing that list, for example.

And then we should examine the existing calls to finish_bulk_checkin()
and replace the ones that should not be finishing, i.e. the ones that
wanted "flush" but called "finish".