On Wed, Mar 30, 2022 at 10:11 AM Junio C Hamano <gitster@xxxxxxxxx> wrote: > > "Neeraj Singh via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes: > > > Batched fsync will fit into bulk-checkin by taking advantage of the > > plug/unplug functionality to determine the appropriate time to fsync > > and make newly-added objects available in the primary object database. > > > > * Rename 'state' variable to 'bulk_checkin_state', since we will later > > be adding 'bulk_fsync_objdir'. This also makes the variable easier to > > find in the debugger, since the name is more unique. > > > > * Move the 'plugged' data member of 'bulk_checkin_state' into a separate > > static variable. Doing this avoids resetting the variable in > > finish_bulk_checkin when zeroing the 'bulk_checkin_state'. As-is, we > > seem to unintentionally disable the plugging functionality the first > > time a new packfile must be created due to packfile size limits. While > > disabling the plugging state only results in suboptimal behavior for > > the current code, it would be fatal for the bulk-fsync functionality > > later in this patch series. > > Paraphrasing to make sure I understand your reasoning here... > > In the "plug and then perform as many changes to the repository and > finally unplug" flow, before or after this series, the "perform" > step in the middle is unaware of which "bulk_checkin_state" instance > is being used to keep track of what is done to optimize by deferring > some operations until the "unplug" time. So bulk_checkin_state is > not there to allow us to create multiple instances of it, pass them > around to different sequences of "plug, perform, unplug". Each of > its members is inherently a singleton, so in the extreme, we could > turn these members into separate file-scope global variables if we > wanted to. The "plugged" bit happens to be the only one getting > ejected by this patch, because it is inconvenient to "clear" other > members otherwise. > > Is that what is going on? > More or less. The current state is all about creating a single packfile for multiple large objects. That packfile is a singleton today (we could have an alternate implementation where there's a separate packfile per thread in the future, so it's not inherent to the API). We want to do this if the top-level caller is okay with the state being invisible until the "finish" call, and that is conveyed by the "plugged" flag. > If it is, I am mildly opposed to the flow of thought, from at least > two reasons. It makes it hard for the next developer to decide if > the new members they are adding should be in or out of the struct. > > More importantly, I think the call of finish_bulk_checkin() we make > in deflate_to_pack() you found (and there may possibly other places > that we do so; I didn't check) may not appear to be a bug in the > original context, but it already is a bug. And when we change the > semantics of plug-unplug to be more "transaction-like", it becomes a > more serious bug, as you said. > > There is NO reason to end the ongoing transaction there inside the > while() loop that tries to limit the size of the packfile being > used. We may want to flush the "packfile part", which may have been > almost synonymous to the entirety of bulk_checkin_state, but as you > found out, the "plugged" bit is *outside* the "packfile part", and > that makes it a bug to call finish_bulk_checkin() from there. > > We should add a new function, flush_bulk_checking_packfile(), to > flush only the packfile part of the bulk_checkin_state without > affecting other things---the "plugged" bit is the only one in the > current code before this series, but it does not have to stay to be > so I'm happy to rename the packfile related stuff to end with _packfile to make it clear that all of that state and functionality is related to batching of packfile additions. So from this patch: s/bulk_checkin_state/bulk_checkin_packfile and s/finish_bulk_checkin/finish_bulk_checkin_packfile. My new state will be bulk_fsync_* (as it is already). Any future ODB-related state can go here too (I'm imagining a future log-structured 'new objects pack' that we can append to for adding small objects, similar to the bulk_checkin_packfile but allowing appends from multiple git invocations). > When you start plugging the loose ref transactions, you may > find it handy (this is me handwaving) to have a list of refs that > you may have to do something at "unplug" time kept in the struct, > and you do not want deflate_to_pack() affecting the ongoing > "plugged" ref operations by calling finish_bulk_checkin() and > reinitializing that list, for example. > I don't believe ref transactions will go through this part of the infrastructure. Refs already have a good transaction system (that partly inspired this rebranding, after I saw how Peter implemented batch ref fsync). I expect this area will remain all about the ODB as a subsystem that can enlist in a larger repo->transaction. So a top-level Git command might initiate a repo transaction, which would internally initiate an ODB transaction, index transaction, and ref transaction. The repo transaction would support flushing each of the subtransactions with an optimal number of fsyncs. > And then we should examine existing calls to finish_bulk_checkin() > and replace the ones that should not be finishing, i.e. the ones > that wanted "flush" but called "finish". Sure. I can fix this, which will only change this file. The only case of "finishing" would be in unplug_bulk_checkin / end_odb_transaction. Thanks, Neeraj