On Tue, Mar 15, 2022 at 10:33 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> "Neeraj Singh via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
>
> > From: Neeraj Singh <neerajsi@xxxxxxxxxxxxx>
> >
> > Preparation for adding bulk-fsync to the bulk-checkin.c infrastructure.
> >
> > * Rename 'state' variable to 'bulk_checkin_state', since we will later
> >   be adding 'bulk_fsync_objdir'. This also makes the variable easier to
> >   find in the debugger, since the name is more unique.
> >
> > * Move the 'plugged' data member of 'bulk_checkin_state' into a separate
> >   static variable. Doing this avoids resetting the variable in
> >   finish_bulk_checkin when zeroing the 'bulk_checkin_state'. As-is, we
> >   seem to unintentionally disable the plugging functionality the first
> >   time a new packfile must be created due to packfile size limits. While
> >   disabling the plugging state only results in suboptimal behavior for
> >   the current code, it would be fatal for the bulk-fsync functionality
> >   later in this patch series.
>
> Sorry, but I am confused. The bulk-checkin infrastructure is there
> so that we can send many little objects into a single packfile
> instead of creating many little loose object files. Everything we
> throw at object-file.c::index_stream() will be concatenated into the
> single packfile while we are "plugged" until we get "unplugged".
>

I noticed that you invented bulk-checkin back in 2011, but I don't
think your description matches what the code actually does.
index_bulk_checkin is only called from index_stream, which is only
called from index_fd. index_fd goes down the index_bulk_checkin path
only for large files (above core.bigFileThreshold, 512MB by default).
It looks like the effect of the 'plug/unplug' code is to allow
multiple large blobs to go into a single packfile rather than each
one getting its own separate packfile.

> My understanding of what you are doing in this series is to still
> create many little loose object files, but avoid the overhead of
> having to fsync them individually. And I am not sure how well the
> original idea behind the bulk-checkin infrastructure to avoid
> overhead of having to create many loose objects by creating a single
> packfile (and presumably having to fsync at the end, but that is
> just a single .pack file) with your goal of still creating many
> loose object files but synching them more efficiently.
>
> Is it just the new feature is piggybacking on the existing bulk
> checkin infrastructure, even though these two have nothing in
> common?
>

I think my new usage is congruent with the existing API, which seems
to be about combining multiple add operations into a large
transaction, where we can do some cleanup operations once we're
finished. In the preexisting code, the transaction is about adding a
bunch of large objects to a single packfile (while leaving small
objects loose), and then completing the packfile when the adds are
finished.
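To make that concrete, here is roughly how the caller side of that
transaction looks to me. This is a simplified sketch loosely modeled
on what 'git add' does today; add_many_paths and the error handling
are made up for illustration, while plug_bulk_checkin,
unplug_bulk_checkin, add_file_to_index, and the_index are the
existing pieces:

	#include "cache.h"
	#include "bulk-checkin.h"

	/*
	 * Illustrative only: add a batch of paths as one bulk-checkin
	 * "transaction".  Not an excerpt of builtin/add.c.
	 */
	static void add_many_paths(const char **paths, int nr)
	{
		int i;

		/*
		 * Defer finish_bulk_checkin() until we are done adding,
		 * so every large blob we stream goes into one shared
		 * packfile instead of getting its own.
		 */
		plug_bulk_checkin();

		for (i = 0; i < nr; i++) {
			/*
			 * add_file_to_index() eventually reaches
			 * index_fd(), which takes the index_bulk_checkin()
			 * path for blobs over core.bigFileThreshold.
			 */
			if (add_file_to_index(&the_index, paths[i], 0))
				die("failed to add %s", paths[i]);
		}

		/*
		 * The "cleanup at the end": completes the shared packfile
		 * (trailer plus .idx), if one was started at all.
		 */
		unplug_bulk_checkin();
	}

That unplug point is the natural place for deferred cleanup, and it is
exactly where I would like to hang the deferred fsync work from this
series as well.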
---

On a side note, I've also been thinking about how we could use a
packfile approach as an alternative way to get faster addition of
many small objects. It's essentially what you stated above, where
we'd send our little objects into a packfile. But to avoid frequent
repacking overhead, we might want to reuse the 'latest' packfile
across multiple Git invocations by appending objects to it, with an
fsync on the file at the end. We'd need sufficient padding between
objects created by different Git invocations to ensure that
previously synced data doesn't get disturbed by later operations.

We'd need to rewrite the pack indexes each time, but that's at least
derived metadata, so it doesn't need to be fsynced. To make the pack
indexes more incrementally updatable, we might want the fanout table
to be checksummed, with checksummed pointers to leaf blocks. If we
detect corruption during an index lookup, we could recreate the
index from the packfile (a rough sketch of what I'm imagining is in
the P.S. below).

Essentially, the above proposal is to move away from storing loose
objects in the filesystem and instead to index the data within Git
itself.

Thanks,
Neeraj
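P.S. Purely to make the index idea a bit more concrete, here is the
sort of layout I'm imagining. None of these structs or names exist in
Git today; it's only an illustration of "checksummed fanout pointing
at checksummed leaf blocks, rebuild from the pack on any mismatch":

	#include <stdint.h>

	/*
	 * Hypothetical on-disk layout, for illustration only.  Every
	 * pointer out of the fanout carries the checksum of the block
	 * it points at, so a lookup can detect a torn or stale leaf
	 * and fall back to regenerating the whole index from the
	 * packfile (the index is derived data, so it is never fsynced).
	 */
	struct idx_entry {
		unsigned char oid[32];   /* object id */
		uint64_t pack_offset;    /* where the object lives in the pack */
	};

	struct idx_leaf {
		uint32_t nr;
		struct idx_entry entries[]; /* sorted by oid; covered by leaf_crc */
	};

	struct idx_fanout_slot {
		uint64_t leaf_offset;    /* file offset of the leaf block */
		uint32_t leaf_len;
		uint32_t leaf_crc;       /* expected checksum of that leaf */
	};

	struct idx_header {
		uint32_t fanout_crc;             /* checksum over fanout[] */
		struct idx_fanout_slot fanout[256]; /* indexed by oid[0] */
	};

	/*
	 * Lookup outline: verify fanout_crc; pick fanout[oid[0]]; read
	 * and verify the leaf against leaf_crc; binary-search entries[].
	 * On any checksum mismatch, rebuild the index from the packfile
	 * and retry.  Appending objects only rewrites the affected
	 * leaves and the header, without any fsync.
	 */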