Re: [PATCH 1/7] bulk-checkin: rename 'state' variable and separate 'plugged' boolean

On Tue, Mar 15, 2022 at 10:33 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> "Neeraj Singh via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
>
> > From: Neeraj Singh <neerajsi@xxxxxxxxxxxxx>
> >
> > Preparation for adding bulk-fsync to the bulk-checkin.c infrastructure.
> >
> > * Rename 'state' variable to 'bulk_checkin_state', since we will later
> >   be adding 'bulk_fsync_objdir'.  This also makes the variable easier to
> >   find in the debugger, since the name is more unique.
> >
> > * Move the 'plugged' data member of 'bulk_checkin_state' into a separate
> >   static variable. Doing this avoids resetting the variable in
> >   finish_bulk_checkin when zeroing the 'bulk_checkin_state'. As-is, we
> >   seem to unintentionally disable the plugging functionality the first
> >   time a new packfile must be created due to packfile size limits. While
> >   disabling the plugging state only results in suboptimal behavior for
> >   the current code, it would be fatal for the bulk-fsync functionality
> >   later in this patch series.
>
> Sorry, but I am confused.  The bulk-checkin infrastructure is there
> so that we can send many little objects into a single packfile
> instead of creating many little loose object files.  Everything we
> throw at object-file.c::index_stream() will be concatenated into the
> single packfile while we are "plugged" until we get "unplugged".
>

I noticed that you invented bulk-checkin back in 2011, but I don't think your
description matches what the code actually does.  index_bulk_checkin is only
called from index_stream, which is only called from index_fd, and index_fd
takes the index_bulk_checkin path only for large files (512MB by default).
So the effect of the 'plug/unplug' code is to allow multiple large blobs to
go into a single packfile rather than each getting its own separate packfile.
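
To make that concrete, here's a toy model of the call path (the names
mirror the real functions, but the bodies are illustrative stand-ins,
not git's code):

#include <stdio.h>
#include <string.h>

static struct bulk_checkin_state {
	int plugged;
	char packname[64];	/* pack currently receiving objects */
	int nr_written;
} state;

static void finish_bulk_checkin(struct bulk_checkin_state *s)
{
	printf("finalizing %s with %d object(s)\n", s->packname, s->nr_written);
	memset(s, 0, sizeof(*s));
}

/* Stand-in for index_bulk_checkin(): large blobs stream into a pack. */
static void index_bulk_checkin(const char *path)
{
	if (!state.packname[0])
		strcpy(state.packname, "tmp_pack_1");
	printf("streaming %s into %s\n", path, state.packname);
	state.nr_written++;
	if (!state.plugged)	/* unplugged: one pack per large object */
		finish_bulk_checkin(&state);
}

int main(void)
{
	state.plugged = 1;			/* plug_bulk_checkin() */
	index_bulk_checkin("big-blob-1");	/* via index_fd() */
	index_bulk_checkin("big-blob-2");	/* lands in the same pack */
	state.plugged = 0;			/* unplug_bulk_checkin() */
	finish_bulk_checkin(&state);
	return 0;
}

With plugging, both blobs end up in one pack that is finalized at unplug
time; without it, each call would finalize its own pack.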

> My understanding of what you are doing in this series is to still
> create many little loose object files, but avoid the overhead of
> having to fsync them individually.  And I am not sure how well the
> original idea behind the bulk-checkin infrastructure to avoid
> overhead of having to create many loose objects by creating a single
> packfile (and presumably having to fsync at the end, but that is
> just a single .pack file) with your goal of still creating many
> loose object files but synching them more efficiently.
>
> Is it just the new feature is piggybacking on the existing bulk
> checkin infrastructure, even though these two have nothing in
> common?
>

I think my new usage is congruent with the existing API, which seems
to be about combining multiple add operations into a large transaction,
where we can do some cleanup operations once we're finished. In the
preexisting code, the transaction is about adding a bunch of large objects
to a single pack file (while leaving small objects loose), and then completing
the packfile when the adds are finished.
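
As a tiny demonstration of why the patch pulls 'plugged' out of the
struct, consider this toy model (illustrative only, not git's code):
finish_bulk_checkin() zeroes the whole struct when a pack hits its size
limit mid-transaction, which would also wipe an in-struct flag:

#include <stdio.h>
#include <string.h>

struct bulk_checkin_state { int plugged; /* ...pack state... */ };
static struct bulk_checkin_state bulk_checkin_state;
static int bulk_checkin_plugged;	/* the patch's separate flag */

static void finish_bulk_checkin(struct bulk_checkin_state *s)
{
	/* ...write the pack index, rename the pack into place... */
	memset(s, 0, sizeof(*s));
}

int main(void)
{
	bulk_checkin_state.plugged = bulk_checkin_plugged = 1;
	/* pack reaches pack.packSizeLimit mid-transaction: */
	finish_bulk_checkin(&bulk_checkin_state);
	printf("in-struct plugged: %d (lost)\n", bulk_checkin_state.plugged);
	printf("separate plugged:  %d (kept)\n", bulk_checkin_plugged);
	return 0;
}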

---
On a side note, I've also been thinking about how we could use a packfile
approach as an alternative means of achieving faster addition of many small
objects. It's essentially what you stated above, where we'd send our little
objects into a pack file. But to avoid frequent repacking overhead, we might
want to reuse the 'latest' packfile across multiple Git invocations by
appending objects to it, with an fsync on the file at the end.
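
For illustration, here's a rough sketch of that append-with-padding idea
(the layout, constants, and helper name are assumptions of mine, not an
existing git format or API):

#include <fcntl.h>
#include <unistd.h>

#define PAD 4096	/* pad to a block boundary (assumed block size) */

/* Append this invocation's objects after a padded boundary, so bytes
 * synced by earlier invocations are never rewritten, then fsync once. */
static int append_objects(const char *pack, const void *buf, size_t len)
{
	static const char zeros[PAD];
	int fd = open(pack, O_WRONLY | O_CREAT | O_APPEND, 0644);
	off_t end;

	if (fd < 0)
		return -1;
	end = lseek(fd, 0, SEEK_END);
	if (end % PAD && write(fd, zeros, PAD - end % PAD) < 0)
		goto fail;
	if (write(fd, buf, len) < 0 || fsync(fd) < 0)
		goto fail;
	return close(fd);
fail:
	close(fd);
	return -1;
}

int main(void)
{
	const char objs[] = "this invocation's packed objects";
	return append_objects("latest.pack", objs, sizeof(objs)) ? 1 : 0;
}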

We'd need sufficient padding between objects created by different Git
invocations to ensure that previously synced data doesn't get disturbed by
later operations. We'd need to rewrite the pack indexes each time, but that's
at least derived metadata, so it doesn't need to be fsynced. To make the pack
indexes more incrementally-updatable, we might want the fanout table to be
checksummed, with checksummed pointers to leaf blocks. If we detect corruption
during an index lookup, we could recreate the index from the packfile.
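
To sketch what I mean about the index (a hypothetical layout, not git's
existing .idx format): the fanout table and each leaf block would carry
their own checksums, so a lookup can detect a torn block and fall back to
rebuilding the index from the pack:

#include <stdint.h>

struct fanout_entry {
	uint64_t leaf_offset;	/* file offset of the leaf block */
	uint32_t leaf_crc;	/* checksum over that leaf block */
};

struct idx_header {
	uint32_t fanout_crc;		 /* checksum over the fanout table */
	struct fanout_entry fanout[256]; /* first OID byte -> leaf block */
};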

Essentially, the above proposal is to move away from storing loose objects in
the filesystem and instead to index the data within Git itself.

Thanks,
Neeraj


