Re: [RFC PATCH] multi-pack-index: allow operating without pack files

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sun, 2021-08-22 at 20:34 -0400, Derrick Stolee wrote:
> On 8/20/2021 3:55 PM, Johannes Berg wrote:
> > Technically, multi-pack-index doesn't need pack files to exist,
> > but add_packed_git() today checks whether it exists or not.
> 
> Having a multi-pack-index is supposed to indicate that we have
> these objects in the objects/pack directory within the specified
> pack-files.

Yeah, so, like I tried to explain in the patch email, this is
(partially) for a non-git use case.

The way you could think of it is that I'm trying to make a local index
cache, so we know locally which objects we have, but the actual packs
and their contents are stored in some slow/expensive (remote) storage
system.

With the local cache, I'd like to understand whether or not I have a
given object (in particular to avoid storing it again if I see it), and
also I'd like to know _where_ it is (which pack and where in the pack)
so that I can access only parts, instead of downloading everything.

The server isn't necessarily able to run something 'smart', it might
just be S3 blob storage, but storing pack files, not individual objects.

> I understand your goal to relax a condition of the multi-pack-index
> file, but it's triggered by a flag during write and that choice
> isn't persisted into the file. There is no way for a later Git
> process to understand that the multi-pack-index doesn't actually
> guarantee object existence.

It does, since the pack files are recorded. I've tested this briefly,
and it doesn't really seem to have any problems with it.

While I understand the concern, I'd also envision that this tool
command-line option is never set when this is used on normal git
repositories, only when used in this special use case I tried explaining
above.

> And in a completely other side: one would think that including
> a pack-file in the multi-pack-index would allow deleting the .idx
> file, but there are a few reasons why we do not (including
> interactions with third-party tools).

Well, arguably you can always restore it from the pack file, so you can
always delete it ;) Not sure git is set up for that though. Also, even
old versions of git would probably choke since they don't read the
multi-pack-index.

> So, I'm not necessarily on board with this change unless
> something is added to the multi-pack-index file (such as a new
> version 2 and an optional chunk understood by that version) that
> tells future Git processes that the .pack files might not exist.
> I'm still not sure what Git should do about that other than stop
> reading the multi-pack-index and ignore its contents.

It seems to do that today, or at least try to find in the idx if it goes
looking for an object and the multi-pack-index references a pack file
that doesn't exist - and as Martin also pointed out, that really is
necessary for this to work with older versions of git.

> [snip comments on the code]

Fair enough. I think we need to figure out the conceptual question
though - whether or not this is something that's acceptable in git in
the first place.

Like I said, the intent isn't to break git repositories or to even use
this on a git repository that's also normally being read by git (even
though I don't see any problems with that, but my testing wasn't very
deep). The intent (and the reason for this being [RFC]) is to see if git
upstream would be amenable to such a change that lets other tools use
git's index - especially multi-pack-index - machinery to operate on such
a thing as the "index cache" I described above.

In particular, in bup's [1] use case, we currently support our own
"midx" format which is similar to multi-pack-index, but only contains
*object existence*, not *object location*. Thus, there are two problems:

   1. In a normal bup/git repository, when bup is used to write objects
      (it only ever writes packs), then the midx is also updated to
      allow bup itself to check for object existence without opening all
      the *.idx files, allowing it to deduplicate objects efficiently.
      In this case, bup always uses git to retrieve objects pack, and
      this leads to inefficiencies: git will not be able to use the midx
      files (and they don't contain object location anyway), but git
      will open all the *.idx files. If we ask it, it might maintain its
      own multi-pack-index.
      
      Obviously, in this mode, there's no reason to have this patch - we
      can just ask git to create the multi-pack-index and add code to
      bup to understand how to read it in order to check for object
      existence (for deduplication).
      
   2. However, I'm working on adding encrypted repositories to bup, and
      there the objects are stored differently - still in something like
      a pack file, but it's encrypted, in a way that's done at the
      object level.
      
      In this mode, bup maintains an object index in a clear-text cache,
      i.e. all the *.idx files, and currently adds its own midx to be
      able to efficiently check for object existence.
      
      However, in this case, bup cannot use git to access objects, since
      there's no real git directory, only a folder containing the *.idx
      files, the data is "elsewhere".
      
      Since the *.idx file format is still the same (no reason to change
      it), with this patch we can use git to efficiently (in both
      developer time, I don't need to rewrite it, and runtime, since
      it's in C not python) ask git to write the multi-pack-index that
      then bup can use to do both: a) check for object existence during
      further backups while adding new objects, and b) look for object
      location while accessing the repository for reading.

I completely understand that this is a non-git use case, and I'm
entirely happy for you to tell me to get lost, but I figured it was such
a small change I had to try, rather than rewriting the entire multi-
pack-index maintenance code for bup (or importing git's there).

Really for (2) I don't even care about using git's multi-pack-index,
since git never uses it, but I wanted to have the same between (1) and
(2), and if we want git - called on a bup/git repo - to have a multi-
pack-index, then we need git's format, of course.

The other hack I considered was to just create empty *.pack files for
each *.idx file in the index cache because then git is (currently) happy
to accept this, so it never really validates anything there even when
the pack files "exist", but that felt even worse and like something git
might break with just a little more validity checking.


So basically I had choices between:
   A. implement multi-pack-index in bup, to cover both (1) and (2)
   B. implement a new midx format in bup, to cover just (2)
   C. call unmodified git in bup, to cover just (1)
   D. this patch to cover both (1) and (2)
   E. create dummy *.pack files for the *.idx files in the cache, so
      that git will accept it.

Obviously, I can also do any combination of these, e.g. B & C, but this
was the least amount of work ;)


But really, it's up to you, I'm just asking. If you feel this doesn't
fit into git's machinery (despite being so simple), I'll just do
"something else", not sure which one I might do. Quite possibly A., if
only because then I can write tests with "git multi-index-pack verify"
:-)

johannes

[1] https://bup.github.io/, not to be confused with bupstash.io




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux