Re: [PATCH v2 0/4] Introduce a "promisor-remote" capability

On 2024-09-30 at 07:57:17, Patrick Steinhardt wrote:
> But there are still a couple of pieces missing in the bigger puzzle:
> 
>   - How would a client know to omit certain objects? Right now it only
>     knows that there are promisor remotes, but it doesn't know that it
>     e.g. should omit every blob larger than X megabytes. The answer
>     could of course be that the client should just know to do a partial
>     clone by themselves.

It would be helpful to have some sort of protocol v2 feature that says
that a partial clone (of whatever sort) is recommended and let honouring
that be a config flag.  Otherwise, you're going to have a bunch of users
who try to download every giant object in the repository when they don't
need to.
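
To make that concrete, here is a minimal sketch in Python of how a client
might decide which filter to use.  Nothing like this exists in Git today:
the capability name `recommended-filter` and the idea of an opt-in boolean
are assumptions made up for illustration.  Only the `blob:limit=10m`
filter syntax is real, since it is what `git clone --filter` already
accepts.

```python
from typing import Dict, Iterable, Optional


def parse_capabilities(lines: Iterable[str]) -> Dict[str, Optional[str]]:
    """Parse protocol-v2-style capability lines ("key" or "key=value")."""
    caps: Dict[str, Optional[str]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        caps[key] = value if sep else None
    return caps


def choose_filter(
    caps: Dict[str, Optional[str]],
    user_filter: Optional[str],
    honor_recommendation: bool,
) -> Optional[str]:
    """Pick the partial-clone filter to request, or None for a full clone.

    "recommended-filter" is a HYPOTHETICAL capability name, and
    honor_recommendation stands in for a HYPOTHETICAL config flag.
    """
    # An explicit --filter from the user always wins.
    if user_filter is not None:
        return user_filter
    # Otherwise follow the server's hint only if the user opted in.
    if honor_recommendation:
        return caps.get("recommended-filter")
    return None
```

With the flag off, the client behaves exactly as it does now, so the
server's hint is purely advisory.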

Git LFS has the advantage that this is the default behaviour, which is
really valuable.

>   - Storing those large objects locally is still expensive. We had
>     discussions in the past where such objects could be stored
>     uncompressed to stop wasting compute here. At GitLab, we're thinking
>     about the ability to use rolling hash functions to chunk such big
>     objects into smaller parts to also allow for somewhat efficient
>     deduplication. We're also thinking about how to make the overall ODB
>     pluggable such that we can eventually make it more scalable in this
>     context. But that's of course thinking into the future quite a bit.

Git LFS has a `git lfs dedup` command, which takes the files in the
working tree and creates a copy using the copy-on-write functionality in
the operating system and file system to avoid duplicating them.  There
are certainly some users who simply cannot afford to store multiple
copies of their files (say, because their repository is 500 GB), and
this is important functionality for them.

Note that this doesn't work for all file systems.  It does for APFS on
macOS, XFS and Btrfs on Linux, and ReFS on Windows, but not HFS+, ext4,
or NTFS, which lack copy-on-write functionality.
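
For illustration, a copy-on-write clone with a plain-copy fallback can be
sketched in Python as below.  This is not how `git lfs dedup` is
implemented; it only shows the underlying operating system facility on
Linux.  The `FICLONE` value is the ioctl number on common architectures
such as x86 and ARM and is an assumption elsewhere, and `clone_or_copy`
is a name made up for this sketch.

```python
import shutil
import sys

try:
    import fcntl  # not available on Windows
except ImportError:
    fcntl = None

# Linux FICLONE ioctl request number on common architectures (x86, ARM).
# Assumption: other architectures may encode it differently.
FICLONE = 0x40049409


def clone_or_copy(src: str, dst: str) -> bool:
    """Create dst as a copy of src, sharing blocks if the FS allows it.

    Returns True if a copy-on-write clone was made, False if we fell
    back to an ordinary byte-for-byte copy.
    """
    if fcntl is not None and sys.platform.startswith("linux"):
        with open(src, "rb") as s, open(dst, "wb") as d:
            try:
                # Ask the file system to make dst share src's extents.
                fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
                return True
            except OSError:
                # File system without reflink support (e.g. ext4).
                pass
    shutil.copyfile(src, dst)
    return False
```

On a file system without copy-on-write support the function still
succeeds, but the second copy costs full disk space.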

We'd probably need to add an extension for uncompressed objects for
this, since it's a repository format change, but it shouldn't be hard to
do.
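
As for the rolling-hash chunking quoted above, a rough Python sketch of
content-defined chunking looks like the following.  It uses a gear-style
rolling hash, which is my choice for brevity and not necessarily what
GitLab has in mind; the table seed, size limits, and function names are
all assumptions made up for this example.

```python
import hashlib
import random
from typing import List, Set

# Fixed seed so that chunk boundaries are reproducible across runs.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes, min_size: int = 64, max_size: int = 8192,
          bits: int = 10) -> List[bytes]:
    """Split data at positions chosen by a rolling hash of recent bytes."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Gear hash: older bytes are shifted out after 32 steps, so the
        # value depends only on the most recent 32 bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        # Cut when the top `bits` bits are zero, or at the size limit.
        hit = length >= min_size and (h >> (32 - bits)) == 0
        if hit or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def chunk_ids(data: bytes) -> Set[str]:
    """Content addresses of the chunks; equal chunks are stored once."""
    return {hashlib.sha256(c).hexdigest() for c in chunk(data)}
```

Because boundaries depend on content and not on offsets, inserting a byte
near the start of a large blob typically changes only the first chunk or
two, and the remaining chunks can be shared with the old version.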

In Git LFS, it's also possible to share a set of objects across
repositories, although one must be careful not to prune them.  We already
have that through alternates, so I don't think we're lacking anything
there.

>   - Local repositories would likely want to prune large objects that
>     have not been accessed for a while to eventually regain some storage
>     space.

Git LFS has a `git lfs prune` command for this as well.  It does have to
be run manually, though.

> I think chipping away the problems one by one is fine. But it would be
> nice to draw something like a "big picture" of where we eventually want
> to end up at and how all the parts connect with each other to form a
> viable native replacement for Git LFS.

I think a native replacement would be a valuable feature.  An essential
component is going to be a way to handle this gracefully
during pushes, since part of the goal of Git LFS is to get large blobs
off the main server storage where they tend to make repacks extremely
expensive and into an external store.  Without that, it's unlikely that
this feature is going to be viable on the server side.  GitHub doesn't
allow large blobs for exactly that reason, so we'd want some way to
store them outside the main repository but still have the repo think
they were present.

One idea I had about this was pluggable storage backends, which might be
a nice feature to add via a dynamically loaded shared library.  In
addition, this seems like the kind of feature that one might like to use
Rust for, since it probably will involve HTTP code, and generally people
like doing that less in C (I do, at least).

> Also Cc'ing brian, who likely has a thing or two to say about this :)

I certainly have thought about this a lot.  I will say that I've stepped
down from being one of the Git LFS maintainers (endless supply of work,
not nearly enough time), but I am still familiar with the architecture
of the project.
-- 
brian m. carlson (they/them or he/him)
Toronto, Ontario, CA


