On 2024-09-30 at 07:57:17, Patrick Steinhardt wrote:
> But there are still a couple of pieces missing in the bigger puzzle:
>
> - How would a client know to omit certain objects? Right now it only
>   knows that there are promisor remotes, but it doesn't know that it
>   e.g. should omit every blob larger than X megabytes. The answer
>   could of course be that the client should just know to do a partial
>   clone by themselves.

It would be helpful to have some sort of protocol v2 feature that says
that a partial clone (of whatever sort) is recommended and let honouring
that be a config flag. Otherwise, you're going to have a bunch of users
who try to download every giant object in the repository when they don't
need to.

Git LFS has the advantage that this is the default behaviour, which is
really valuable.

> - Storing those large objects locally is still expensive. We had
>   discussions in the past where such objects could be stored
>   uncompressed to stop wasting compute here. At GitLab, we're thinking
>   about the ability to use rolling hash functions to chunk such big
>   objects into smaller parts to also allow for somewhat efficient
>   deduplication. We're also thinking about how to make the overall ODB
>   pluggable such that we can eventually make it more scalable in this
>   context. But that's of course thinking into the future quite a bit.

Git LFS has a `git lfs dedup` command, which takes the files in the
working tree and creates a copy using the copy-on-write functionality in
the operating system and file system to avoid duplicating them. There
are certainly some users who simply cannot afford to store multiple
copies of the file system (say, because their repository is 500 GB), and
this is important functionality for them.

Note that this doesn't work for all file systems. It does for APFS on
macOS, XFS and Btrfs on Linux, and ReFS on Windows, but not HFS+, ext4,
or NTFS, which lack copy-on-write functionality.

We'd probably need to add an extension for uncompressed objects for
this, since it's a repository format change, but it shouldn't be hard to
do.

In Git LFS, it's also possible to share a set of objects across
repositories although one must be careful not to prune them. We already
have that through alternates, so I don't think we're lacking anything
there.

> - Local repositories would likely want to prune large objects that
>   have not been accessed for a while to eventually regain some storage
>   space.

Git LFS has a `git lfs prune` command for this as well. It does have to
be run manually, though.

> I think chipping away the problems one by one is fine. But it would be
> nice to draw something like a "big picture" of where we eventually want
> to end up at and how all the parts connect with each other to form a
> viable native replacement for Git LFS.

I think a native replacement would be a valuable feature. Part of the
essential component is going to be a way to handle this gracefully
during pushes, since part of the goal of Git LFS is to get large blobs
off the main server storage where they tend to make repacks extremely
expensive and into an external store. Without that, it's unlikely that
this feature is going to be viable on the server side. GitHub doesn't
allow large blobs for exactly that reason, so we'd want some way to
store them outside the main repository but still have the repo think
they were present.

One idea I had about this was pluggable storage backends, which might be
a nice feature to add via a dynamically loaded shared library. In
addition, this seems like the kind of feature that one might like to use
Rust for, since it probably will involve HTTP code, and generally people
like doing that less in C (I do, at least).

> Also Cc'ing brian, who likely has a thing or two to say about this :)

I certainly have thought about this a lot. I will say that I've stepped
down from being one of the Git LFS maintainers (endless supply of work,
not nearly enough time), but I am still familiar with the architecture
of the project.
-- 
brian m. carlson (they/them or he/him)
Toronto, Ontario, CA