(Presenter: Christian Couder, Notetaker: Jonathan Nieder)

* Idea: Git LFS has some downsides
  * Not integrated into Git, which is a problem in itself
  * Not easy to change decisions after the fact about which blobs to offload into LFS storage
* So I started work some years ago on multiple promisor remotes as an alternative to Git LFS
  * Works! Requires some pieces
    * Filtering objects when repacking (git repack --filter, due to be merged hopefully soon)
* I'm curious about issues related to Git LFS - what leads people not to use Git LFS and to do things in other, less efficient ways?
* Choices
  * We can discuss details of a demo I worked on a few years ago
  * We can discuss Git LFS, how it works, and how we can do better
* brian: Sounds like this is a mostly server-side improvement. How does this work on the client side, to avoid needing old versions of huge files?
  * Christian: On the client side, you can get those files when you need them (using partial clone), and repack --filter allows you to remove your local copy when you don't need them any more (see the configuration sketch after these notes)
    * There could be more options and commands to manage that kind of removal
* Terry: With multiple promisor remotes, does gc write the large files as their own separate packfiles? What does the setup look like in practice?
  * Christian: You can do that. But you can also use a remote helper to access the remotes where the large files live. Such a cache server can be a plain HTTP server hosting the large files, and the remote helper can know how to do a basic HTTP GET or RANGE request to get that file.
    * It can also work if the separate remote is a Git remote specialized in handling large files.
* Terry: So it can behave more like an LFS server, but as a native part of the Git protocol. How flexible is it?
  * Christian: Yes. Remote helpers can be scripts; they don't need to know a lot of things when they're just being used to get a few objects.
* Jonathan Tan: Is it important for this use case that the server serve regular files instead of git packfiles?
  * Christian: Not so important, but it can be useful, because some people may want to access their large objects in different ways. As they're large, it's expensive to store them; using the same server to store them for all purposes can make things less expensive. E.g. "just stick the file on Google Drive".
* Taylor: In concept, this seems like a sensible direction. My concern would be the immaturity of partial clone client behavior in these multiple-promisor scenarios
  * I don't think we have a lot of these users at GitHub. Have others had heavy use of partial clone? Have there been heavy issues on the client side?
  * Terry: Within the Android world, partial clone is heavily used by users and CI/CD and it's working well.
  * jrnieder: Two qualifications to add: we've been using it with blob filters, not tree filters, and we haven't been using multiple promisor remotes.
* Patrick: What's nice about LFS is that it's able to easily offload objects to a CDN, reducing strain on the Git server itself. We might need a protocol addition here to redirect to a CDN.
  * Jonathan Tan: If we have a protocol addition (a server-side option for blob-only fetch or something), we can use a remote helper to do the appropriate logic, not necessarily involving a Git server
    * The issue, though, is that Git expects packfiles as the way it stores things in its object store.
    * As long as the CDN supports serving packfiles, this would all be doable using current Git.
    * If the file format differs, it may need more work.
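As a rough, hedged illustration of the client-side setup Christian describes (the remote names, URLs, and size threshold are hypothetical, and the filtering repack relies on the `git repack --filter` support that was still in flight at the time of this discussion), the configuration might look something like:

```sh
# Clone without large blobs; they are fetched lazily on demand.
git clone --filter=blob:limit=1m https://git.example.com/project.git
cd project

# Add a second remote dedicated to large objects and mark it as a
# promisor remote, so missing blobs can be fetched from it when needed.
git remote add large-objects https://large.example.com/project.git
git config remote.large-objects.promisor true
git config remote.large-objects.partialclonefilter blob:limit=1m

# Later, drop local copies of the large blobs again
# (relies on the repack --filter support mentioned above).
git repack -a -d --filter=blob:limit=1m
```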
* jrn: Going back to Terry's question on the distinction between this and using an LFS server. One key difference is that with Git LFS, the identifier is not the Git object ID, it's some other hash. Are there any other fundamental differences?
  * Christian: With Git LFS, if some blobs are stored with LFS and you no longer want them stored in LFS, you have to rewrite the history.
    * Using the git object ID gives you that flexibility
  * brian: One thing Git LFS has that Git doesn't is deduping
    * On macOS and Windows, and with btrfs on Linux, having only one underlying copy of the file
    * That's possible because we store the file uncompressed
    * That's a feature some people would like to have some time. Not out of the question to do in Git; it would require a change to how objects are stored in the git object store
* jrn: Is anyone using the demonstrated setup?
  * Christian: Doesn't seem so. It was considered interesting when demoed at GitLab.
* Jonathan Tan: Is the COW (copy-on-write) thing brian mentioned part of what this would be intended to support?
  * Christian: Ultimately that would be possible.
  * brian: To replace Git LFS, you need the ability to store uncompressed objects in the git object store. E.g. game textures. Avoids wasting CPU and lets you use reflinks (the ioctl to share extents).
  * Patrick: Objects need the header prefix to denote the object type.
  * brian: Yes, you'd need the blobs + metadata. That's part of what Git LFS gives us within GitHub: avoiding having to spend CPU on compressing these large objects to serve to the user.
* jrn: Going back to the discussion of multiple promisors. When people turn on multiple promisors by mistake, the level of flexibility has been a problem. This causes a lot of failed/slow requests - git is very optimistic and tries to fetch objects from everywhere. This suggests that the approach Jonathan suggested, where the helper is responsible for choosing where to get objects from, might help mitigate these issues.
  * Christian: yes
* Minh: Can the server say "here are most of the objects you asked for, but these other objects I'd encourage you to get from elsewhere"?
  * Christian: You can configure the same promisor remote on the server. If the client doesn't use the promisor remote and only contacts the main server, the server will contact the promisor remote, get the object, and send it to the client. It's not very efficient, but it works. Another downside is that if this happens, that object from the promisor remote is now also on the server, so you need to remove it if you don't want to keep it there.
  * Minh: It seems someone has to pack the object with the header and compute the git blob ID for it, which is itself expensive (see the sketch below)
  * Christian: If the promisor remote is a regular git server, then yes, the objects will be compressed in git packfile format. But if it's a plain HTTP server and you access it with a helper, it doesn't need to. But of course, if the objects are ever fetched by the main server, then they're in packfile or loose object format there.
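A small sketch of the point Patrick and Minh raise about the object header: a Git blob ID is the hash of a `blob <size>` header (NUL-terminated) followed by the file contents, so whoever turns a plain file into a Git object has to prepend that header and hash the result. For example, in a default SHA-1 repository:

```sh
$ printf 'hello\n' > file
$ git hash-object file
ce013625030ba8dba906f756967f9e9ca394464a
$ { printf 'blob 6\0'; cat file; } | sha1sum
ce013625030ba8dba906f756967f9e9ca394464a  -
```

This is cheap for small files but, as Minh notes, adds up for very large ones, on top of any zlib compression needed for loose-object or packfile storage.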