On Sat, Mar 31, 2012 at 10:19:54AM -0500, Neal Kreitzinger wrote:

> >Note that there are other problem areas with big files that can be
> >worked on, too. For example, some people want to store 100 gigabytes
> >in a repository.
>
> I take it that you have in mind a 100G set of files comprised
> entirely of big-files that cannot be logically separated into smaller
> submodules?

Not exactly. Two scenarios I'm thinking of are:

  1. You really have 100G of data in the current version that doesn't
     compress well (e.g., you are storing your music collection). You
     can't afford to store two copies on your laptop (because you have
     a fancy SSD, and 100G is expensive again). You need the
     working-tree version, but it's OK to stream the repo version of a
     blob from the network when you actually need it (mostly
     "checkout", assuming you have marked the file as "-diff").

  2. You have a 100G repository, but only 10G in the most recent
     version (e.g., because you are doing game development and storing
     the media assets). You want your clones to be faster and to take
     less space. You can do a shallow clone, but then you're never
     allowed to look at old history. Instead, it would be nice to
     clone all of the commits, trees, and small blobs, and then stream
     large blobs from the network as needed (again, mostly
     "checkout").

> My understanding is that a main strategy for "big files" is to
> separate your big-files logically into their own submodule(s) to keep
> them from bogging down the not-big-file repo(s).

That helps people who want to work on the not-big parts by not forcing
the big parts on them (another solution would be partial clone, but
more on that in a minute). But it doesn't help people who actually
want to work on the big parts; they would still have to fetch the
whole big-parts repository.

For splitting the big-parts people from the non-big-parts people,
there have been two suggestions: partial checkout (you have all the
objects in the repo, but only check out some of them) and partial
clone (you don't have some of the objects in the repo at all).

Partial checkout is a much easier problem, as it is mostly about
marking index entries as "do not bother to check this out, and pretend
that it is simply unmodified" (there is a sketch of the existing bits
for this below). Partial clone is much harder, because it violates
git's usual reachability rules. During a fetch, a client will say "I
have commit X", which the server can then assume means the client has
all of the ancestors of X, and all of the trees and blobs referenced
by X and its ancestors. But if a client can instead say "yes, I have
these objects, but I just don't want to get them because it's
expensive", then partial checkout is sufficient for the rest: the
non-big-parts people will clone, omitting the big objects, and then do
a partial checkout (to avoid fetching those objects even once).

Note that some protocol extension is still needed for the client to
tell the server "don't bother including objects X, Y, and Z in the
packfile; I'll get them from my alternate big-object repo". That can
either be a list of objects, or it can simply be "don't bother with
objects bigger than N" (one possible shape for this is sketched
below).

> >Because git is distributed, that means 100G in the repo database,
> >and 100G in the working directory, for a total of 200G.
>
> I take it that you are implying that the 100G object-store size is
> due to the notion that binary files cannot-be/are-not compressed
> well?

In this case, yes. But you could easily tweak the numbers to be 100G
and 150G. The point is that the data is stored twice, and even the
compressed version may be big.
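To make the checkout-side mechanics concrete: the "-diff" marking
above is plain gitattributes, and the "pretend that it is simply
unmodified" flag already exists as the skip-worktree bit; the missing
piece is tooling around them, not the flags themselves. A minimal
sketch (the paths here are invented for illustration):

  # do not try to diff (or pull into memory) the big media files
  $ echo 'assets/*.mov -diff' >>.gitattributes

  # a crude partial checkout: keep the index entry, but tell git to
  # pretend the working-tree file matches the index
  $ git update-index --skip-worktree assets/intro.mov

  # entries carrying the skip-worktree bit show up here as "S"
  $ git ls-files -v assets/intro.mov
  S assets/intro.mov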
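On the protocol side, to make "don't bother with objects bigger than
N" concrete, the client end of such an extension might look something
like this (hypothetical syntax, invented for illustration):

  # hypothetical: ask the server to omit blobs larger than 10MB from
  # the pack; the client promises to stream them from its big-object
  # store as it actually needs them
  $ git clone --filter=blob:limit=10m git://example.com/game.git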
> >People in this situation may want to be able to store part of the
> >repository database in a network-accessible location, trading some
> >of the convenience of being fully distributed for the space savings.
> >So another project could be designing a network-based alternate
> >object storage system.
>
> I take it you are implying a local area network with users' git repos
> on workstations?

Not necessarily. Obviously if you are doing a lot of active work on
the big files, the faster your network, the better. But it could work
at the internet scale, too, if you don't actually fetch the big files
frequently. So part of a scheme like this would be making sure we
avoid accessing big objects whenever we can; in practice, this is
pretty easy, as git already tries to avoid accessing objects
unnecessarily, because doing so is expensive even on the local end.

You can also cache a certain number of fetched objects locally.
Assuming there is some locality in the objects you ask for (e.g.,
because you are doing "git checkout" back and forth between two
branches), this can help.

> Some setups log in to a linux server and have all their repos there.
> The "alternate objects" does not need to be network-based in that
> case. It is "local", but local does not mean 20 people cloning the
> alternate objects to their workstations. It means one copy of
> alternate objects, and twenty repos referencing that one copy.

Right. This is the same concept, except over the network, so that
people's working repositories are on their own workstations instead of
a central server. You could even do it today by network-mounting a
filesystem and pointing your alternates file at it (there is an
example of this below).

However, I think it's worth making git aware that the objects are on
the network, for a few reasons:

  1. Git can be more careful about how it handles the objects,
     including when to fetch, when to stream, and when to cache. For
     example, you'd want to fetch the manifest of available objects
     and cache it in your local repository, because you want fast
     lookups of "do I have this object?".

  2. Providing remote filesystems on an Internet scale is a management
     pain (and it's a pain for the user, too). My thought was that
     this would be implemented on top of http (the connection-setup
     cost is negligible, since these objects would generally be
     large).

  3. Usually alternate repositories are full repositories that meet
     the connectivity requirements (so you could run "git fsck" in
     them). But this is explicitly about taking just a few
     disconnected large blobs out of the repository and putting them
     elsewhere. So it needs a new set of tools for managing the
     upstream repository.
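For reference, the "do it today" version is just the ordinary
alternates mechanism; a minimal sketch (the host and paths are
invented for illustration):

  # network-mount the shared object store (NFS, sshfs, etc.)
  $ sshfs bigserver:/srv/big-objects.git/objects /mnt/big-objects

  # tell the local repository to look there for any object it lacks
  $ echo /mnt/big-objects >>.git/objects/info/alternates

The three points above are essentially about why teaching git to do
this natively over http should behave better than such a network
mount.

-Peff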