Christian Couder <christian.couder@xxxxxxxxx> writes:

> After cloning with --filter=<filter-spec>, for example to avoid
> getting unneeded large files on a user machine, it's possible
> that some of these large files still get fetched for some reasons
> (like checking out old branches) over time.
>
> In this case the repo size could grow too much for no good reason and a
> way to filter out some objects would be useful to remove the unneeded
> large files.

Makes sense.  If we repack without these objects, when the repository
has a promisor remote, we should be able to rely on that remote to
supply them on demand, once we need them again, no?

> Deleting objects right away could corrupt a repo though,...

Hmph, could you elaborate why that is the case?  Isn't it the whole
point of having a promisor remote and using a lazy clone with the
--filter option that objects which _ought_ to exist from connectivity's
point of view _are_ allowed to be missing, because the promisor
promises to make them available on demand?

    Side note: I think I know the answer.  While trying to remove
    UNNEEDED large files, we may carelessly discard NEEDED large
    files as well (e.g. a file may have been created locally and
    not yet pushed back).  But (1) if that is the problem, perhaps
    we should be more careful in the first place?  And (2) if it is
    inherently impossible to tell reliably which ones are unneeded,
    then the reason why it is impossible, and the reason why "sift
    the objects into two bins, one that we _think_ is unneeded and
    another for the rest, and then verify that what we _thought_
    was unneeded is all available from the promisor remote" is the
    best we can do, must be described, I think.

> ... so it might be
> better to put those objects into a separate packfile instead of
> deleting them. The separate pack could then be removed after checking
> that all the objects in it are still available on a promisor remote it
> can access.

Surely, sifting the objects into two bins (i.e. those that we would not
have received if we cloned from the promisor remote just now, which are
prunable, and those that we cannot afford to lose because the promisor
remote would not have them, e.g. objects we created ourselves and have
not pushed to the remote yet) without removing anything would be safe.
But if the result of such sifting must be verified afterwards, doesn't
that indicate that the sifting step itself was buggy or misdesigned?
It does not sound like a very good justification for saving them in a
separate packfile.  It does smell somewhat similar to cruft packs, but
not really (the choice over there is between exploding to loose objects
and keeping them in a pack, and it never involves loss of objects).

> Also splitting a packfile into 2 packs depending on a filter could be
> useful in other usecases. For example some large blobs might take a lot
> of precious space on fast storage while they are rarely accessed, and
> it could make sense to move them in a separate cheaper, though slower,
> storage.

This one, outside the context of a partial clone client, does make tons
of sense.

I guess what I suspect is that this option, while it would be very
useful for the "other usecases" scenario above, may not become all that
useful in the "our lazy clone got bloated and we want to trim objects
we know we can retrieve from the promisor remote again if necessary"
scenario, until the repack machinery learns to use an extra piece of
information (namely "these are objects that we can fetch from the
promisor remote") at the same time.
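To be concrete about what the splitting itself would give us, here is a
rough approximation using only commands that already exist; the blob
size threshold and the pack base names below are placeholders I made up
for illustration, and the series presumably does the equivalent inside
pack-objects itself rather than in the shell:

    # one traversal that lists both the objects that pass the filter
    # and, prefixed with "~", the ones the filter omits
    git rev-list --objects --all \
        --filter=blob:limit=1m --filter-print-omitted >object-list

    # objects that pass the filter go into the "slim" pack
    grep -v '^~' object-list |
        git pack-objects .git/objects/pack/pack-slim

    # filtered-out objects go into the separate pack
    grep '^~' object-list | sed 's/^~//' |
        git pack-objects .git/objects/pack/pack-filtered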
> This commit implements a new `--filter=<filter-spec>` option in
> `git repack` that moves filtered out objects into a separate pack.
>
> This is done by reading filtered out objects from `git pack-objects`'s
> output and piping them into a separate `git pack-objects` process that
> will put them into a separate packfile.

So, for example, you may say "no blobs" in the filter, and while packing
the local repository with that filter, resulting in a pack that excludes
all blobs, we will learn which blob objects we did not pack into that
packfile.  We can pack them into a separate one; most of these blobs are
ones we could retrieve again from the promisor remote, but some of them
are ones we created locally ourselves and haven't pushed back to the
promisor remote yet.  Now what?

My earlier suspicion that this mechanism may not be all that useful for
the "slim a bloated lazy clone" scenario comes from the fact that I
cannot think of a good answer to this "Now what?" question---my naive
solution would involve enumerating the objects in that "separate
packfile", which is a mixture of precious ones and expendable ones,
learning which ones are precious, and then creating a new pack that is
a subset of that "separate packfile" containing only the precious ones.
But if I had to do that anyway, I do not think we would need this new
mechanism, which seems to go only half-way.
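Spelled out, that naive solution might look something like the sketch
below.  It is only a sketch: the pack names are placeholders, "origin"
stands for the promisor remote, and the assumption that everything
reachable from its remote-tracking refs can be re-fetched from it is an
optimistic one, not a real verification against the remote:

    # enumerate the objects in the separate pack (placeholder name)
    git show-index <.git/objects/pack/pack-filtered.idx |
        awk '{print $2}' | sort >in-pack

    # objects we optimistically believe the promisor remote has:
    # everything reachable from its remote-tracking refs
    git rev-list --objects --missing=allow-any --remotes=origin |
        cut -d' ' -f1 | sort >on-remote

    # keep only the precious ones and pack just those
    comm -23 in-pack on-remote |
        git pack-objects .git/objects/pack/pack-precious

And the middle step---deciding which objects we believe the remote can
still give back to us---is exactly the "extra piece of information" I
said above that the repack machinery does not yet know how to use.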