On Sun, Jul 22, 2018 at 07:48:31AM +0200, Christian Couder wrote: > I kept Peff as the author and took the liberty to add his > Signed-off-by to all the patches. Yes, that's fine. This is a topic I've been meaning to upstream for a long time, so I'm happy to see it progressing in that direction. A few reasons I've been hesitant: - this produces worse packs for people cloning. The packs we serve at GitHub for heavily-forked repositories can be significantly larger than what you could get by repacking in isolation. I think this is inherent in the scheme (we're losing some delta opportunities). But I think it's also made worse because the delta window gets clogged with candidates that are forbidden by the island config. Repacking with a big --window helps (and doesn't take as long as it otherwise might because we can reject some object pairs based on islands before doing any computation on the content). I also haven't done much modeling on how this scheme progresses over time. Imagine users tend to create forks at random points in time and then let them get stale. You're going to accumulate a bunch of islands that increasingly don't overlap with each other. And as time goes on, your delta opportunities get worse and worse. Which isn't to say the scheme doesn't work to some degree. It has been running in production at GitHub for several years. But it's very cobbled together, and I had always hoped to give it some more careful thought before sending it upstream. ;) That may be "the perfect is the enemy of the good" though (or at least the enemy of the mediocre). - the whole regex island-naming configuration feels complicated and hacky to me. I haven't been able to come up with a better solution, though. - the whole islandCore thing is it's own weirdness (more below) > The patches look good to me except perhaps that "pack.islandCore" is > not documented. Maybe something could be added in > Documentation/technical/ too, or maybe improving commit messages could > be enough. So basically what islandCore does is make one island "special", and we write all of its objects out first in the packfile. The idea here was that it would interact well with the pack-reuse code which tries to send the early part of the pack out verbatim. And if you could pick the fork which is most-often cloned, then that would be a win (so you'd pick torvalds/linux.git here). But: - the pack-reuse code we have upstream is crap; it doesn't kick in for this instance anyway, because it's too concerned with whether you're sending all of the pack. So if you have a 10GB pack and the "core" island is the entire first 1GB, you should be able to ship out that first GB easily. But the code says "well, that's only 10%!", which is not enough to trigger. I have replacement code (which we have been running in production) that is more clever about the threshold, and also can handle gaps in the continuity (so we might realize we need to send objects 1-5000, then skip a few, then 5037-8000, and so on). And after that, it's not at all clear to me if the islandCore thing is actually still helpful. - it's hard to configure, because it may actually change over time (whereas all the other island config is basically static and just impacts how we consider the refs that are there) - there may actually be _multiple_ forks that are "special". In torvalds/linux, for example, there's a fair bit of hierarchy, and there are people who forked Linus and have their own active sub-community. The actual island concept handles this kind of layering OK, but the islandCore thing is pretty static. So I'm not sure if islandCore is even needed or not. But somebody would have to do some experiments to figure it out (and obviously I should share the replacement pack-reuse patches). -Peff