On Sun, Jul 10, 2016 at 12:48:13AM +0000, Eric Wong wrote:

> Very much a work-in-progress, but NNTP and HTTP/HTTPS sorta work
> based on stuff that is on gmane and stuff I'm accumulating by
> being a subscriber.

I checked this out when you posted it, and have been using it for the
past few weeks. I really like it. I find the URL structure much easier
to navigate than gmane's.

I do find it visually a little harder to navigate through threads,
because there's not much styling there, and the messages seem to run
into one another. I don't know if a border around the divs or something
would help. I'm really terrible at that kind of visual design.

> HTTP URLs are clonable, but I've generated the following fast-export dump:
>
>   https://public-inbox.org/.temp/git.vger.kernel.org-6c38c917e55c.gz
>   (362M)
> [...]
> In contrast, bundles and packs delta poorly and only get down
> around 750-800M with aggressive packing

I pulled this down. It is indeed rather huge, and git doesn't perform
all that well with it. All the usual "git is not a database" caveats
apply, I think.

I noticed in particular that traversing the object graph is _really_
slow. This is very sensitive to the "branchiness" of the tree. I notice
that you use a single level of hashing (e.g., d4/9a37e4974...). Since
there are almost 300K messages, the average second-level tree has over
1000 entries in it, and each commit changes exactly one entry. So what
happens during a traversal is that we see some tree A, look at all of
its entries, and see each of its blobs. Then we see A', the same tree
with one entry different, and we still have to walk each of those
thousand entries, looking up each one in a hash only to find "yep, we
already saw that blob". Whereas if your tree is more tree-like (rather
than list-like), you can cull unchanged sub-trees more frequently. The
tradeoff, though, is the extra overhead of storing the sha1s for the
extra level of tree indirection.

Here are some timing and size results for various incarnations of the
packfile. The sizes come from:

  git cat-file --batch-all-objects \
      --batch-check='%(objectsize:disk) %(objecttype)' |
  perl -lne '
    /(\d+) (.*)/;
    $count{$2}++;
    $size{$2} += $1;
    END { print "$size{$_} ($count{$_}) $_" for sort(keys(%count)) }
  '

And the timings are just "git rev-list --objects --all".

Here are the initial sizes after fast-import:

  536339725 (291113) blob
   63767736 (291154) commit
  929164567 (582290) tree

Yikes, fast-import does a really terrible job of tree deltas (actually,
I'm not even sure it finds tree deltas at all).

Notice that the blob contents are bigger than the fast-import stream
(which contains all of those contents!). That's unfortunate, but it
comes from the fact that we zlib deflate the objects individually,
whereas the fast-import stream was compressed as a whole, so the common
elements between the emails get a really good compression ratio.

There was discussion a long time ago about storing a common zlib
dictionary in the packfile and using it for all of the objects. I don't
recall whether there were any patches, though. It does create some
complications with serving clones and fetches, as clients may ask for
only a subset of the objects (so you have to send them the whole
dictionary, which may be a lot of overhead if they're fetching just a
few objects).

Anyway, here are the numbers after an aggressive repack:

  628307898 (291113) blob
   63209416 (291154) commit
   44342440 (582290) tree

Much better trees. Ironically, the blobs got worse. I think there are
just too many with similar names and sizes for our heuristics to do a
good job of finding deltas.

Here's what running rev-list looks like:

  real    6m4.933s
  user    6m4.124s
  sys     0m0.616s

Yow, that's pretty painful. Without bitmaps, that's an operation that
every single clone would need to run.

Here's what it looks like with an extra level of hashing (so storing
"12/34/abcd..." instead of "12/34abcd..."):

  628308433 (291113) blob
   63207951 (291154) commit
   60654550 (873339) tree

We're storing a lot more trees, and spending an extra 16MB on tree
storage. But here's the rev-list time:

  real    0m55.120s
  user    0m55.016s
  sys     0m0.096s

I didn't try adding an extra level of hashing on top of that (i.e.,
"12/34/ab/cd..."). It might help, but I suspect it would be diminishing
returns versus the cost of accessing the extra trees.
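In case anyone wants to reproduce that two-level layout, filtering the
fast-export dump is probably the easiest route. Here's a naive sketch
(the repository name is arbitrary, and the regex assumes the paths
follow the 2/38 pattern above and appear only in filemodify and
filedelete lines; a robust filter would skip over "data" blocks rather
than regexing the raw stream):

  # split "12/34abcd..." paths into "12/34/abcd..." while replaying
  # the dump into a fresh repository
  git init --bare two-level.git
  zcat git.vger.kernel.org-6c38c917e55c.gz |
  perl -pe 's{^(M \d+ \S+ |D )([0-9a-f]{2}/[0-9a-f]{2})}{$1$2/}' |
  git --git-dir=two-level.git fast-import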
The other thing that would probably make a big difference is avoiding
the one-commit-per-message pattern. The commit objects aren't that big,
but each one involves two new trees (one with ~1000 entries, and one
with 256 entries). If you batched the messages into blocks of, say, 10
minutes, that would drop the number of commits by half. Which I
computed with:

  git log --reverse --format=%at |
  sort -n |
  perl -lne '
    if (!@block) {
      @block = ($_);
    } else {
      my $diff = $_ - $block[0];
      if ($diff >= 0 && $diff < 600) {
        push @block, $_;
      } else {
        print join(" ", @block);
        @block = ($_);
      }
    }
    END { print join(" ", @block) }
  '

Of course that means your mirror lags by up to 10 minutes. And you lose
the cool property of "git log --author=peff", though of course that
information is redundant with what is in the blobs. I haven't looked at
the public-inbox code, but I would imagine it mostly operates on the
tip tree.

If you're willing to give up the cool commits, we could also just
squash the whole archive into a single base commit and start building
from there. We'd run into problems in another 10 years, I guess, but it
would be pretty efficient to start with, at least. :)

> Additional mirrors or forks (perhaps different UIs) are very welcome,
> as I expect none of my servers or network connections to be reliable.

I'm tempted to host a mirror at GitHub, but I'm wary of the Git
storage. I don't think it really scales all that well. Bitmaps help
with the cost of a clone, but they're not magic. We still have to do
traversals for a lot of operations (including repacks).
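For completeness: bitmaps are at least cheap to experiment with here,
and nothing about them is public-inbox specific; a sketch with stock
git options:

  # have future full repacks write a reachability bitmap
  git config repack.writeBitmaps true
  git repack -ad

  # or request one directly for a single repack
  git repack -adb

That mostly helps pack-objects enumerate reachable objects when serving
clones and fetches (rev-list can use it too, but only for certain
queries via --use-bitmap-index); it doesn't change the plain rev-list
numbers above.

-Peff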