On Mon, 8 May 2006, Jeff King wrote:
> 
> On Sun, May 07, 2006 at 08:27:02AM -0700, Linus Torvalds wrote:
> > factor for a lot of things for many "common" filesystem setups. You
> > probably didn't even account for the size of inodes in your "du" setup.
> 
> My numbers came from git-count-objects, which uses the st_blocks sum for
> all objects. The actual du numbers showing space wasted by block
> boundaries are:
> 
>   du -c ??:                 1429216
>   du -c --apparent-size ??:  792277
> 
> So it's about 45% wasted space.

And that's actually ignoring inode sizes and directory sizes (well, it doesn't "ignore" directory sizes - it counts them - but if you compare it to a straight packed format, it's still overhead).

Anyway, it looks like it's about 2:1, not 3:1 like I claimed, but the point is that blocking factors tend to be at least on the same order of magnitude as plain compression (which also tends to be in the 2:1 area for normal, fairly easily compressible stuff).

The delta-packing win is obviously much bigger for any project with real history. In traditional setups (where you always delta-pack within one thing, ie at the level of individual SCCS/RCS files), the delta-packing obviously _also_ avoids blocking issues, since it means that a thousand revisions of the same file will all share the same inode.

So because git uses a whole-file model, the blocking issues with its unpacked format are _much_ higher than for any traditional medium - there is no conglomeration of different versions of a file in the same filesystem object. On the other hand, the packed format also tends to be even _more_ efficient than a traditional one, so the end result of it all is apparently a pretty big net win even in space consumption.

Side note: I realize that some people think the packs are ugly and strange. They aren't linear versions of a file, and instead appear as a fairly random "jumble". And they can't be incrementally re-packed: you have to generate a whole new pack-file (which can be incremental in _content_, of course). So people think they are ugly.

I'd argue that they are beautiful. They are beautiful because they _don't_ contain history in themselves (the objects they contain encode the history, of course, but the pack-file itself does not). And they are beautiful because we can use the exact same format for streaming data over the network as for the database itself (that, of course, was just about _the_ design consideration). Show me another system that has exactly the same (not "similar", not "same concepts": _same_) network protocol as its internal database.

And they are beautiful exactly because their lack of any internal structure allows you to pack things by criteria _you_ care about, ie the whole "sort things by recency" thing, so that commonly accessed data can be packed at the head of the pack-file - exactly because the pack-file doesn't have any internal structure of its own that you need to worry about and that constrains your sorting. The same thing is what allows you to delta any blob against any other blob, without worrying about history or other random pack-file rules. You can do packing purely by how well you want to pack, not by any secondary constraints.

And the "no incremental updates" may sound like a huge downside, but it's all the same basic git logic: objects and filesystem contents are immutable, and that allows us to avoid a lot of locking overhead. Locking is _hard_. Locking is _inefficient_. And locking really, really screws you when you miss it.
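To reproduce Jeff's blocking-factor numbers on your own repository, something along these lines should work (this assumes GNU du, since --apparent-size is a GNU extension, and a repository that still has plenty of loose objects):

    # git's own summary of the loose objects (sizes come from st_blocks)
    git count-objects -v
    # allocated size vs actual file size of the loose objects
    du -sc .git/objects/?? | tail -1
    du -sc --apparent-size .git/objects/?? | tail -1

The difference between the two du totals is the space lost purely to block rounding (inode and directory overhead still comes on top of that).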
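To see the "same format on the wire as on disk" point in the flesh, here is a rough sketch (just one way of poking at it; the file name is arbitrary):

    # produce the pack stream for everything reachable from HEAD;
    # this is the same byte stream the native protocol sends
    git rev-list --objects HEAD | git pack-objects --stdout > wire.pack
    # the receiving side only has to build an index for those bytes
    git index-pack wire.pack

For large enough transfers, the pack that a fetch leaves under .git/objects/pack/ is exactly the stream that came over the network, kept as-is.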
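And "generate a whole new pack-file" in practice is just a repack; as a sketch (the window/depth numbers here are made up, they are pure tuning knobs):

    # write one brand-new pack with everything in it, then drop the packs
    # and loose objects it made redundant; nothing is modified in place
    git repack -a -d
    # the delta search is only a packing heuristic, so you can crank the
    # knobs (or re-do all deltas with -f) without affecting correctness
    git repack -a -d -f --window=50 --depth=50

The old packs stay valid right up until the new one has completely replaced them, which is exactly why no locking is needed.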
So I'll happily say that pack-files are strange, and that you have to get a bit used to the notion that they should be repacked "asynchronously". But it's really a matter of "getting used to it", because once you do, you'll see that it's actually an absolutely huge deal, and you'll learn to love the bomb^H^H^H^Hpack-file.

		Linus "pack-files rule" Torvalds