Let me start by echoing Junio's remark... "lacks sufficient justification". You don't give enough evidence to support even why it is worth looking at this commit, let alone why it should be included and cause a format change in the idx file format. At some point you start to hand-wave about how it may benefit my bitmap reachability index idea, but you don't seem to fully understand that idea. And some of what you do here sounds like it actively hurts doing the bitmap reachability project. So again this doesn't really help. Colby is nearly done prototyping the bitmap reachability implementation in JGit and will release the code under the BSD license there soon. I can't promise when yet because Colby will soon be heading out for some (much deserved) vacation time, and doesn't want to publish something until he is ready to stand behind the code. So it might not show up until mid-September. But I think its worth giving him a few weeks to finish getting the code ready, vs. rushing something in that someone else thinks might help. We have waited more than 6 years or whatever to improve packing. Colby's experiments are showing massive improvements (40s enumeration cut to 100ms) with low disk usage increase (<10%) and no pack file format changes. Its OK to wait 4 more weeks. On Mon, Aug 13, 2012 at 12:40 AM, Junio C Hamano <gitster@xxxxxxxxx> wrote: > Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> writes: > >> On Mon, Aug 13, 2012 at 2:49 AM, Junio C Hamano <gitster@xxxxxxxxx> wrote: >>> For example, the reachability bitmap would want to say something >>> like "Traversing from commit A, these objects in this pack are >>> reachable." The bitmap for one commit A would logically consist of >>> N bits for a packfile that stores N objects (the resulting bitmap >>> needs to be compressed before going to disk, perhaps with RLE or >>> something). With the single "sorted by SHA-1" table, we can use the >>> index in that single table to enumerate all reachable objects of any >>> type in one go. With four separate tables, on the other hand, we >>> would need four bitmaps per commit. >> >> No we still need one per commit. The n-th bit is in the order of the >> object in the pack, not the index. How sha-1 is sorted does not >> matter. Yes, for RLE encoding to work well it is important that the bits are assigned using pack offset ordering, not SHA-1 ordering. The object at bit 0 is the object at offset 12 in the pack file, not the object that has the lowest SHA-1. The RLE encoding is relying on the packer to store objects closely reachable by time near each other in the pack file. Which the packer already does because we have roughly proven though the years that this ordering works sufficiently well enough for nobody to propose another ordering. :-) > Then what is the problem of using that same "N" that counts the > object using the order of the objects in the pack as a short-hand to > the object name instead, if your objective is to avoid having to > repeat the full 20-byte object name in your secondary tables? That > does not call for any split in the list of SHA-1s, no? Exactly. There is no need to store a second copy of the SHA-1s anywhere. We can discover information about object identified by bit N by creating the reverse pack index in memory. The N-th entry in the reverse index will tell us the Y-th position in the normal idx file, which gives us the SHA-1. So SHA-1 can be obtained from a bitmap in constant time, once the reverse index is built. The reverse index build costs O(N log N) time, but its already required by the packer to do efficient reuse of packed objects. We already pay this O(N log N) penalty. So its currently free to us. The bitmap approach Colby and I are collaborating on also contains 4 RLE bitmaps indicating what type each object is. Its needed so the packer can avoid looking up type data in the base pack file, while also avoiding parsing of trees where the mode information normally provides the type hint. We could instead get this by splitting the SHA-1 table into 4 tables as this patch proposes, but it doesn't help. Type information can be obtained fast enough from these 4 RLE bitmaps, without muddying the read code path with needing to know what table to search, or needing to search 4 tables rather than 1 table. >>> Either way is _possible_, but I think the former is simpler, and the >>> latter makes it harder to introduce new types of objects in the >>> future, which I do not think we have examined possible use cases >>> well enough to make that decision to say "four types is enough >>> forever". >> >> New types can be put in one of those four tables, depending on its >> purpose. No. A new type should get a new table. We should never cram a new type into an existing type table just because it sounds like a good idea. Its not. We are unlikely to add a new type anytime soon. If we do its going to take more work than just worrying about where it fits into an idx file. But it shouldn't be combined into a bucket with an existing type. >> The reason I split because I care particularly about commits >> and trees. If the new type serves the same purpose as tree, for >> example, then it's better put in tree table... It isn't worth speculating about this. We aren't likely to add a new type. If we do its probably not like another type. > And when there are multiple uses, one that wants to treat trees and > commits more or less the same way, and another that wants to treat > them in a distinct way, even without considering a new type, how > would your "we have four separate tables" help? The former user > wants three "commits+trees", "tags" and "blobs" table while the > latter wants four, no? And this paragraph alone indicates why splitting the SHA-1 table is a stupid idea. Readers who have a SHA-1 in hand shouldn't have to dig around in 4 different tables trying to figure out what it wants. Or maybe 3 tables if it thinks its tree-ish. Its just a stupid discussion to have. At this level of the system the object database is a database keyed by SHA-1s. Simple. Don't muddle it. >>> In either way, we would have such bitmap (or a set of four bitmaps >>> in your case) for more than one commit (it is not necessary or >>> desirable to add the reachability bitmap to all commits), and such a >>> "reachability extension" would need to store a sequence of "the >>> commit object name the bitmap (or a set of four bitmaps) is about, >>> and the bitmap (or set of four bitmaps)". That object name does not >>> have to be 20-byte but would be a varint representation of the >>> offset into the "sorted by SHA-1" table. Yes. >> How do you reach the bitmap, given its commit sha-1? > > Look at the ordered list of SHA-1 in the pack idx file as an array > of 20-byte entries, and find the commit SHA-1 in that array by > Newton-Raphson/bisection. You have an index into the array that > holds the object name. You can use it as the local object number > in the packfile. Then you can find that object number (that is > shorter than 20-byte object name) in your secondary table, no? Yes, exactly. But I am trying to steer Colby into a slightly different direction. For ~222k commits we need only about 2k reachability bitmaps. If these are keyed by N-th SHA-1 as Junio states, we can lookup the object in the object map and mark a flag on it saying "reachability bitmap exists on this commit". Later if the revision walker hits a commit with this flag set, we know we should look at that bitmap information. Allocating 2k commits into the revision walking map is fairly cheap, and makes it fast for the walker to later identify where a bitmap exists. But I don't have any data yet to prove what the fastest way to manage finding the right bitmaps is. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html