On Wed, Jun 26, 2013 at 7:45 PM, Jeff King <peff@xxxxxxxx> wrote: > > In particular, it seems like the slowness we saw with the v1 bitmap > format is not what Shawn and Colby have experienced. So it's possible > that our test setup is bad or different. Or maybe the C v1 reading > implementation had some problems that are fixable. It's hard to say > because we haven't shown any code that can be timed and compared. Right, the format and implementation in JGit can do "Counting objects" in 87ms for the Linux kernel history. But I think we are comparing apples to steaks here, Vincent is (rightfully) concerned about process startup performance, whereas our timings were assuming the process was already running. It would help everyone to understand the issues involved if we are at least looking at the same file format. > And the pack-order versus idx-order for the bitmaps is still up in the > air. Do we have numbers on the on-disk sizes of the resulting EWAHs? I did not see any presented in this thread, and I am very interested in this aspect of the series. The path hash cache should be taking about 9.9M of disk space, but I recall reading the bitmap file is 8M. I don't understand. Colby and I were very concerned about the size of the EWAH compressed bitmaps because we wanted hundreds of them for a large history like the kernel, and we wanted to minimize the amount of memory consumed by the bitmap index when loaded into the process. In the JGit implementation our copy of Linus' tree has 3.1M objects, an 81.5 MiB idx file, and a 3.8 MiB bitmap file. We were trying to keep the overhead below 10% of the idx file, and I think we have succeeded on that. With 3.1M objects the v2 bitmap proposed in this thread needs at least 11.8M, or 14+% overhead just for the path hash cache. The path hash cache may still be required, Colby and I have been debating the merits of having the data available for delta compression vs. the increase in memory required to hold it. > The > pack-order ones should be more amenable to run-length encoding, > especially as you get further down into history (the tip ones would > mostly be 1's, no matter how you order them). This is also true for the type bitmaps, but especially so in the JGit file ordering where we always write all trees before any blobs. The type bitmaps are very compact and basically amount to defining a single range in the file. This takes only a few words in the EWAH compressed format. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html