On Mon, Mar 26, 2012 at 08:25, Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> wrote:
> On Mon, Mar 26, 2012 at 9:28 PM, Thomas Rast <trast@xxxxxxxxxxxxxxx> wrote:
>> elton sky <eltonsky9404@xxxxxxxxx> writes:
>>
>>> On Mon, Mar 26, 2012 at 12:06 PM, Nguyen Thai Ngoc Duy
>>> <pclouds@xxxxxxxxx> wrote:
>>>> (I think this should be on git@vger as there are many experienced devs there)
>>>>
>>>> On Sun, Mar 25, 2012 at 11:13 AM, elton sky <eltonsky9404@xxxxxxxxx> wrote:
>>>>> About the new format:
>>>>>
>>>>> The index is a single file. Entries in the index are still stored
>>>>> sequentially, as in the old format. The difference is that they are
>>>>> grouped into blocks. A block contains many entries, ordered by name.
>>>>> Blocks are also ordered by the name of their first entry. Each block
>>>>> contains a sha1 over the entries in it.
>>>>
>>>> If I remove an entry in the first block, because blocks are of fixed
>>>> size, you would need to shift all entries up by one, thus update all
>>>> blocks?
>>>
>>> We need some GC here. I am not moving all blocks. Rather, I would
>>> consider merging or recycling the block. In the simple case where a
>>> block becomes empty, I'll change the offset of the new block in the
>>> header to point to this block, and make this block point to the
>>> original offset of the new block. In this way, I keep a list of empty
>>> blocks I can reuse.
>> [...]
>>
>> Doesn't that venture into database land?
>>
>> If we go that far, wouldn't it be better to use a proper database
>> library? All other things being equal, writing such complex code from
>> scratch is probably not a good idea.
>
> If there's a library that fits our needs (including linking
> statically). I think we've come close to the sqlite file format [1].
> But sqlite comes with an SQL engine, transactional updates... that we
> don't need. Another obvious source of inspiration is file systems, but
> I dare not go that way.
>
> [1] http://www.sqlite.org/fileformat2.html

Or use LevelDB [2]. It's BSD licensed.
It uses an immutable file format, but writes updates to new, smaller
files and eventually collapses everything back together into a bigger
file. This can be a dramatically simpler approach than managing your
own free-block system inside a single file.

Its only real downside is needing to periodically pay a penalty to
rewrite the whole index. But that rewrite is going to be faster than
the time it takes to rewrite the pack files for the same repository,
which git gc or git repack already handles. So I don't think it's
actually a problem for the index.

You might even be able to take a two-level approach to compacting the
LevelDB database (or something like it). In a minor compaction you
compact all of the files except the huge base file, leaving you with
two files: a huge base file that contains the first tree the user
checked out, and a second, smaller file containing any differences
accumulated since the initial checkout (this may just be updated stat
data for a handful of files that differed across two branches as they
switched back and forth). During a git gc or git repack, add a new
stage that collapses the base file and everything else into a single
new base file as a major compaction.

[2] http://code.google.com/p/leveldb/
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
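To make the base/delta idea above concrete, here is a toy sketch in
Python. It is simplified to a single in-memory delta layer (real
LevelDB uses a write log plus multiple sorted table files); the names
BaseDeltaIndex, TOMBSTONE, and the string stat values are all
illustrative inventions, not anything from git or LevelDB.

```python
# Toy sketch of the "huge base + small delta" index layout described
# above. Writes only touch the delta; a major compaction (the proposed
# git gc / git repack stage) folds the delta into a fresh base.

TOMBSTONE = object()  # marks a deletion recorded in the delta layer

class BaseDeltaIndex:
    def __init__(self, base=None):
        self.base = dict(base or {})  # immutable between major compactions
        self.delta = {}               # recent updates and deletions

    def put(self, path, stat_data):
        # Cheap: only the small delta layer is modified.
        self.delta[path] = stat_data

    def delete(self, path):
        # Record a tombstone so the base entry is shadowed, not rewritten.
        self.delta[path] = TOMBSTONE

    def get(self, path):
        # Reads consult the delta first; it shadows the base.
        if path in self.delta:
            value = self.delta[path]
            return None if value is TOMBSTONE else value
        return self.base.get(path)

    def major_compaction(self):
        # Collapse base + delta into a single new base file's worth of
        # entries, applying tombstones, then start with an empty delta.
        merged = dict(self.base)
        for path, value in self.delta.items():
            if value is TOMBSTONE:
                merged.pop(path, None)
            else:
                merged[path] = value
        self.base = merged
        self.delta = {}
```

The point of the sketch is the cost model: put() and delete() never
rewrite the base, and only major_compaction() pays the full-rewrite
penalty, mirroring the claim that the expensive step can be deferred
to git gc.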