Git tree object storing policy

Ivan Tolstosheyev <ivan.tolstosheyev@xxxxxxxxx> · Tue, 21 Feb 2012 09:22:12 +0000 (UTC)

Hello,

Currently a tree object is a simple list of <attributes, hash, name> entries
sorted by name (the sort is tricky, because the comparison function treats
a directory name "$X" as if it were "$X/"). The problem is that if I add
10k files to the root directory of an empty git repository, one commit at
a time, there will be 10k new trees, with sizes ranging from 1 to 10k times
(hash+name+attribute)+eps.
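
To put a rough number on that: if every entry costs about the same number
of bytes, the trees written by n one-file commits add up to
entry*(1+2+...+n) = entry*n*(n+1)/2 bytes before compression. A minimal
sketch of that arithmetic; total_tree_bytes is a hypothetical helper, and
33 bytes per entry is an assumption (mode, a short name, separators and a
20-byte SHA-1), not a measured value.

```shell
#!/bin/sh
# Total uncompressed bytes of all root trees written by n one-file commits.
# Assumes every tree entry costs the same number of bytes (a simplification).
total_tree_bytes() (
  n=$1 entry=$2
  # entry * (1 + 2 + ... + n)
  echo $(( entry * n * (n + 1) / 2 ))
)

# 10k commits, assumed 33 bytes per entry
total_tree_bytes 10000 33
```

With those assumed numbers this prints 1650165000, i.e. about 1.65 GB of
uncompressed tree data. What actually lands on disk is smaller, because
loose objects are zlib-compressed, but the quadratic growth is the same.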

itroot@localhost ~/tmp> cat git-test.sh 
#!/usr/bin/env bash

git init test
cd test || exit 1
for i in $(seq 1 10000)
do
touch "${i}" ; git add "${i}" ; git commit -m "Add ${i}"
done
cd ..
du -hs test
itroot@localhost ~/tmp>

itroot@localhost ~/tmp> ./git-test.sh
...
180M	test
itroot@localhost ~/tmp>

180 MB!? And only 7.4 MB after `git gc`, thanks to delta compression!
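
For anyone who wants to reproduce the comparison: `du` measures the whole
directory, working tree included, while `git count-objects` reports the
object store alone, split into loose and packed objects. A small sketch,
assuming a git recent enough to support `-C` and `count-objects -H`;
report_object_storage is a hypothetical helper name.

```shell
#!/bin/sh
# Print object-store statistics for a repository before and after `git gc`.
# $1 is the repository path (defaults to the current directory).
report_object_storage() (
  repo=${1:-.}
  echo "before gc:"
  git -C "$repo" count-objects -vH   # loose objects: "count" / "size"
  git -C "$repo" gc --quiet          # pack and delta-compress
  echo "after gc:"
  git -C "$repo" count-objects -vH   # packed objects: "in-pack" / "size-pack"
)
```

Running `report_object_storage test` after the script above should show the
loose "size" shrinking and "size-pack" taking over once gc has run.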

OK, you can say that this example is artificial, and that I could add all
10k files in one commit. That's true. But manipulating files in big tree
objects (that is, in big directories) is storage-expensive, and if I need
to store a lot of files in one directory and change them frequently, git
just doesn't scale properly for this use case right now.

What do I propose?
We can add another git object type, named for example "btree", that
contains either other "btree" objects or files. This would be a simple
B-tree structure (with entries sorted practically by name; BTW, maybe it's
time to fix the sorting =] ) that allows insertion, removal and search in
O(log n) time, so big directories would no longer be a problem. BTW, if all
directories are small, a btree will look just like a tree: a single btree
pointing to files.
So one big tree with 10k files turns into (hmm, for example...)
101 btrees: one pointing to 100 btrees, each of which points to 100 files.
(100 entries per btree is a wild guess =) )
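
A back-of-the-envelope sketch of what this could save per commit. The
assumptions are mine and not taken from any real implementation: every
entry costs the same `entry` bytes in leaf and inner btrees alike, nodes
are full, and adding one file rewrites exactly one node per level on the
path from leaf to root. Both function names are hypothetical.

```shell
#!/bin/sh
# Bytes rewritten when one file is added to a flat tree of n entries:
# the whole tree object is written again.
flat_rewrite_bytes() (
  n=$1 entry=$2
  echo $(( n * entry ))
)

# Bytes rewritten when one file is added to a btree holding n files with
# at most `fanout` entries per node: one node per level is written again.
btree_rewrite_bytes() (
  n=$1 fanout=$2 entry=$3 entries=0
  while [ "$n" -gt "$fanout" ]; do
    entries=$(( entries + fanout ))        # one full node on this level
    n=$(( (n + fanout - 1) / fanout ))     # nodes on this level = entries one level up
  done
  entries=$(( entries + n ))               # the root holds what is left
  echo $(( entries * entry ))
)

flat_rewrite_bytes 10000 33
btree_rewrite_bytes 10000 100 33
```

With 10k files, 100 entries per btree and an assumed 33 bytes per entry,
this prints 330000 for the flat tree and 6600 for the btree. For a
directory of 50 files both functions give the same 1650 bytes, which
matches the point that small directories stay tree-like.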

Suggestions?
