On 2008.12.08 14:30:28 +0100, Michael J Gruber wrote:
> Thomas Jarosch venit, vidit, dixit 07.12.2008 18:41:
> > Hello together,
> >
> > I've successfully imported a large subversion repository into git.
> > The tree contains source code and binary data ("releases"),
> > the resulting .git directory is about 11GB.
> >
> > After the import I recreated the tags/branches by converting the refs
> > to the subversion tags using a small shell script from the web:
> >
> > for branch in `git branch -r`; do
> >     ...
> >     version=`basename $branch`
> >     git tag -s -f -m "$subject" "$version" "$branch^"
> >     git branch -d -r $branch
> > done
> >
> > Ok, so far everything went really smooth. I wanted to split this repository
> > into two repositories, one for the source code and one for the binary data.
> > The current tree layout is like this:
> >
> > sources/c++_xyz
> > releases/large_binary_data
> > ...
> >
> > The original tree was imported from CVS to subversion and the layout
> > of the trunk was once reorganized/moved later. Here's the command
> > I used to split out the "source" tree:
> >
> > git filter-branch --index-filter 'git rm --cached --ignore-unmatch -r -f
> >     CVSROOT Attic source/Attic develpkg/Attic
> >     source/packages/Attic releases update_pkg' -- --all
> >
> > After that I ran these commands to reclaim the space:
> > - git clone --no-hardlinks filtered_tree final_output
> > - cd final_output
> > - git gc
> > - git prune
> > - git repack -a -d --depth=250 --window=250
> >
> > Unfortunately the .git directory of the "source" tree is still 7.5GB big.
> >
> > When I just imported the "trunk" from subversion without any tags
> > and then ran "git filter-branch --subdirectory-filter source" + git gc,
> > the .git directory was about 1.5GB afterwards.
> >
> > How can I find out where those other 6GB go to?
> > I already looked at the tags with gitk,
> > there's no sign of the releases/* stuff left.
>
> I strongly suspect the reorganization/move to be the cause.
> Most probably some releases were put in places where you don't expect them,
> and therefore they are not filtered out by removing the releases subdir.
> If they have distinguished file names (say you know a name from before
> the move) you can find them using "git log". Or use gitk --all, switch
> to "tree display" and look for unexpected files in the earliest revisions.

If it's about huge objects, and not just lots of small objects, you can
use this:

# Find large objects
git rev-list --objects --all | cut -f1 -d' ' | \
    git cat-file --batch-check | grep blob | sort -n -k 3

This outputs lines in the format:
<object_hash> blob <object_size>

sorted by object size, large objects come last.

To make use of that information, you'll likely need to also find the
filename(s) that are used for these blobs:

# Find filenames for objects
git rev-list --all --objects | grep <object_hash>

And then you can use the filenames to do some more filtering.

Björn
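
The two steps described above (find the large blobs, then map their hashes
to filenames) can probably be combined into a single pass. What follows is
only a sketch under assumptions: it relies on "git cat-file --batch-check"
accepting a format string with the %(rest) placeholder, which needs a Git
version newer than the one current when this thread was written, and
largest_blobs is a hypothetical helper name, not a Git command.

```shell
#!/bin/sh
# Sketch: print the N largest blobs reachable from any ref, one line each:
#   blob <object_size> <object_hash> <path>
# largest_blobs is a hypothetical helper name (an assumption, not part of Git).
largest_blobs() {
    count=${1:-20}    # how many lines to show; defaults to 20

    # rev-list prints "<hash> <path>" for trees and blobs; %(rest) copies
    # everything after the hash (the path) through to the output line.
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
        awk '$1 == "blob"' |     # keep blobs only, drop commits, trees, tags
        sort -k2,2n |            # numeric sort on the size column
        tail -n "$count"         # the largest objects come last
}
```

Run inside the repository as, for example, "largest_blobs 10". A blob that
appears under several paths is listed only once, with the first path
rev-list encountered, so the "grep <object_hash>" step is still the way to
see every filename that refers to it.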