Re: help needed: Splitting a git repository after subversion migration

On 2008.12.08 14:30:28 +0100, Michael J Gruber wrote:
> Thomas Jarosch venit, vidit, dixit 07.12.2008 18:41:
> > Hello together,
> > 
> > I've successfully imported a large subversion repository into git.
> > The tree contains source code and binary data ("releases"),
> > the resulting .git directory is about 11GB.
> > 
> > After the import I recreated the tags/branches by converting the refs
> > to the subversion tags using a small shell script from the web:
> > 
> > for branch in `git branch -r`; do
> >      ...
> >      version=`basename $branch`
> >      git tag -s -f -m "$subject" "$version" "$branch^"
> >      git branch -d -r $branch
> > done
> > 
> > Ok, so far everything went really smooth. I wanted to split this repository
> > into two repositories, one for the source code and one for the binary data.
> > The current tree layout is like this:
> > 
> > sources/c++_xyz
> > releases/large_binary_data
> > ...
> > 
> > The original tree was imported from CVS to subversion and the layout
> > of the trunk was once reorganized/moved later. Here's the command
> > I used to split out the "source" tree:
> > 
> > git filter-branch --index-filter 'git rm --cached --ignore-unmatch -r -f
> > CVSROOT Attic source/Attic develpkg/Attic
> > source/packages/Attic releases update_pkg' -- --all
> > 
> > After that I ran these commands to reclaim the space:
> > - git clone --no-hardlinks filtered_tree final_output
> > - cd final_output
> > - git gc
> > - git prune
> > - git repack -a -d --depth=250 --window=250
> > 
> > Unfortunately the .git directory of the "source" tree is still 7.5GB big.
> > 
> > When I just imported the "trunk" from subversion without any tags
> > and then ran "git filter-branch --subdirectory-filter source" + git gc,
> > the .git directory was about 1.5GB afterwards.
> > 
> > How can I find out where those other 6GB go to?
> > I already looked at the tags with gitk,
> > there's no sign of the releases/* stuff left.
> 
> I strongly suspect the reorganization/move to be the cause. Most
> probably some releases were put in places where you don't expect them,
> and therefore they are not filtered out by removing the releases subdir.
> If they have distinguished file names (say you know a name from before
> the move) you can find them using "git log". Or use gitk --all, switch
> to "tree display" and look for unexpected files in the earliest revisions.

If the space is taken up by a few huge objects rather than by lots of
small ones, you can list all blobs by size like this:

# Find large objects
git rev-list --objects --all | cut -f1 -d' ' | \
	git cat-file --batch-check | grep blob | sort -n -k 3

This outputs lines in the format:
<object_hash> blob <object_size>

sorted by object size in ascending order, so the largest objects come
last. To make use of that information, you'll likely also need to find
the filename(s) under which these blobs were stored:

# Find filenames for objects
git rev-list --all --objects | grep <object_hash>
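
The two steps can also be combined into one pass. This is only a sketch,
and it assumes a git recent enough that "git cat-file --batch-check"
accepts a format string with %(rest) (which carries the path printed by
rev-list through to the output). The function name list_blobs_by_size is
made up for illustration:

```shell
# Sketch: print "<size> <hash> <path>" for every blob reachable from any ref.
# Output is sorted by size in ascending order, so the largest blobs come last.
list_blobs_by_size() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
        awk '$1 == "blob" { sub(/^blob /, ""); print }' |
        sort -n -k 1
}
```

Run it inside the repository, e.g. "list_blobs_by_size | tail -n 20" to
see the twenty largest blobs together with a path for each.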

You can then pass those filenames to another filter-branch run to
remove them from the history as well.
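
A minimal sketch of such a follow-up run, reusing the index-filter
approach from the original mail. The function name
remove_paths_from_history is hypothetical, and this simple version
assumes the paths contain no whitespace or shell metacharacters. It
rewrites history, so try it on a throwaway clone first:

```shell
# Sketch: remove the given paths from every commit on every ref.
remove_paths_from_history() {
    # "$*" joins all arguments with spaces inside the filter command.
    # FILTER_BRANCH_SQUELCH_WARNING only silences the notice that newer
    # git versions print; older versions ignore it.
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
        --index-filter "git rm -r -q --cached --ignore-unmatch -- $*" \
        -- --all
}
```

Usage would be something like "remove_paths_from_history some/old/dir
another/file". Note that filter-branch keeps the pre-rewrite refs under
refs/original/, so the old objects stay reachable until those refs (and
the reflogs) are gone. Cloning the filtered repository, as you already
did, is one way to get rid of them.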

Björn