Re: fast-import and unique objects.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 8/8/06, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
Jon Smirl <jonsmirl@xxxxxxxxx> wrote:
> We're designing a dumpfile format for git like the one SVN has.

I'm not sure I'd call it a dumpfile format.  More like an importfile
format.  Reading a GIT pack is really pretty trivial; if someone was
going to write a parser/reader to pull apart a GIT repository and
use that information in another way they would just do it against
the pack files.  Its really not that much code.  But generating a
pack efficiently for a large volume of data is slightly less trivial;
the attempt here is to produce some tool that can take a relatively
trivial data stream and produce a reasonable (but not necessarily
absolute smallest) pack from it in the least amount of CPU and
disk time necessary to do the job.  I would hope that nobody would
seriously consider dumping a GIT repository back INTO this format!

[snip]
> AFAIK the svn code doesn't do merge commits. We probably need a post
> processing pass in the git repo that finds the merges and closes off
> the branches. gitk won't be pretty with 1,500 open branches. This may
> need some manual clues.

*wince* 1500 open branches.  Youch.  OK, that answers a lot of
questions for me with regards to memory handling in fast-import.
Which you provide excellent suggestions for below.  I guess I didn't
think you had nearly that many...

[snip]
> The file names are used over and over. Alloc a giant chunk of memory
> and keep appending the file name strings to it. Then build a little
> tree so that you can look up existing names. i.e. turn the files names
> into atoms. Never delete anything.

Agreed.  For 1500 branches its worth doing.

[snip]
> About 100,000 files in the initial change set that builds the repo.
> FInal repo has 120,000 files.
>
> There are 1,500 branches. I haven't looked at the svn dump file format
> for branches, but I suspect that it sends everything on a branch out
> at once and doesn't intersperse it with the trunk commits.

If you can tell fast-import your are completely done processing a
branch I can recycle the memory I have tied up for that branch; but
if that's going to be difficult then...  hmm.

Right now I'm looking at around 5 MB/branch, based on implementing
the memory handling optimizations you suggested.  That's still *huge*
for 1500 branches.  I clearly can't hang onto every branch in memory
for the entire life of the import like I was planning on doing.
I'll kick that around for a couple of hours and see what I come
up with.

Some of these branches are what cvs2svn calls unlabeled branches.
cvs2svn is probably creating more of these than necessary since the
code for coalescing them into a single big unlabeled branch is not
that good.

I attached the list of branch names being generated.




--
Shawn.



--
Jon Smirl
jonsmirl@xxxxxxxxx

Attachment: cvs2svn-branches.txt.bz2
Description: BZip2 compressed data


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]