Re: Some tips for doing a CVS importer

Michael Haggerty <mhagger@xxxxxxxxxxxx> · Thu, 30 Nov 2006 01:35:18 +0100

Markus Schiltknecht wrote:
> Michael Haggerty wrote:
>> This is the part that can get quite expensive for large repositories, as
>> there can be orders of magnitude more symbol creations than revisions.
>> According to Daniel Jacobowitz:
>>
>>> [...] at one point I believe the GCC repository was gaining up
>>> to four tags a day (head, two supported release branches, and one
>>> vendor branch).  I've been using the principal that the number of tags
>>> might be unworkable, but the number of branches generally is not.
>>
>> This means that the number of tag events is O(number-of-days *
>> total-number-of-files-in-repo), where the gcc repo has about 50000
>> files.  By contrast, only a small fraction of files is typically touched
>> in any day.
> 
> Yeah, 50'000 * 1825 (5 years) * say 100 bytes -> 8GB  sounds like a lot.
> OTOH, I certainly don't need 100 bytes per tag and one tag per day over
> five years is really a lot. Repositories that large are probably not
> converted to CVS on an old Pentium III...

...times 4 (tags per day) -> 32GB.  If I understand correctly, the tags
were created nightly by automated scripts.
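
For what it is worth, here is the back-of-the-envelope arithmetic spelled
out.  It is only a sketch using the figures quoted in this thread; the 100
bytes per tag is a guess, not a measurement.  It comes out at roughly
8.5 GiB and 34 GiB, the same ballpark as the rounded numbers above.

```python
# Figures quoted in this thread; bytes_per_tag is a guess, not a measurement.
files = 50_000            # approximate number of files in the gcc repo
days = 5 * 365            # five years of nightly tags (1825 days)
bytes_per_tag = 100       # assumed storage cost of one tag event on one file
tags_per_day = 4          # head, two release branches, one vendor branch

# One tag per day touches every file in the repository once.
one_tag_per_day = files * days * bytes_per_tag
four_tags_per_day = one_tag_per_day * tags_per_day

print(f"one tag/day:   {one_tag_per_day / 2**30:.1f} GiB")
print(f"four tags/day: {four_tags_per_day / 2**30:.1f} GiB")
```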

I admit that this is an extreme example, but the philosophy of the
cvs2svn project (a philosophy that I inherited from my predecessors, by
the way) is to be able to handle the most absurd repositories out there.

> Well, almost. I meant a whole repository with these branches. If one
> file included all the branches it's getting easy to resolve. But for my
> example, I had something like that in mind:

I am glad that we are getting into concrete examples.  But your example
needs some clarification (see below).

> fileA:
> 
> A = 1.2.2
> (no changes for branch B)
> C = 1.2.4      --> makes A a possible parent of branch C

In this case, ROOT can also be C's parent.

> D = 1.2.2.5.2  --> makes A a possible parent of branch D

This implies that A is *necessarily* the parent of D.  If there were a
E=1.2.2.5.4, then the parent of E would be ambiguous but the parent of D
would still unambiguously be A.

> X = 1.2.4      --> makes C a possible parent of tag X

Wait a minute.  A tag's revision number always has an even number of
components; 1.2.4 has three, which makes it a branch number.  Do you
mean X=1.2 or X=1.2.4.1?  The same below.
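
To make the rule concrete, here is a minimal sketch of how a symbol's
number can be classified.  The function names are made up for
illustration.  It also treats the "magic" form with a 0 in the
second-to-last position (as in B:1.1.0.6) as a branch, which is how CVS
records branches in the symbols list.

```python
def parse(rev):
    """Turn '1.2.4.1' into the tuple (1, 2, 4, 1)."""
    return tuple(int(p) for p in rev.split("."))

def classify(rev):
    """Return ('tag', revision) or ('branch', branch_number)."""
    parts = parse(rev)
    if len(parts) % 2 == 1:
        # Odd number of components, e.g. 1.2.4: a plain branch number.
        return ("branch", parts)
    if len(parts) >= 4 and parts[-2] == 0:
        # Magic branch number, e.g. 1.1.0.6 stands for branch 1.1.6.
        return ("branch", parts[:-2] + parts[-1:])
    # Even number of components without the magic 0: a revision, so a tag.
    return ("tag", parts)

print(classify("1.2.4"))      # the number given for X in the example
print(classify("1.2.4.1"))    # one of the two readings suggested above
print(classify("1.1.0.6"))    # magic branch number
```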

> fileB:
> 
> A = 1.2.2
> B = 1.2.4      --> makes A a possible parent of branch B

or ROOT

> C = 1.2.6      --> makes B a possible parent of branch C

or A or ROOT

> D = 1.2.2.5.2  --> makes A a possible parent of branch D

A is unambiguously the parent of D

> X = 1.2.2.5.2  --> makes D a possible parent of tag X
>
> fileC:
> A = 1.2.2
> X = 1.2.2      --> makes A a possible parent of tag X
> 
> fileD:
> A = 1.2.2
> B = 1.2.4
> X = 1.2.4      --> makes B a possible parent of tag X
> 
>>> The symbol blob for branch A: has only one possible parent: ROOT. Thus I
>>> assign A->parent_branch = ROOT.
>>>
>>> Next comes the blob for branch C: it has two possible parents: branch B
>>> and branch A.
>>
>> Why is ROOT not considered as a possible parent of C?
> 
> Those were just examples. In my CVS-repository-in-mind, none of the
> files were branching from ROOT directly into C.

In your example, ROOT *is* a possible parent of C.

>>> At that point we know that A is derived from ROOT, but we
>>> don't have assigned a parent to B, yet. Thus we can not resolve C this
>>> time.
>>>
>>> Then comes branch B: one parent: A. Mark it.

In your example, ROOT is also a possible parent of B.

>>> Next round, we process C again: this time, we know B is branched from A.
>>> Thus we can remove the possible parent A. Leaving only one possible
>>> parent branch: B.
>>
>> But the fact that B preceded C chronologically does not mean that C is
>> derived from B.
> 
> No. And I don't assume so in any place. Given the files above, I can
> however clearly say that C got branched off from B, no?

No.  C is nowhere unambiguously derived from B, therefore its parent
could be ROOT, A, or B.  See my example below.

>> If I branch from ROOT or A after creating branch B, the
>> result as stored in CVS looks exactly the same as if I branch from B
>> (unless a file was modified between the creation of the parent branch
>> and the creation of the child branch).
> 
> Sure. That would result in an unresolvable symbol.
> 
>>> Now, say we have a tag 'X', which ended up in a blob having A, B, C and
>>> D as possible parent branches. I currently remove A and B, as they are
>>> parents of C. But C and D still remain and conflict. I'm unable to
>>> resolve that symbol. I'm thinking about leaving such conflicts to the
>>> user to resolve.

I don't know how to deal with tag X because the numbers that you
assigned to it above can't be correct.

Consider the attached script.  It unambiguously creates branches A1 and
A2 from ROOT and branch B from A1, then adds tag X on branch B.  But look
at the symbols that end up in the files:

fileA symbols
        X:1.1
        B:1.1.0.6
        A2:1.1.0.4
        A1:1.1.0.2;

fileB symbols
        X:1.1.2.1
        B:1.1.2.1.0.2
        A2:1.1.0.4
        A1:1.1.0.2;

fileC symbols
        X:1.1.6.1
        B:1.1.0.6
        A2:1.1.0.4
        A1:1.1.0.2;

fileD symbols
        X:1.1
        B:1.1.0.4
        A2:1.2.0.2
        A1:1.1.0.2;

Note that from looking at fileA alone, there is no way to tell whether
A2 was created from ROOT or A1, or whether B was created from ROOT, A1,
or A2.  And tag X is all over the place, even though for each file it
was created from branch B.

If only information from fileA,v is considered, any of the following
branching topologies would give identical fileA,v contents:

      ROOT
      /|\
     / | \
    A1 A2 B

      ROOT
      / \
     /   \
    A1   A2
    |
    B

      ROOT
      / \
     /   \
    A1   A2
          |
          B

      ROOT
      / \
     /   \
    A1    B
    |
    A2

      ROOT
       |
       A1
      / \
     /   \
    A2    B

      ROOT
       |
       A1
       |
       A2
       |
       B

And from the information present in fileA,v, it is not possible to tell
whether tag X was applied to ROOT, A1, A2, or B.

(Some topologies *are* ruled out because they are inconsistent with the
order of the branch numbers; for example:

      ROOT
       |
       B
      / \
     /   \
    A1   A2

      ROOT
       |
       A2
       |
       A1
       |
       B

are not consistent with fileA,v.)
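
The enumeration above can be reproduced mechanically.  The following is
only a sketch, not what cvs2svn does: the function names and the
"<unlabeled>" placeholder are invented, and it assumes that branch numbers
sprouting from the same revision were handed out in creation order.  For
each branch it collects the line of development its sprout revision lies
on, plus every earlier branch sprouting from the same revision.

```python
from itertools import product

ROOT = "ROOT"

def parse(rev):
    return tuple(int(p) for p in rev.split("."))

def branches(symbols):
    """Map branch name -> (sprout revision, branch index) for magic numbers x.0.n."""
    out = {}
    for name, rev in symbols.items():
        parts = parse(rev)
        if len(parts) >= 4 and parts[-2] == 0:
            out[name] = (parts[:-2], parts[-1])
    return out

def possible_parents(symbols):
    info = branches(symbols)
    # Real branch number (sprout + index) -> name, to look up a revision's line.
    by_number = {sprout + (idx,): name for name, (sprout, idx) in info.items()}
    result = {}
    for name, (sprout, idx) in info.items():
        if len(sprout) == 2:
            # Sprout revision is on trunk.
            candidates = {ROOT}
        else:
            # Sprout revision lies on another branch (possibly one without a label).
            candidates = {by_number.get(sprout[:-1], "<unlabeled>")}
        # An earlier branch sprouting from the same revision has identical
        # content as long as nothing was committed on it, so it is also possible.
        for other, (o_sprout, o_idx) in info.items():
            if o_sprout == sprout and o_idx < idx:
                candidates.add(other)
        result[name] = candidates
    return result

file_a = {"X": "1.1", "B": "1.1.0.6", "A2": "1.1.0.4", "A1": "1.1.0.2"}
parents = possible_parents(file_a)
names = sorted(parents)
topologies = [dict(zip(names, choice))
              for choice in product(*(sorted(parents[n]) for n in names))]

print(parents)
print(len(topologies), "topologies consistent with fileA,v")
```

For fileA this yields one choice for A1, two for A2 and three for B, which
is where the six pictures come from.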

If we also consider the information in fileB, it is clear that branch
B's parent is branch A1, but it is still not clear whether branch A2's
parent is ROOT or A1, or whether tag X was applied to branch A1 or B.

Similarly, fileC,v tells us that tag X was applied to branch B, and
fileD,v tells us that A2's parent is ROOT.

Each file alone is quite ambiguous, but in this case putting the
information from all files together (with the assumption that they have
a mutually-consistent history) is enough to reconstruct the entire
branching topology.
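
As an illustration of "putting the information together", here is a naive
sketch that computes the candidate sources of every symbol per file and
then simply intersects them across files.  The helper names are invented,
it assumes a mutually-consistent history, and it ignores unlabeled
branches and commits made on an earlier branch before the later one was
created.

```python
ROOT = "ROOT"

FILES = {
    "fileA": {"X": "1.1", "B": "1.1.0.6", "A2": "1.1.0.4", "A1": "1.1.0.2"},
    "fileB": {"X": "1.1.2.1", "B": "1.1.2.1.0.2", "A2": "1.1.0.4", "A1": "1.1.0.2"},
    "fileC": {"X": "1.1.6.1", "B": "1.1.0.6", "A2": "1.1.0.4", "A1": "1.1.0.2"},
    "fileD": {"X": "1.1", "B": "1.1.0.4", "A2": "1.2.0.2", "A1": "1.1.0.2"},
}

def parse(rev):
    return tuple(int(p) for p in rev.split("."))

def candidates(symbols):
    """Candidate sources of every symbol, judged from one file only."""
    sprouts, tags = {}, {}
    for name, rev in symbols.items():
        parts = parse(rev)
        if len(parts) >= 4 and parts[-2] == 0:
            sprouts[name] = (parts[:-2], parts[-1])   # magic branch number
        elif len(parts) % 2 == 0:
            tags[name] = parts                        # plain revision: a tag
    by_number = {s + (i,): n for n, (s, i) in sprouts.items()}

    def line_of(rev):
        # Line of development that a revision lies on.
        return ROOT if len(rev) == 2 else by_number.get(rev[:-1], "<unlabeled>")

    out = {}
    for name, (sprout, idx) in sprouts.items():
        # Containing line, plus earlier branches sprouting from the same revision.
        out[name] = {line_of(sprout)} | {
            n for n, (s, i) in sprouts.items() if s == sprout and i < idx}
    for name, rev in tags.items():
        # Containing line, plus any branch sprouting from the tagged revision.
        out[name] = {line_of(rev)} | {n for n, (s, _) in sprouts.items() if s == rev}
    return out

def combine(files):
    """Intersect the per-file candidate sets."""
    combined = {}
    for symbols in files.values():
        for name, cands in candidates(symbols).items():
            combined[name] = combined.get(name, cands) & cands
    return combined

result = combine(FILES)
for name in sorted(result):
    print(name, "<-", sorted(result[name]))
```

With these four files every set shrinks to a single entry: A1 and A2 come
from ROOT, B comes from A1, and X was applied to B.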

Real life is worse.  Each file rules out some possible histories, and
the goal is to find a history that is consistent with all files.  But...

- There can easily be cases where even the total information from all
files is still not enough to choose a unique history.  In such cases we
need a way to select between the possible histories.

- Since files in CVS don't necessarily *have* a globally consistent
branching/tagging history, heuristics have to be used in such cases to
find histories that apply to subsets of the repository in some
reasonable way (i.e., the ones that are most likely considering the way
people typically work with CVS).

- "Unlabeled branches": often users have removed the label from a
branch, but the branch is still used as a source for other branches.
Figuring out this situation is a real mess.

I imagine that the best results (never mind whether it is practical)
would be obtained by recording the topology constraints implied by each
*,v file, then trying to map the topologies onto each other pair by pair
to (1) combine the constraints and thereby limit the possible histories
and (2) deduce which unlabeled branches correspond to one another.  But
I still don't know how to deal with inconsistent histories.  I think a
bottom-up approach would be the most sensible, given that people are
probably more likely to tag a whole subdirectory rather than files
scattered here and there.

The second step is to decide at what point in time a branch or tag
should be created, with the goal of being able to create it as a
snapshot of the source branch at that moment.  This is not always
possible, even if the branch topologies are compatible.
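
One way to picture this second step, as a sketch only: for each file,
take the time window during which the revision that the symbol sprouts
from was the newest revision on the source branch.  A snapshot is possible
only if all of those windows overlap.  The function name and the
timestamps below are invented for illustration.

```python
def snapshot_window(intervals):
    """Return the common (start, end) window, or None if there is none.

    intervals: one (start, end) pair per file, giving when the sprout
    revision was the tip of the source branch.  end is None if that
    revision was never superseded on the source branch.
    """
    start = max(s for s, _ in intervals)
    ends = [e for _, e in intervals if e is not None]
    end = min(ends) if ends else None
    if end is not None and start >= end:
        # Some file had already moved on before another reached its sprout revision.
        return None
    return (start, end)

# Hypothetical timestamps: three files whose windows overlap ...
consistent = [(100, 250), (180, None), (120, 300)]
# ... and two files whose windows do not.
inconsistent = [(100, 150), (200, None)]

print(snapshot_window(consistent))
print(snapshot_window(inconsistent))
```

In the second case no single moment works, so the symbol would have to be
patched up after it is created.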

Michael

Attachment: makerepo.sh (application/shellscript)