[RFC] git-split: Split the history of a git repository by subdirectories and ranges

Hello,

I co-maintain the X C Binding (XCB) project with Jamey Sharp.
Previously, several XCB-related projects all existed under the umbrella
of a single monolithic GIT repository with per-project subdirectories.
We have split this repository into individual per-project repositories.

Jamey Sharp and I wrote a script called git-split to accomplish this
repository split. git-split reconstructs the history of a sub-project
previously stored in a subdirectory of a larger repository. It
constructs new commit objects based on the existing tree objects for the
subtree in each commit, and discards commits which do not affect the
history of the sub-project, as well as merges made unnecessary due to
these discarded commits.  When git-split finishes, it prints the SHA-1 of
the new head commit on standard output, suitable for redirection into a
file under .git/refs/heads.  At that point, you can clone the new head, or
copy the repository and prune out undesired heads, tags, and objects.
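
The discard rule described above can be sketched on a toy, in-memory
history, with no git involved.  This is only an illustration under stated
assumptions, not the attached script: split_history, the "subtree" field
(standing in for the subdirectory's tree hash), and the generated ids
"n0", "n1", ... are all hypothetical names, and the sketch is written for
a current Python 3 interpreter.

```python
def split_history(commits, head):
    """Rewrite a toy history, keeping only commits that change the subtree.

    commits maps id -> {"parents": [ids], "subtree": value}; the value is a
    stand-in for the subdirectory's tree hash.  Returns (new_head, new_commits).
    """
    new_commits = {}   # new id -> {"parents": [...], "subtree": ...}
    rewritten = {}     # old id -> new id

    def is_ancestor(cur, other):
        # True if `other` is reachable from `cur` in the rewritten history.
        stack, seen = [cur], set()
        while stack:
            node = stack.pop()
            if node == other:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(new_commits[node]["parents"])
        return False

    def walk(old):
        if old in rewritten:
            return rewritten[old]
        subtree = commits[old]["subtree"]
        # Rewrite parents first, dropping duplicates but keeping their order.
        parents = list(dict.fromkeys(walk(p) for p in commits[old]["parents"]))

        candidate = None
        if len(parents) == 1:
            candidate = parents[0]
        elif len(parents) == 2:
            # A merge is unnecessary if one parent already contains the other.
            a, b = parents
            if is_ancestor(a, b):
                candidate = a
            elif is_ancestor(b, a):
                candidate = b
        # Reuse the parent only if the subtree did not change.
        if candidate is not None and new_commits[candidate]["subtree"] != subtree:
            candidate = None

        if candidate is None:
            candidate = "n%d" % len(new_commits)
            new_commits[candidate] = {"parents": parents, "subtree": subtree}
        rewritten[old] = candidate
        return candidate

    return walk(head), new_commits


# A is the root; B changes only some other project; C changes the subtree;
# D merges B and C without changing the subtree any further.
history = {
    "A": {"parents": [], "subtree": "t1"},
    "B": {"parents": ["A"], "subtree": "t1"},
    "C": {"parents": ["A"], "subtree": "t2"},
    "D": {"parents": ["B", "C"], "subtree": "t2"},
}
new_head, new_commits = split_history(history, "D")
print(new_head, len(new_commits))  # prints: n1 2
```

In this toy history, B collapses into A because it leaves the subtree
untouched, and the merge D collapses into C because, once B is gone, one
of its parents is an ancestor of the other.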

I have attached git-split for review.  If the GIT community has any
interest in seeing git-split become a part of GIT, we can write up the
necessary documentation and patch.

We would like to acknowledge the work of the gobby team in creating a
collaborative editor which greatly aided the development of git-split.

- Josh Triplett
#!/usr/bin/python
# git-split: Split the history of a git repository by subdirectories and ranges
# Copyright (C) 2006 Jamey Sharp, Josh Triplett
#
# You can redistribute this software and/or modify it under the terms of
# the GNU General Public License as published by the Free Software
# Foundation; version 2 dated June, 1991.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# General Public License for more details.

from itertools import izip
from subprocess import Popen, PIPE
import os, sys

def run(cmd, stdin=None, env=None):
    """Run cmd with env added to the environment; return its stdout."""
    newenv = os.environ.copy()
    if env:
        newenv.update(env)
    return Popen(cmd, stdin=PIPE, stdout=PIPE, env=newenv).communicate(stdin)[0]

def parse_author_date(s):
    """Given a GIT author or committer string, return (name, email, date)"""
    (name, email, time, timezone) = s.rsplit(None, 3)
    return (name, email[1:-1], time + " " + timezone)

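def get_subtree(tree, name):
    """Return the hash of entry name within tree, or None if it is absent."""
    output = run(["git-ls-tree", tree, name])
    if not output:
        return None
    # git-ls-tree prints "<mode> <type> <hash>\t<name>"; field 2 is the hash.
    return output.split()[2]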

def is_ancestor(new_commits, cur, other):
    """Return True if cur has other as an ancestor, or False otherwise."""
    return run(["git-merge-base", cur, other]).strip() == other

def walk(commits, new_commits, commit_hash, project):
    """Rewrite commit_hash and its ancestors for project; return the new hash."""
    commit = commits[commit_hash]
    if "new_hash" not in commit:
        tree = get_subtree(commit["tree"], project)
        commit["new_tree"] = tree
        if not tree:
            raise Exception("Did not find project in tree for commit " + commit_hash)
        new_parents = list(set([walk(commits, new_commits, parent, project)
                                for parent in commit["parents"]]))

        new_hash = None
        if len(new_parents) == 1:
            new_hash = new_parents[0]
        elif len(new_parents) == 2: # Check for unnecessary merge
            if is_ancestor(new_commits, new_parents[0], new_parents[1]):
                new_hash = new_parents[0]
            elif is_ancestor(new_commits, new_parents[1], new_parents[0]):
                new_hash = new_parents[1]
        if new_hash and new_commits[new_hash]["new_tree"] != tree:
            new_hash = None

        if not new_hash:
            args = ["git-commit-tree", tree]
            for new_parent in new_parents:
                args.extend(["-p", new_parent])
            env = dict(zip(["GIT_AUTHOR_"+n for n in ["NAME", "EMAIL", "DATE"]],
                           parse_author_date(commit["author"]))
                       +zip(["GIT_COMMITTER_"+n for n in ["NAME", "EMAIL", "DATE"]],
                            parse_author_date(commit["committer"])))
            new_hash = run(args, commit["message"], env).strip()

        commit["new_parents"] = new_parents
        commit["new_hash"] = new_hash
        if new_hash not in new_commits:
            new_commits[new_hash] = commit
    return commit["new_hash"]

def main(args):
    if not(1 <= len(args) <= 3):
        print "Usage: git-split subdir [newest [oldest]]"
        return 1

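    # Strip any trailing slash: git-ls-tree would otherwise list the
    # directory's contents rather than the directory's own tree entry.
    project = args[0].rstrip("/")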
    if len(args) > 1:
        newest = args[1]
    else:
        newest = "HEAD"
    newest_hash = run(["git-rev-parse", newest]).strip()
    if len(args) > 2:
        oldest = args[2]
        oldest_hash = run(["git-rev-parse", oldest]).strip()
    else:
        oldest_hash = None

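    grafts = {}
    try:
        for line in open(".git/info/grafts").read().split("\n"):
            fields = line.split()
            # Skip blank lines and comments.  A graft that lists only the
            # child hash makes that commit a root (no parents).
            if fields and not fields[0].startswith("#"):
                grafts[fields[0]] = fields[1:]
    except IOError:
        pass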

    temp = run(["git-log", "--pretty=raw", newest_hash]).split("\n\n")
    commits = {}
    for headers,message in izip(temp[0::2], temp[1::2]):
        commit = {}
        commit_hash = None
        headers = [header.split(None, 1) for header in headers.split("\n")]
        for key,value in headers:
            if key == "parent":
                commit.setdefault("parents", []).append(value)
            elif key == "commit":
                commit_hash = value
            else:
                if key in commit:
                    raise Exception('Duplicate key "%s"' % key)
                commit[key] = value
        commit["message"] = "".join([line[4:]+"\n"
                                      for line in message.split("\n")])
        if commit_hash is None:
            raise Exception("Commit without hash")
        if commit_hash in grafts:
            commit["parents"] = grafts[commit_hash]
        if commit_hash == oldest_hash or "parents" not in commit:
            commit["parents"] = []
        commits[commit_hash] = commit

    print walk(commits, dict(), newest_hash, project)

try:
    import psyco
    psyco.full()
except ImportError:
    pass

if __name__ == "__main__": sys.exit(main(sys.argv[1:]))
