Re: Git's database structure

"Jon Smirl" <jonsmirl@xxxxxxxxx> · Tue, 4 Sep 2007 13:30:30 -0400

On 9/4/07, Julian Phillips <julian@xxxxxxxxxxxxxxxxx> wrote:
> On Tue, 4 Sep 2007, Jon Smirl wrote:
>
> > Let's back up a little bit from "Calculating tree node".  What are the
> > elements of git's data structures?
> >
> > Right now we have an index structure (tree nodes) integrated into a
> > base table. Integrating indexing into the data is not normally done in
> > a database. Doing a normalization analysis like this may expose flaws
> > in the way the data is structured. Of course we may also decide to
> > leave everything the way it is.
> >
> > What about the special status of a rename? In the current model we
> > effectively have three tables.
> >
> > commit - a set of all SHAs in the commit, previous commit, comment, author, etc
> > blob - a file, permissions, etc.
> > file names - name, SHA
> >
> > The file name table is encoded as an index and it has been
> > intermingled with the commit table.
> >
> > Looking at this from a set theory angle brings up the question: do we
> > really have three tables, with file names a variable independent of
> > the blobs, or should file names be an attribute of the blob?
>
> There isn't a one-to-one mapping of file names to blobs.  The blob only
> describes the contents of the file.  In the extreme case you could have
> one blob for every single file in your tree.  For example:
>
> # git ls-tree -r HEAD
> 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c    bar/foo
> 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c    foo
> 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c    foo2
> 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c    foo3
> 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c    foo4
> 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c    foo5
> 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c    foo6

Both schemes support aliasing. In the flat scheme you would create a
second blob that contains the file contents and the aliased path name.
Once that blob is delta-compressed against the first, the duplicate
copy of the file contents disappears.
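
A minimal Python sketch of the two layouts, to make the aliasing case
concrete. Only blob_id follows git's actual blob hashing; flat_id,
path_table and flat_store are hypothetical names, and the flat record
format is an assumption made for illustration, not anything git does.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """SHA-1 over git's blob header plus the contents."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def flat_id(path: str, content: bytes) -> str:
    """Hypothetical ID for a flat record that carries its own path."""
    return hashlib.sha1(path.encode("utf-8") + b"\0" + content).hexdigest()


content = b"hello\n"
aliases = ["foo", "bar/foo"]

# Separate path table: one blob, one (path, SHA) row per alias.
blobs = {blob_id(content): content}
path_table = [(path, blob_id(content)) for path in aliases]

# Flat scheme (assumed layout): the path is an attribute of the blob,
# so each alias is its own record and repeats the contents until a
# delta pass removes the duplicate.
flat_store = {flat_id(path, content): (path, content) for path in aliases}
```

Here blobs holds one entry and path_table two rows, while flat_store
holds two records with identical contents.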

I'm not proposing a change to the data stored in git; I'm proposing
that we consider the impact of how this data has been normalized in
the data store.
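
One way to check that only the normalization differs is to convert
between a separate path table and a flat path-on-blob layout and see
that nothing is lost. A rough sketch with toy data; to_flat, to_tables
and toy_id are hypothetical helpers, and toy_id is a plain SHA-1 of the
contents rather than git's real object ID.

```python
import hashlib


def toy_id(content: bytes) -> str:
    """Stand-in content hash (an assumption, not git's object ID)."""
    return hashlib.sha1(content).hexdigest()


def to_flat(blobs, path_table):
    """Denormalize: pair every path with a copy of its blob's contents."""
    return sorted((path, blobs[sha]) for path, sha in path_table)


def to_tables(flat_rows, make_id):
    """Normalize: split flat (path, contents) rows back into two tables."""
    blobs, path_table = {}, []
    for path, content in flat_rows:
        sha = make_id(content)
        blobs[sha] = content          # identical contents collapse to one blob
        path_table.append((path, sha))
    return blobs, sorted(path_table)


a, b = b"A\n", b"B\n"
blobs = {toy_id(a): a, toy_id(b): b}
path_table = sorted([("foo", toy_id(a)), ("bar/foo", toy_id(a)), ("baz", toy_id(b))])

flat = to_flat(blobs, path_table)
round_trip = to_tables(flat, toy_id)
```

With this toy data the round trip returns the original two tables. A
blob that no path refers to would not survive the flat form, so the
equivalence only holds for blobs that are reachable from some path.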

> > How this gets structured in the db is a question independent of
> > how renames get detected on a commit. The current scheme for detecting
> > renames by comparing diffs is working fine. The question is, once we
> > detect a rename, how should it be stored?
> >
> > Ignoring the performance impacts and looking at the problem from the
> > set theory viewpoint, should:
> > the pathnames be in their own table with a row for each alias, or
> > the pathnames be stored as an attribute of the blob?
> >
> > Both of these are the same information; we're just looking at how
> > things are normalized.
> >
> >
>
> --
> Julian
>
>   ---
> "You shouldn't make my toaster angry."
> -- Household security explained in "Johnny Quest"
>

-- 
Jon Smirl
jonsmirl@xxxxxxxxx