Re: Git's database structure

Andreas Ericsson <ae@xxxxxx> · Tue, 04 Sep 2007 18:29:50 +0200

Jon Smirl wrote:
On 9/4/07, Andreas Ericsson <ae@xxxxxx> wrote:
Jon Smirl wrote:
Let's back up a little bit from "Calculating tree node". What are the
elements of git's data structures?

Right now we have an index structure (tree nodes) integrated into a
base table. Integrating indexing into the data is not normally done in
a database. Doing a normalization analysis like this may expose flaws
in the way the data is structured. Of course we may also decide to
leave everything the way it is.

What about the special status of a rename? In the current model we
effectively have three tables.

commit - a set of all SHAs in the commit, previous commit, comment, author, etc
blob - a file, permissions, etc.
file names - name, SHA

commit - SHA1 of its parent(s) and its root-tree, along with
         author info and a free-form field
blob - content addressable by *multiple trees*
file names - List of path-names inside a tree object.

To draw some sort of relationship model here, you'd have

commit 1<->M roottree
tree M<->M tree
tree M<->M blob
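
A minimal sketch of the relationships listed above, assuming a toy
in-memory store. ToyStore and its methods are hypothetical names. The
"<kind> <size>\0" header matches how git derives object ids, but the
tree and commit payloads here are simplified text, not git's real
on-disk formats.

```python
import hashlib


def object_id(kind: str, payload: bytes) -> str:
    # SHA-1 over "<kind> <size>\0<payload>", as git does for its objects.
    header = f"{kind} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


class ToyStore:
    """Content-addressed store: identical content is stored exactly once."""

    def __init__(self):
        self.objects = {}  # object id -> (kind, payload)

    def put(self, kind: str, payload: bytes) -> str:
        oid = object_id(kind, payload)
        self.objects[oid] = (kind, payload)
        return oid

    def blob(self, content: bytes) -> str:
        return self.put("blob", content)

    def tree(self, entries: dict) -> str:
        # entries maps a path component to the id of a blob or another tree.
        # Simplified serialization (assumption): one "name<TAB>id" per line.
        lines = [f"{name}\t{oid}\n" for name, oid in sorted(entries.items())]
        return self.put("tree", "".join(lines).encode())

    def commit(self, root_tree: str, parents: list, author: str, message: str) -> str:
        # A commit names its root tree and its parent commit(s).
        lines = [f"tree {root_tree}"]
        lines += [f"parent {p}" for p in parents]
        lines += [f"author {author}", "", message]
        return self.put("commit", "\n".join(lines).encode())


store = ToyStore()
readme = store.blob(b"hello\n")
sub = store.tree({"README": readme})               # tree -> blob
root = store.tree({"COPYING": readme, "sub": sub})  # tree -> blob and tree -> tree

first = store.commit(root, [], "A U Thor", "initial")
second = store.commit(root, [first], "A U Thor", "no file changes")
```

The same blob is referenced from two trees and the same root tree from
two commits, yet the store holds only five objects in total.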

By introducing tree nodes you have blended a specific indexing scheme
into the data. There are many other ways the path names could be
indexed: hash tables, binary trees, etc.

This problem exists in file systems. Since the path names have been
encoded into the directory structures, there is no way to query
something like "all files created yesterday" from a file system
without building another mapping table or doing a brute-force search.
I keep using Google as an example: Google indexes hierarchical URLs,
but it does not use a hierarchical index to do it.

Pathnames are by far the most common search-/delimiting criteria for
git though, so I fail to see why this is a problem for you.
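
To make the trade-off in this exchange concrete, here is a rough sketch,
assuming a hypothetical nested-dict hierarchy in which each file carries
a creation date (git's tree objects do not record one; the field exists
only for illustration). A path lookup descends one level per component,
while a query on any other attribute needs a full walk or a secondary
index built from one.

```python
from datetime import date

# Hypothetical hierarchy: directories are dicts, files are (content, created).
tree = {
    "Makefile": ("all:\n", date(2007, 9, 3)),
    "src": {
        "main.c": ("int main(void) { return 0; }\n", date(2007, 9, 4)),
        "util": {"hash.c": ("/* ... */\n", date(2007, 9, 3))},
    },
}


def lookup(root, path):
    """Resolve a path by descending one tree level per path component."""
    node = root
    for part in path.split("/"):
        node = node[part]
    return node


def walk(root, prefix=""):
    """Yield (path, file) for every file; this visits the whole hierarchy."""
    for name, node in sorted(root.items()):
        path = prefix + name
        if isinstance(node, dict):
            yield from walk(node, path + "/")
        else:
            yield path, node


def index_by_date(root):
    """Build a secondary index (creation date -> paths) from one full walk."""
    index = {}
    for path, (_content, created) in walk(root):
        index.setdefault(created, []).append(path)
    return index


hash_c = lookup(tree, "src/util/hash.c")  # cheap: three dict lookups
by_date = index_by_date(tree)             # expensive: touches every file once
```

The hierarchy serves the path query directly; the date query only
becomes cheap after the extra mapping table has been built.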

Databases keep the knowledge of how things are indexed out of the
data. A data structure analysis of git should remove the blended index
and start from the set theory.

Why? This is the core of the problem, really. You haven't specified a
single, real-life reason *why* it should be any other way than it
already is. It sounds a bit to me as if you've been to a really
inspiring seminar about "how database-like things *should* be done"
and then decided to go berserk on your favourite database-like thing,
which is git.

Code and benchmarks or bust. In the meantime, I'll settle for a recount
of what problems you're having with the current layout, or what gains
you're hoping to achieve with the new one. As it's the 3rd time I'm
asking, this'll be the last.

--
Andreas Ericsson                   andreas.ericsson@xxxxxx
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231