On 9/4/07, Julian Phillips <julian@xxxxxxxxxxxxxxxxx> wrote:
> On Tue, 4 Sep 2007, Jon Smirl wrote:
>
> > Let's back up a little bit from "Calculating tree node". What are the
> > elements of git's data structures?
> >
> > Right now we have an index structure (tree nodes) integrated into a
> > base table. Integrating indexing into the data is not normally done in
> > a database. Doing a normalization analysis like this may expose flaws
> > in the way the data is structured. Of course we may also decide to
> > leave everything the way it is.
> >
> > What about the special status of a rename? In the current model we
> > effectively have three tables.
> >
> > commit - a set of all SHAs in the commit, previous commit, comment, author, etc
> > blob - a file, permissions, etc.
> > file names - name, SHA
> >
> > The file name table is encoded as an index and it has been
> > intermingled with the commit table.
> >
> > Looking at this from a set theory angle brings up the question, do we
> > really have three tables and file names are an independent variable
> > from the blobs, or should file names be an attribute of the blob?
>
> There isn't a one-to-one mapping of file names to blobs. The blob only
> describes the contents of the file. In the extreme case you could have
> one blob for every single file in your tree. For example:
>
> # git ls-tree -r HEAD
> 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c bar/foo
> 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo
> 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo2
> 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo3
> 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo4
> 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo5
> 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo6

Both schemes support aliasing. In the flat scheme you would create a
second blob which contains the file and the aliased path name. When
the blobs get delta'd, the second copy of the file contents will
disappear.
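
To make the two normalizations concrete, here is a rough Python sketch.
It is only an illustration of the set-theory argument, not how git
stores anything: the function names (to_path_table, to_flat,
flat_to_path_table) and the row layouts are made up for this example.
The one real detail is blob_sha, which hashes content the way git hashes
a blob (SHA-1 over "blob <size>\0" followed by the data).

```python
import hashlib


def blob_sha(content: bytes) -> str:
    """Hash content as git does for a blob: 'blob <size>\\0' + data, SHA-1."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def to_path_table(files):
    """Scheme A (hypothetical layout): path names live in their own table.

    Returns (blobs, paths): blobs maps SHA -> content, and paths holds
    one (path, SHA) row per alias.
    """
    blobs = {}
    paths = []
    for path, content in files.items():
        sha = blob_sha(content)
        blobs[sha] = content  # identical content collapses to one entry
        paths.append((path, sha))
    return blobs, sorted(paths)


def to_flat(files):
    """Scheme B (hypothetical layout): the path is an attribute of the blob.

    Every alias gets its own (path, SHA, content) row, so aliased content
    is repeated until something like delta compression removes the copy.
    """
    return sorted((path, blob_sha(content), content)
                  for path, content in files.items())


def flat_to_path_table(rows):
    """Project the flat rows back onto the two tables of scheme A."""
    blobs = {sha: content for _, sha, content in rows}
    paths = sorted((path, sha) for path, sha, _ in rows)
    return blobs, paths


# Three aliases of the same content, as in the ls-tree example above.
files = {"bar/foo": b"hello\n", "foo": b"hello\n", "foo2": b"hello\n"}
blobs, paths = to_path_table(files)
flat = to_flat(files)
```

In this toy model flat_to_path_table(flat) gives back exactly
(blobs, paths), which is the sense in which both layouts carry the same
information and differ only in normalization.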

I'm not proposing a change to the data being stored in git; I'm
proposing that we consider the impact of how this data has been
normalized in the data store.

> > How this gets structured in the db is a question independent of
> > how renames get detected on a commit. The current scheme for detecting
> > renames by comparing diffs is working fine. The question is, once we
> > detect a rename how should it be stored?
> >
> > Ignoring the performance impacts and looking at the problem from the
> > set theory view point, should:
> > the pathnames be in their own table with a row for each alias
> > the pathnames be stored as an attribute of the blob
> >
> > Both of these are the same information, we're just looking at how
> > things are normalized.
> >
>
> --
> Julian
>
> ---
> "You shouldn't make my toaster angry."
> -- Household security explained in "Johnny Quest"
>

--
Jon Smirl
jonsmirl@xxxxxxxxx