Re: Tracking file metadata in git -- fix metastore or enhance git?

Jonathan Nieder <jrnieder@xxxxxxxxx> · Fri, 8 Apr 2011 14:45:48 -0500

Thorsten Glaser wrote:
> Jonathan Nieder dixit:

>> I think the most native-looking way to store metadata associated to
>> paths is .gitattributes.  It also has the nice feature of allowing a
>> single attribute to apply to multiple files.
>
> Eh, no. Think of extended attributes like, say, NTFS Resource Forks.
> They're just different "lines" into the "plane" a file can be, if
> you excuse the metapher. (All parallel, of course.)

Do you mean no, it doesn't have that feature? ;-)

Each git commit (try it with "git cat-file commit HEAD") looks like so:

	tree <tree name>
	parent <commit name for first parent>
	parent <commit name for second parent>
	...
	author <author identity and time of authorship>
	committer <committer identity and time committed>
	encoding <encoding of log message (optional)>

	<free-form change description>
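To make the layout above concrete, here is a minimal sketch that splits the text of a commit object into its header fields and its change description. The function name and the sample object names and identities are made up for illustration, and the sketch ignores multi-line header values.

```python
def parse_commit(raw):
    """Split commit text into a list of (field, value) headers and the message."""
    # The headers end at the first blank line; everything after is free-form.
    header_text, _, message = raw.partition("\n\n")
    headers = []
    for line in header_text.split("\n"):
        field, _, value = line.partition(" ")
        headers.append((field, value))
    return headers, message


# Illustrative sample only: these object names and identities are invented.
sample = (
    "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    "parent 1111111111111111111111111111111111111111\n"
    "parent 2222222222222222222222222222222222222222\n"
    "author A U Thor <author@example.com> 1302291948 -0500\n"
    "committer C O Mitter <committer@example.com> 1302291948 -0500\n"
    "\n"
    "Merge two lines of development\n"
)

headers, message = parse_commit(sample)
# A field such as "parent" may repeat, so headers stay a list, not a dict.
parents = [value for field, value in headers if field == "parent"]
```

Note that nothing in the header list is keyed by path, which is why per-path data has no natural home there.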

Where could one sneak in some per-path metadata?

 - as new header fields after "encoding" (teaching git fsck, git commit
   --amend, and so on about it)?  That can work but it would slow down
   operations not interested in this metadata.  It is best not to have
   O(number of paths) header fields.

 - in the change description?  Yes, that can work, too, and it doesn't
   even require changing the commit format.

 - a new header field pointing to another object?  That is possible as
   a last resort.

Anyway, filenames and associated content are not what commits are
about; commits are just nodes in a revision graph, with trees representing
the tracked trees.

Okay, so what about the trees?

	<mode> SP <filename> NUL <object name>
	...
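As a rough illustration of how little room an entry leaves, here is a sketch of a parser for the raw entry list. It assumes the body of a tree object with the object header already stripped, and it assumes object names are stored as 20 raw bytes (SHA-1). The function name and sample entries are invented.

```python
def parse_tree(raw, hash_len=20):
    """Parse raw tree entries: <mode> SP <filename> NUL <object name>."""
    entries = []
    pos = 0
    while pos < len(raw):
        sp = raw.index(b" ", pos)        # end of the ASCII octal mode
        nul = raw.index(b"\0", sp + 1)   # end of the filename
        mode = raw[pos:sp].decode("ascii")
        name = raw[sp + 1:nul]           # filenames are bytes, not text
        oid = raw[nul + 1:nul + 1 + hash_len]
        if len(oid) != hash_len:
            raise ValueError("truncated tree entry")
        entries.append((mode, name, oid.hex()))
        pos = nul + 1 + hash_len
    return entries


# Illustrative sample: two entries pointing at the same (empty) blob.
blob_id = bytes.fromhex("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")
raw_tree = (
    b"100644 .gitattributes\0" + blob_id
    + b"100755 build.sh\0" + blob_id
)
entries = parse_tree(raw_tree)
```

Each entry is just those three fields, so any extra per-path data has to be encoded in one of them.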

Where can we sneak something in?

 - use a currently invalid <mode>?  No, tracking metadata is probably
   not worth breaking old git clients.
 - use an invalid object name?  No (for the same reason).
 - use a special filename?  Then old git clients will treat the file
   as a regular file, so they still get access to the data.

So you see, using ordinary files (whether called .gitattributes or
foo.c.ntfs-resource-fork) to track this extra data makes a lot of
sense.

Now Michael mentioned an alternative, which is to store this
information in separate objects.  That way, you could push your
history without the extra metadata, you could edit the metadata
without changing the commit names of the history, separately
garbage-collect metadata you're not interested in, etc.  If that is
your goal, then "git notes" is exactly the right solution.
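For illustration, a small sketch of driving "git notes" from a script. The notes ref name "file-metadata" and the helper names are hypothetical; the git invocations themselves (git notes --ref=<ref> add -f -m <text> <object>, and git notes --ref=<ref> show <object>) are standard. The run() helper needs a git repository to do anything useful.

```python
import subprocess

NOTES_REF = "file-metadata"  # hypothetical notes ref, chosen for this sketch


def notes_add_command(obj, text, ref=NOTES_REF):
    """Build the argv that attaches (or overwrites, via -f) a note on obj."""
    return ["git", "notes", "--ref=" + ref, "add", "-f", "-m", text, obj]


def notes_show_command(obj, ref=NOTES_REF):
    """Build the argv that prints the note attached to obj."""
    return ["git", "notes", "--ref=" + ref, "show", obj]


def run(cmd, cwd="."):
    """Run a git command inside the repository at cwd and return its stdout."""
    result = subprocess.run(
        cmd, cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout


add_cmd = notes_add_command("HEAD", "owner=root mode=0640 path=etc/shadow")
show_cmd = notes_show_command("HEAD")
```

Because the notes live under their own ref, rewriting them does not change the names of the commits they annotate.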

> They are just
> another facet of each file.

Sure, like the atime, the inode number, the uid of the user who wrote
it, and the model number of the disk used to store it.

Oh, you mean they're _relevant_ facets?  Yes, that's believable,
though I suspect that's only going to sometimes be the case.  So the
operator should say "yes, I'm interested in tracking this extra
information".  To summarize the above, some ways this could work
behind the scenes:

 * dotfiles with metadata;

 * a Makefile to install files with metadata (i.e., the "source"
   consists of plain files, while the "build product" has the
   specified metadata);

 * something else.  Hopefully the above explains the relevant
   constraints so you can surprise us.
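As one possible shape for the "dotfiles with metadata" option, here is a minimal sketch that records permission bits and mtimes in an ordinary tracked file and reapplies them later. The file name ".file-metadata.json", the JSON layout, and the function names are all assumptions made for this example, not an existing tool's format. It assumes a POSIX-like filesystem and skips symlinks.

```python
import json
import os
import stat
import tempfile

METADATA_FILE = ".file-metadata.json"  # hypothetical dotfile name


def save_metadata(root):
    """Record mode bits and mtime of every regular file under root."""
    records = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != ".git"]  # skip git's own data
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name == METADATA_FILE or os.path.islink(path):
                continue
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            st = os.lstat(path)
            records[rel] = {
                "mode": stat.S_IMODE(st.st_mode),
                "mtime": int(st.st_mtime),
            }
    with open(os.path.join(root, METADATA_FILE), "w") as f:
        json.dump(records, f, indent=1, sort_keys=True)
    return records


def restore_metadata(root):
    """Reapply the recorded mode bits and mtimes, e.g. after a checkout."""
    with open(os.path.join(root, METADATA_FILE)) as f:
        records = json.load(f)
    for rel, meta in records.items():
        path = os.path.join(root, *rel.split("/"))
        os.chmod(path, meta["mode"])
        os.utime(path, (meta["mtime"], meta["mtime"]))


# Round trip in a scratch directory: save, clobber the mode, restore.
with tempfile.TemporaryDirectory() as root:
    script = os.path.join(root, "script.sh")
    with open(script, "w") as f:
        f.write("#!/bin/sh\n")
    os.chmod(script, 0o750)
    saved = save_metadata(root)
    os.chmod(script, 0o600)
    restore_metadata(root)
    restored_mode = stat.S_IMODE(os.lstat(script).st_mode)
```

Since the dotfile is an ordinary file, old git clients still check it out and merge it like any other content; only the restore step needs extra tooling.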

Hope that helps.
Jonathan