Git for collaboration on RDF data

Mark Watts <watts.mark2015@xxxxxxxxx> · Fri, 6 Feb 2015 13:43:50 -0600

I'm interested in a collaboration and change management solution for
data stored in pre-existing RDF data stores that sit behind SPARQL
endpoints. I would like some input on my idea before I invest too much
time in reading about Git internals. My main question is whether
people with more experience of how Git works internally think that my
problem could be solved by using Git itself, or whether I would be
better served by developing my own toolkit. The first four paragraphs
summarize why I'm even considering this solution.

I think that managing the versioning of data externally, without
including that information in the data store itself, would greatly
reduce the usefulness of tracking changes. For example, if multiple
versions are exposed through the SPARQL endpoint, readers can compare
versions by querying with SPARQL rather than by referring back to a
serialized representation of the data in an adjacent repository. This
matters most when the data store is accessed by non-human agents,
since I expect that modifying a query or two in such an agent is
easier than adding a feature for reading from an adjacent repository
over a distinct set of protocols. Beyond that, there is the danger of
receiving a different set of data than you expected, which is
difficult to detect without cryptographic guarantees about version
information.

I like the idea of using Git since it has gained wide acceptance and
general understanding, even among people outside the software
development profession, who I expect will generate most of the data
to be tracked. As for collaboration: if, for example, I generate some
preliminary data in my lab and want to share it in RDF, Git-style
branching lets me set this preliminary data apart while still making
it available to peers and relating it to previously existing data.

My initial requirement for this solution is that the time needed to
complete a commit or merge shouldn't grow in proportion to the size of
the database, since I want to track stores that can grow to millions
of statements and several gigabytes in size. Given the size of data I
expect to manage, I also think that partial sharing of a repository
would be useful, but I'm not certain that this is a requirement.

My idea is to embed at least the object graph of Git in the managed
RDF graph and to make it possible to clone the tracked portion of a
graph by using SPARQL queries. Blobs would correspond to named graphs
in RDF and their hashes would be computed from a serialization of the
graph with "canonicalized" BNodes, and trees would be sets of triples
linking "tree nodes" to "tree entry nodes" to named graph identifiers
paired with blob object ids. For actually manipulating the commit
graph, I expect either to write my own tools or to use FUSE to expose
the RDF graph as a file system that Git can manipulate as it does
normal source code repositories. I like the second option, first,
because it means people can use readily available tools in a way
analogous to how they already use them, and, second, because it gives
access to Git features for manipulating the RDF graph (for better or
worse) in ways that I don't have to explicitly define. My present
concerns with this second option are that I don't yet know everything
that Git does on a file system, which I would need to know in order
to simulate one, and that I'm not sure I would like the RDF graph
that this approach generates. The advantage of the first option is
that I know better what to expect if I go that route, and its
disadvantages are, essentially, the advantages of the second.
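
To make the blob and tree mapping a little more concrete, here is a
rough Python sketch of how the object ids might be computed. The only
part taken from Git itself is the object hashing scheme (SHA-1 over a
"<type> <size>\0" header followed by the payload) and the tree entry
layout. Everything else is an assumption for illustration: the
function names are hypothetical, the graph is assumed to arrive as
N-Triples lines whose BNode labels have already been canonicalized by
some other step, sorting the lines stands in for a real canonical
serialization, and the tree entry names are short hypothetical labels
rather than full graph IRIs.

```python
import hashlib
from typing import Dict, Iterable


def git_object_id(kind: str, payload: bytes) -> str:
    """Hash a payload the way Git hashes loose objects:
    SHA-1 over b"<kind> <size>\\0" followed by the payload."""
    header = f"{kind} {len(payload)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def graph_blob_id(ntriples_lines: Iterable[str]) -> str:
    """Blob id for one named graph.

    Assumption: each line is one N-Triples statement and BNode labels
    are already canonical. Sorting and de-duplicating the lines is a
    stand-in for a proper canonical serialization.
    """
    statements = sorted({line.strip() for line in ntriples_lines if line.strip()})
    payload = "".join(s + "\n" for s in statements).encode("utf-8")
    return git_object_id("blob", payload)


def tree_id(entries: Dict[str, str]) -> str:
    """Tree id for a flat set of (entry name -> blob id) pairs.

    Each entry is encoded as b"100644 <name>\\0" plus the 20 raw bytes
    of the blob id, with entries ordered by name (flat trees only, no
    subtrees).
    """
    ordered = sorted(entries.items(), key=lambda kv: kv[0].encode("utf-8"))
    payload = b"".join(
        b"100644 " + name.encode("utf-8") + b"\0" + bytes.fromhex(oid)
        for name, oid in ordered
    )
    return git_object_id("tree", payload)


if __name__ == "__main__":
    # Hypothetical example data; the IRIs and the entry name are made up.
    lab_graph = [
        '<http://example.org/s> <http://example.org/p> "2" .',
        '<http://example.org/s> <http://example.org/p> "1" .',
    ]
    blob = graph_blob_id(lab_graph)
    print("blob:", blob)
    print("tree:", tree_id({"lab-results": blob}))
```

Because the statements are sorted before hashing, the same set of
triples gives the same blob id regardless of the order in which the
store returns them, which is the property I would need for comparing
versions by id alone.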

Any comment or criticism is welcome.

-- 
Cheers,

Mark W.