On Tue, Jun 14, 2011 at 12:47, Jeff King <peff@xxxxxxxx> wrote: > On Tue, Jun 14, 2011 at 12:20:29PM -0700, Shawn O. Pearce wrote: > >> > We would want to store the cache in an on-disk format that could be >> > searched easily. Possibly something like the packed-refs format would be >> > sufficient, if we mmap'd and binary searched it. It would be dirt simple >> > if we used an existing key/value store like gdbm or tokyocabinet, but we >> > usually try to avoid extra dependencies. >> >> Yea, not a bad idea. Use a series of SSTable like things, like Hadoop >> uses. It doesn't need to be as complex as the Hadoop SSTable concept. >> But a simple sorted string to string mapping file that is immutable, >> with edits applied by creating an overlay file that contains >> new/updated entries. >> >> As you point out, we can use the notes tree to tell us the validity of >> the cache, and do incremental updates. If the current cache doesn't >> match the notes ref, compute the tree diff between the current cache's >> source tree and the new tree, and create a new SSTable like thing that >> has the relevant updates as an overlay of the existing tables. After >> some time you will have many of these little overlay files, and a GC >> can just merge them down to a single file. > > I was really hoping that it would be fast enough that we could simply > blow away the old mapping and recreate it from scratch. That gets us out > of writing any journaling-type code with overlays. For something like > svn revisions, it's probably fine to take an extra second or two to > build the cache after we do a fetch. But it wouldn't scale to something > that was getting updated frequently. > > If we're going to start doing clever database-y things, I'd much rather > use a proven key/value db solution like tokyocabinet. I'm just not sure > how to degrade gracefully when the db library isn't available. Don't > allow reverse mappings? Fallback to something slow? This is why I would prefer to build the solution into Git. Its not that bad to do a sorted string file. Take something simple that is similar to a pack file: GSST | vers | rcnt | srctree | base [ klen key vlen value ]* [ roff ]* SHA1(all_of_above) Where vers is a version number, rcnt is the number of records, srctree is the SHA-1 of the notes tree this thing indexed from, and base is the SHA-1 of the notes tree this "applies on top of". There are then rcnt records in the file, each using a variable length key length and value length field (klen, vlen), with variable length key and values. At the end of the file are rcnt 4 byte offsets to the start of each key. When writing the file, write all of the above to a temporary file, then rename it to $GIT_DIR/cache/db-$SHA1.db, as its a unique name. Its easy to prepare the list of entries in memory as an array of structs of key/value pairs, sort them with qsort(), write them out and update offsets as you go, then dump out the offset table at the end. One could compress the offset table by only storing every N offsets, readers perform binary search until they find the first key that is before their desired key, then sequentially scan records until they locate the correct entry... but I'm not sure the space savings is really worthwhile here. When reading, scan the directory and read the headers of each file. If the file has your target srctree, your cache is current and you can read it. If a key isn't in this file, you open the file named $GIT_DIR/cache/db-$base.db and try again there, walking back along that base chain until base is '0'x40. (or some other marker in the header to denote there is no base file). GC is just a matter of merging the sorted files together. Follow along all of the base pointers, open all of them, scan through the records and write out the first key that is defined. I guess we need a small "delete" bit in the record to indicate a particular key/value was removed from the database. Since this is a reverse mapping, duplicates are possible, and readers that want all values need to scan back to the base file, but skip base entries that were marked deleted in a newer file. Updating is just preparing a new file that uses the current srctree as your base, and only inserting/sorting the paths that were different in the notes. We probably need to store these files keyed by their notes ref, so we can find "svn/revisions" differently from "bugzilla" (another hypothetical mapping of Bugzilla bug ids to commit SHA-1s, based on notes that attached bug numbers to commits). I don't think its that bad. Maybe its a bit too much complexity for version 1 to have these incremental update files be supported, but it shouldn't be that hard. >> The only problem is, you probably want this "reverse notes index" to >> be indexing a portion of the note blob text, not all of it. That is, >> we want the SVN note text to say something like "SVN Revision: r1828" >> so `git log --notes=svn` shows us something more useful than just >> "r1828". But in the reverse index, we may only want the key to be >> "r1828". So you need some sort of small mapping function to decide >> what to put into that reverse index. > > I had assumed that we would just be writing r1828 into the note. The > output via git log is actually pretty readable: > > $ git notes --ref=svn/revisions add -m r1828 > $ git show --notes=svn/revisions > ... > Notes (svn/revisions): > r1828 > > Of course this is just one use case. Thanks, I keep forgetting that the notes prints the note ref name out before the text, so its already got this annotation present. This makes it much more likely that the bare "r1828" text is acceptable in the note, and that the reverse index is just the entire content of the blob as the key. :-) > For that matter, we have to figure out how one would actually reference > the reverse mapping. If we have a simple, pure-reverse mapping, we can > just generate and cache them on the fly, and give a special syntax. > Like: > > $ git log notes/svn/revisions@{revnote:r1828} Uhm. Ick. > The syntaxes are not as nice as having a real ref. In the last example, > we could probably look for the contents of "@{}" as a possible revnote > mapping (since we've already had to name it via the configuration), to > make it "r1828@{svn}". Or you could even come up with a default set of > revnotes to consider, so that if we lookup "r1828" and it isn't a real > ref, we fall back to trying r1828@{revnote:svn}. Or, what about setting up a fake ref namespace: git config ref.refs/remotes/svn/*.from refs/notes/svn/revisions Then `git log svn/r1828` works. But these aren't real references. We would only want to consider them if a request matched the glob, so `git for-each-ref` and `git upload-pack` aren't reporting these things by default, and neither is `git log --all` or `gitk --all`. I agree a syntax that works out of the box without a configuration file change would be nicer. But we are running out of operators to do that with. `git log notes/svn/revisions@{revnote:r1828}` as you propose above is at least workable... -- Shawn. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html