On 04/03/2012 09:07 PM, Thomas Gummerer wrote:
> On Apr 3, 2012, at 10:51 AM, Michael Haggerty wrote:
>> On 04/02/2012 11:02 PM, Thomas Gummerer wrote:
>>> - Append-only data structure
>>> [...]
>>> To make sure the index isn't corrupted, without calculating the SHA-1
>>> hash for the whole index file every time something is changed, the hash
>>> is always calculated for the whole index when merging, but when only a
>>> single entry is changed the SHA-1 hash is only calculated for the last
>>> change. This will increase the cost for reading the index to
>>> log(n) + k * log(k), where n is the number of entries in the sorted
>>> part of the index and k is the number of entries in the unsorted part
>>> of the index, which will have to be merged with the rest of the index.
>>
>> I don't understand this analysis of the reading time. I suppose you are
>> assuming that you want to read the status of a single file. But in that
>> case, it is enough to find the entry in the old index (O(log(n)),
>> assuming some sort of tree structure) plus do a linear scan through the
>> unsorted entries (i.e., O(k), not O(k log(k))).
>
> The way git currently operates, it always reads the whole index, making
> it necessary to merge the unsorted entries with the sorted part.
> Thinking about it, it would even be O(k log(n)), because the appended
> part is unsorted.
> O(log(n)) + O(k) would be the complexity for loading only a single
> entry from the index.

I was confused because in your original mail you seemed to claim that
reading the whole sorted part of the index scales like O(log(n)),
whereas it certainly scales at least like O(n).

To read the whole index using an append-only data structure, I would do
the following:

1. Read the file header to find where the addenda begin: one seek plus
O(1).

2. Read the addenda in order (I assume each addendum to be sorted on
disk), and merge-sort the addenda together, discarding the earlier of
any duplicates: one seek plus O(k) I/O plus O(k log(k)) computation
(this is the worst case, if each addendum contains a single file).

3. Read the sorted part of the file in order, while merging it together
with the combined addenda: one seek plus O(n) I/O plus O(n + k)
computation.

Total: 3 seeks plus O(n + k) I/O plus O(n + k log(k)) computation.
(Both the single-entry lookup and this whole-index merge are sketched
in code below.)

For a B-tree, on the other hand, it is hard to estimate the complexity,
because you have provided very little detail about how you want to lay
the data structure out on disk. But presumably the number of seeks will
be significantly larger. And if you are not careful, the number of
seeks will approach the number of nodes in the index, O(n), or perhaps
the number of added nodes (not files!), which could grow like
O(k log(k)).

Michael

-- 
Michael Haggerty
mhagger@xxxxxxxxxxxx
http://softwareswirl.blogspot.com/
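
To make the single-file case concrete, here is a minimal Python sketch
of the O(log(n)) + O(k) lookup. This is an illustration, not git's
actual code: the in-memory layout as (path, entry) tuples and the
function name lookup_one are assumptions made for the example.

    from bisect import bisect_left

    def lookup_one(base, tail, path):
        # base: the n (path, entry) pairs of the sorted index section.
        # tail: the k appended (path, entry) pairs, in write order.
        # Scan the unsorted appended entries first -- O(k) -- keeping
        # the last match, since a later append supersedes everything
        # written before it.
        hit = None
        for p, e in tail:
            if p == path:
                hit = e
        if hit is not None:
            return hit
        # No appended entry for this path: binary-search the sorted
        # base -- O(log(n)).  (path,) sorts just before (path, entry),
        # so bisect_left lands on the matching entry if one exists.
        i = bisect_left(base, (path,))
        if i < len(base) and base[i][0] == path:
            return base[i][1]
        return None

For example, with base = [("Makefile", 1), ("cache.h", 2)] and
tail = [("cache.h", 3)], lookup_one(base, tail, "cache.h") returns 3,
the superseding appended entry.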
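
And here, under the same assumed layout, is a sketch of the whole-index
read from steps 2 and 3 above (step 1, seeking past the header, is pure
I/O and has no counterpart in an in-memory sketch):

    def read_whole_index(base, addenda):
        # base:    the n sorted (path, entry) pairs of the main section.
        # addenda: the appended sections, oldest first, each sorted on
        #          disk.

        # Step 2: combine the addenda, later entries overwriting
        # earlier duplicates, then sort the k survivors: O(k log(k))
        # in the worst case, i.e. one entry per addendum.
        newest = {}
        for addendum in addenda:
            for path, entry in addendum:
                newest[path] = entry
        combined = sorted(newest.items())

        # Step 3: one sequential pass merging the sorted base with the
        # combined addenda; an appended entry supersedes the base entry
        # for the same path: O(n + k).
        merged, i, j = [], 0, 0
        while i < len(base) and j < len(combined):
            if base[i][0] < combined[j][0]:
                merged.append(base[i])
                i += 1
            elif combined[j][0] < base[i][0]:
                merged.append(combined[j])
                j += 1
            else:
                merged.append(combined[j])
                i += 1
                j += 1
        merged.extend(base[i:])
        merged.extend(combined[j:])
        return merged

The dict makes the later-addendum-wins rule trivial; a true k-way
merge-sort of the individual sorted addenda would reach the same
O(k log(k)) bound without the hash table.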