Re: [GSoC] Designing a faster index format

On 04/03/2012 09:07 PM, Thomas Gummerer wrote:
> On Apr 3, 2012, at 10:51 AM, Michael Haggerty wrote:
>> On 04/02/2012 11:02 PM, Thomas Gummerer wrote:
>>> - Append-only data structure
>>> [...]
>>> To make sure the index isn't corrupted without calculating the SHA-1
>>> hash for the whole index file every time something is changed, the hash
>>> is always calculated for the whole index when merging, but when only a
>>> single entry is changed the SHA-1 hash is calculated only for the last
>>> change. This increases the cost of reading the index to
>>> O(log(n) + k * log(k)), where n is the number of entries in the sorted
>>> part of the index and k is the number of entries in the unsorted part,
>>> which has to be merged with the rest of the index.
>>
>> I don't understand this analysis of the reading time.  I suppose you are
>> assuming that you want to read the status of a single file.  But in that
>> case, it is enough to find the entry in the old index (O(log(n))
>> assuming some sort of tree structure) plus do a linear scan through the
>> unsorted entries (i.e., O(k), not O(k log(k))).
> 
> The way git currently operates, it always reads the whole index, making it
> necessary to merge the unsorted entries with the sorted part. Thinking
> about it, that would even be O(k log(n)), because the appended part is
> unsorted. O(log(n)) + O(k) would be the complexity for loading only a
> single entry from the index.

I was confused because in your original mail you seemed to claim that
reading the whole sorted part of the index scales like O(log(n)), whereas
it certainly scales at least like O(n).
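
For the single-entry case, something like the following would give the
O(log(n)) + O(k) behavior: binary search in the sorted base, then a linear
scan through the unsorted appended entries, letting a later entry override
an earlier one.  (This is an untested sketch, not actual git code; struct
entry and all the names here are hypothetical.)

#include <stdio.h>
#include <string.h>

struct entry {
	const char *path;	/* hypothetical index entry: path is the key */
	const char *sha1;
};

/* O(log(n)) binary search in the sorted base of the index */
static const struct entry *find_sorted(const struct entry *base, size_t n,
				       const char *path)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = strcmp(path, base[mid].path);

		if (!cmp)
			return &base[mid];
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return NULL;
}

/* O(log(n)) + O(k): scan the unsorted tail linearly; a later
 * appended entry overrides the base and any earlier append */
static const struct entry *lookup(const struct entry *base, size_t n,
				  const struct entry *tail, size_t k,
				  const char *path)
{
	const struct entry *found = find_sorted(base, n, path);
	size_t i;

	for (i = 0; i < k; i++)
		if (!strcmp(tail[i].path, path))
			found = &tail[i];
	return found;
}

int main(void)
{
	const struct entry base[] = {
		{ "Makefile", "1111" },
		{ "cache.h", "2222" },
	};
	const struct entry tail[] = {
		{ "cache.h", "3333" },	/* changed after the last merge */
	};
	const struct entry *e = lookup(base, 2, tail, 1, "cache.h");

	printf("%s %s\n", e->path, e->sha1);	/* prints "cache.h 3333" */
	return 0;
}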

To read the whole index if using an append-only data structure, I would
do the following:

1. Read the file header to find where the addenda begin: one seek plus O(1).
2. Read the addenda in order (I assume each addendum to be sorted on
disk), and merge-sort the addenda together, discarding the earlier of
any duplicates: one seek plus O(k) I/O plus O(k log k) computation (this
is the worst case, if each addendum contains a single file).
3. Read the sorted part of the file in order, while merging it together
with the combined addenda: one seek plus O(n) I/O plus O(n + k) computation.

Total: 3 seeks plus O(n + k) I/O plus O(n + k log(k)) computation.
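
For concreteness, step 3 might look something like this (again an untested
sketch, reusing the hypothetical struct entry from the lookup sketch above):

#include <string.h>	/* plus struct entry as defined earlier */

/* One merging pass over the sorted base and the combined addenda;
 * on equal paths the addendum entry wins, so the earlier of any
 * duplicates is discarded.  O(n + k) comparisons in total. */
static size_t merge_index(const struct entry *base, size_t n,
			  const struct entry *add, size_t k,
			  struct entry *out)
{
	size_t i = 0, j = 0, o = 0;

	while (i < n && j < k) {
		int cmp = strcmp(base[i].path, add[j].path);

		if (cmp < 0)
			out[o++] = base[i++];
		else if (cmp > 0)
			out[o++] = add[j++];
		else {
			out[o++] = add[j++];	/* newer entry wins */
			i++;			/* drop the stale base entry */
		}
	}
	while (i < n)
		out[o++] = base[i++];
	while (j < k)
		out[o++] = add[j++];
	return o;
}

Combining the per-addendum runs into one sorted run (step 2) is an ordinary
k-way merge, which is where the O(k log(k)) worst case comes from.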

Whereas for a B-tree it is hard to estimate the complexity, because you
have provided very little detail about how you want to lay the data
structure out on disk.  But presumably the number of seeks will be
significantly larger.  And if you are not careful, the number of seeks
will approach the number of nodes in the index, O(n), or perhaps the
number of added nodes (not files!), which could grow like O(k log(k)).
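
To put rough numbers on that (my own assumptions, not anything from your
proposal): with n = 100,000 entries and a branching factor of 64, a single
lookup would cost about log_64(100000), i.e. around 3 seeks, which is fine;
but reading the whole index in sorted order would touch all ~1,600 leaf
nodes, i.e. on the order of 1,600 seeks unless the leaves happen to be
contiguous on disk, versus the 3 seeks of the append-only layout above.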

Michael

-- 
Michael Haggerty
mhagger@xxxxxxxxxxxx
http://softwareswirl.blogspot.com/