On Fri, Mar 30, 2012 at 4:06 AM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Is it really the time consuming part in the overall picture? Do you have
> vital statistics for various pieces to justify that you are solving the
> right problem? E.g. (not exhaustive)
>
> - read_index(): overhead to
>   - validate the checksum of the whole index;
>   - extract entries into the_index.cache[] array;
>
> - write_index(): overhead to
>   - serialize entries into the on-disk format;
>   - compute the checksum over the entire index;
>   - write the whole index to disk;
>
> - frequency of the above two operations.

Also maybe the frequency of entry updates vs. additions/removals. I
suspect a refresh operation can in some cases update a lot of entries.
If that's the case (and it happens often), we may need special
treatment for it, because simply appending entries might be costly.

> Also, optimizing the write codepath by penalizing the read codepath is a
> wrong trade-off if the index is read far more often than it is written
> during a single day of a developer.

I suspect so too, but some measurements have to be done there. It'd be
good if you provided a patch to collect index operation statistics.
Some of us could try it out for a few weeks. That would give us a
better picture.
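Something along these lines could be a starting point (a rough sketch,
untested; GIT_INDEX_STATS, the log format and the helper names are all
made up for illustration, but the natural hook points would be
read_index_from() and write_index() in read-cache.c):

/*
 * Sketch only: per-operation index statistics, appended to the
 * file named by GIT_INDEX_STATS (a hypothetical variable).
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

static double stats_now(void)
{
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return tv.tv_sec + tv.tv_usec / 1e6;
}

/* One line per index read/write; aggregate the logs offline. */
static void log_index_op(const char *op, unsigned int nr_entries,
			 double elapsed)
{
	const char *path = getenv("GIT_INDEX_STATS");
	FILE *fp;

	if (!path)
		return;
	fp = fopen(path, "a");
	if (!fp)
		return;
	fprintf(fp, "%s entries=%u elapsed=%.6f\n", op, nr_entries, elapsed);
	fclose(fp);
}

read_index_from() and write_index() would bracket their work with
stats_now() and call log_index_op(); a few weeks of such logs should
show the read/write ratio and where the time actually goes.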
On Thu, Mar 29, 2012 at 10:21 PM, Thomas Gummerer <t.gummerer@xxxxxxxxx> wrote:
> - Tree structure
> The advantage of the tree structure over the append-only data structure that was
> proposed would be that it would not require any merge or other maintenance work
> after a change of a file. The disadvantage however is that changing a file
> would always require log(n) changes to the index, where n is the number of
> entries in the index. Another problem might be the integrity check, which would
> need either to use a hash over the whole index file or a hash on the tree,
> which would however take more time for checking.

I'd say it takes less time for checksumming, because we only verify the
trees we actually read. And a tree-based structure allows us to read
just a subdirectory, an advantage for multi-project repositories where
people stay in a subdirectory most of the time.

Shawn suspected that a crc32 checksum over a large chunk of data may be
insufficient. By hashing tree by tree, the chunks become significantly
shorter, making crc32 a viable and cheap candidate compared to sha-1
(note also that crc32 is already used to verify compressed objects in
a pack).
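To illustrate (again a sketch; the "tree block" layout with a trailing
four-byte crc is made up, but crc32() is zlib's, the same routine we
already use for the pack index v2 object CRCs):

/*
 * Sketch: verifying the index tree by tree with zlib's crc32().
 * Each directory's entries form one block ending in a 4-byte crc
 * (big-endian); reading a subdirectory only checksums the blocks
 * on that path instead of hashing the whole file.
 */
#include <stddef.h>
#include <stdint.h>
#include <zlib.h>

static uint32_t tree_block_crc(const unsigned char *buf, size_t len)
{
	uLong crc = crc32(0L, Z_NULL, 0);
	return (uint32_t)crc32(crc, buf, (uInt)len);
}

/* Returns 0 if the block's stored crc matches its contents. */
static int tree_block_verify(const unsigned char *buf, size_t len)
{
	uint32_t stored;

	if (len < 4)
		return -1;
	stored = (uint32_t)buf[len - 4] << 24 |
		 (uint32_t)buf[len - 3] << 16 |
		 (uint32_t)buf[len - 2] << 8 |
		 (uint32_t)buf[len - 1];
	return tree_block_crc(buf, len - 4) == stored ? 0 : -1;
}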
--
Duy