Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> writes: > OK how about this. The general idea is preserve/extend current flat > index API and add a new (tree-based) one. Index users can use either. > They can even mix them up (which they do because we can't just flip > the API in one day for about 200 source files). > > The day that unpack_trees() is converted to tree API, I will declare > v5 victory ;) s/API, /API and benchmark says tree-shaped index is an overall win, /; > = Cleanup = > > struct cache_entry becomes partly opaque. ce_ctime..ce_gid are hidden > in -v2.c and -v5.c. We only expose ce_size, ce_flags, ce_namelen, name > and sha1 to index users. Extra v5 fields like ce_stat_crc, next and > dir_next are also hidden. These fields can be put in a real struct in > read-cache.h, which is supposedly included by -v2.c and -v5.c I do not particularly see a reason to keep different in-core cache_entry representations even in an early round of the API updates. If v2 needs ctime and gid and v5 needs crc, keep both fields for simplicity. When coming from the filesystem, ctime, gid and friends are immediately available and crc needs to be computed only immediately before it is written out or it is compared with an existing entry. I also do not see a reason to keep two representations of in-core index_state representations for that matter. The current code that access nth entry from the index->cache[nth] would need to be updated to use an accessor function, whether the "nth" comes from index_name_pos() or from the for-loop that iterates over the entire index. For the latter, you would need to give the users a function that returns a cursor into the in-core index to allow iterating over it. When you use an in-core representation that is not a flat array, the type of "nth", which is essentially a cursor, may have to change to something that is richer than a simple integer, in order to give the implementation of the in-core index a more efficient way to access the entry than traversing the leaves of the tree depth first, and you would need to update index_name_pos() to return such a "cursor". That design and development cost is part of updating the in-core data structure. In the end result, the runtime cost to manipulate an index entry that the cursor refers to should be minimum, as that would be the cost paid by all the users of the API anyway, even if we _were_ starting from an ideal world where there weren't any flat in-core index in the first place. Because the v2 on-disk format forces us to scan the whole thing at least once, with a properly designed in-core representation, the overall system would not suffer performance penalty when reading from v2, as both the current code and the updated code have to read everything, and accesses based on the cursor given by either index_name_pos() or the index iterator has to be fast anyway (if the latter does not hold true, your updated in-core representation that is not a flat array needs to be rethought). On top of such a solid foundation, we can map the updated in-core representation to an on-disk representation with confidence, as any performance improvement or degradation from that point on must be solely attributable to the on-disk format difference. Without such a foundation, it is hard to justify a different on-disk format without handwaving, no? -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html