On Sat, Jun 9, 2018 at 10:57 PM brian m. carlson
<sandals@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> Since there's been a lot of questions recently about the state of the
> NewHash work, I thought I'd send out a summary.
>
> == Status
>
> I have patches to make the entire codebase work, including passing all
> tests, when Git is converted to use a 256-bit hash algorithm.
> Obviously, such a Git is incompatible with the current version, but it
> means that we've fixed essentially all of the hard-coded 20 and 40
> constants (and therefore Git doesn't segfault).

This is so cool!

> == Future Design
>
> The work I've done necessarily involves porting everything to use
> the_hash_algo. Essentially, when the piece I'm currently working on is
> complete, we'll have a transition stage 4 implementation (all NewHash).
> Stage 2 and 3 will be implemented next.
>
> My vision of how data is stored is that the .git directory is, except
> for pack indices and the loose object lookup table, entirely in one
> format. It will be all SHA-1 or all NewHash. This algorithm will be
> stored in the_hash_algo.
>
> I plan on introducing an array of hash algorithms into struct repository
> (and wrapper macros) which stores, in order, the output hash, and if
> used, the additional input hash.

I'm actually thinking that putting the_hash_algo inside struct
repository is a mistake. We have code that's supposed to work without
a repo, and that code shows it does not really make sense to force the
use of a partially-valid repo. Keeping the_hash_algo a separate
variable sounds more elegant.

> If people are interested, I've done some analysis on availability of
> implementations, performance, and other attributes described in the
> transition plan and can send that to the list.

I quickly skimmed through that document.
I have two more concerns that are less about any specific hash
algorithm:

- how does a larger hash size affect git (I guess you covered the cpu
  aspect, but what about cache-friendliness, disk usage, memory
  consumption)

- how does all the function redirection (from abstracting away SHA-1)
  affect git performance. E.g. hashcmp could be optimized and inlined
  by the compiler. Now it can probably still optimize the
  memcmp(,,20), but we stack another indirect function call on top. I
  guess I might be just paranoid and this is not a big deal after all.
-- 
Duy
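
PS. To make the hashcmp concern a bit more concrete, here is a rough
sketch of the three shapes I have in mind. This is not git's actual
code: struct hash_algo_sketch, current_algo, select_algo and the
hashcmp_* helpers are all hypothetical names made up for illustration,
and only the raw size and a comparison callback are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a hash algorithm descriptor (not git's). */
struct hash_algo_sketch {
	const char *name;
	size_t rawsz;	/* raw hash length in bytes */
	int (*cmp)(const unsigned char *, const unsigned char *);
};

static int cmp20(const unsigned char *a, const unsigned char *b)
{
	return memcmp(a, b, 20);
}

static int cmp32(const unsigned char *a, const unsigned char *b)
{
	return memcmp(a, b, 32);
}

static const struct hash_algo_sketch sha1_sketch = { "sha1", 20, cmp20 };
static const struct hash_algo_sketch newhash_sketch = { "newhash", 32, cmp32 };

/* Chosen at run time, so the compiler cannot assume either size. */
static const struct hash_algo_sketch *current_algo = &sha1_sketch;

static inline void select_algo(const struct hash_algo_sketch *algo)
{
	assert(algo != NULL);
	current_algo = algo;
}

/* Old shape: the length is a compile-time constant. */
static inline int hashcmp_fixed(const unsigned char *a, const unsigned char *b)
{
	return memcmp(a, b, 20);
}

/* Abstracted shape 1: the length is loaded through a pointer. */
static inline int hashcmp_dynamic(const unsigned char *a, const unsigned char *b)
{
	return memcmp(a, b, current_algo->rawsz);
}

/* Abstracted shape 2: an indirect call on top of the comparison. */
static inline int hashcmp_indirect(const unsigned char *a, const unsigned char *b)
{
	return current_algo->cmp(a, b);
}
```

With the constant length a compiler can typically expand the memcmp
inline; with the other two it has to load the size or the function
pointer first, and the indirect call usually cannot be inlined at all.
Whether any of that is measurable in practice is exactly what I don't
know.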