Since there have been a lot of questions recently about the state of the NewHash work, I thought I'd send out a summary.

== Status

I have patches to make the entire codebase work, including passing all tests, when Git is converted to use a 256-bit hash algorithm. Obviously, such a Git is incompatible with the current version, but it means that we've fixed essentially all of the hard-coded 20 and 40 constants (and therefore Git doesn't segfault).

I'm working on getting a 256-bit Git to work with SHA-1 being the default. Currently, this involves doing things like writing transport code, since in order to clone a repository, you need to be able to set up the hash algorithm correctly. I know that this was a non-goal in the transition plan, but since the testsuite doesn't pass without it, it's become necessary.

Some of these patches will be making their way to the list soon. They're hanging out in the normal places in the object-id-part14 branch (which may be rebased).

== Future Design

The work I've done necessarily involves porting everything to use the_hash_algo. Essentially, when the piece I'm currently working on is complete, we'll have a transition stage 4 implementation (all NewHash). Stages 2 and 3 will be implemented next.

My vision of how data is stored is that the .git directory is, except for pack indices and the loose object lookup table, entirely in one format. It will be all SHA-1 or all NewHash. This algorithm will be stored in the_hash_algo.

I plan on introducing an array of hash algorithms into struct repository (and wrapper macros) which stores, in order, the output hash, and if used, the additional input hash.

Functions like get_oid_hex and parse_oid_hex will acquire an internal version, which knows about parsing things (like refs) in the internal format, and one which knows about parsing in the UI formats.
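
To make the internal/UI split a bit more concrete, here is a minimal sketch in C. Everything in it is hypothetical: the sketch_* names, the struct layouts, and the idea of selecting a UI algorithm by hex length are assumptions for illustration, not Git's actual definitions or the code in my branch. It only models one storage algorithm (standing in for the_hash_algo) plus an ordered pair of UI algorithms, and it does not attempt any translation between hashes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration only; not Git's real types. */
#define SKETCH_MAX_RAWSZ 32	/* largest raw hash size this sketch supports */

struct sketch_hash_algo {
	const char *name;
	size_t rawsz;	/* size of the binary hash in bytes */
	size_t hexsz;	/* size of the hex form in characters */
};

struct sketch_object_id {
	unsigned char hash[SKETCH_MAX_RAWSZ];	/* sized for the largest hash */
};

struct sketch_repo {
	const struct sketch_hash_algo *storage;	/* format of the .git directory */
	/* UI algorithms: [0] is the output hash, [1] the optional extra input hash */
	const struct sketch_hash_algo *ui[2];
};

static const struct sketch_hash_algo sketch_sha1 = { "sha1", 20, 40 };
static const struct sketch_hash_algo sketch_newhash = { "newhash", 32, 64 };

static int sketch_hexval(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/* Parse exactly algo->hexsz hex digits. Returns 0 on success, -1 on failure. */
static int sketch_parse_hex(const struct sketch_hash_algo *algo,
			    const char *hex, struct sketch_object_id *oid)
{
	size_t i;

	if (strlen(hex) != algo->hexsz)
		return -1;
	memset(oid->hash, 0, sizeof(oid->hash));
	for (i = 0; i < algo->rawsz; i++) {
		int hi = sketch_hexval(hex[2 * i]);
		int lo = sketch_hexval(hex[2 * i + 1]);

		if (hi < 0 || lo < 0)
			return -1;
		oid->hash[i] = (unsigned char)((hi << 4) | lo);
	}
	return 0;
}

/* Internal version: only the storage algorithm is acceptable. */
static int sketch_parse_internal(const struct sketch_repo *r,
				 const char *hex, struct sketch_object_id *oid)
{
	assert(r->storage != NULL);
	return sketch_parse_hex(r->storage, hex, oid);
}

/* UI version: try each configured UI algorithm in order; NULL if none match. */
static const struct sketch_hash_algo *
sketch_parse_ui(const struct sketch_repo *r, const char *hex,
		struct sketch_object_id *oid)
{
	size_t i;

	for (i = 0; i < 2; i++) {
		if (r->ui[i] && !sketch_parse_hex(r->ui[i], hex, oid))
			return r->ui[i];
	}
	return NULL;
}
```

The point of the sketch is just that no caller needs a literal 20 or 40: sizes come from the algorithm descriptor, and buffers are sized for the largest supported hash.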
Similarly, oid_to_hex will have an internal version that handles data in the .git directory, and an external version that produces data in the output format. Translation will take place at the outer edges of the program.

The transition plan anticipates a stage 1 where we accept only SHA-1 on input and produce only SHA-1 on output, but store in NewHash. As I've worked with our tests, I've realized such an implementation is not entirely possible. We have various tools that expect to accept invalid object IDs, and obviously there's no way to have those continue to work. We'd have to either reject invalid data in such a case or combine stages 1 and 2.

== Compatibility with this Work

If you're working on new features and you'd like to implement the best possible compatibility with this work, here are some recommendations:

* Assume everything in the .git directory but pack indices and the loose object index will be in the same algorithm and that that algorithm is the_hash_algo.
* For the moment, use the_hash_algo to look up the size of all hash-related constants. Use GIT_MAX_* for allocations.
* If you are writing a new data format, add a version number.
* If you need to serialize an algorithm identifier into your data format, use the format_id field of struct git_hash_algo. It's designed specifically for that purpose.
* You can safely assume that the_hash_algo will be suitably initialized to the correct algorithm for your repository.
* Keep using the object ID functions and struct object_id.
* Try not to use mmap'd structs for reading and writing formats on disk, since these are hard to make hash size agnostic.

== Discussion about an Actual NewHash

Since I'll be writing new code, I'll be writing tests for this code. However, writing tests for creating and initializing repositories requires that I be able to test that objects are being serialized correctly, and therefore requires that I actually know what the hash algorithm is going to be.
I also can't submit code for multi-hash packs when we officially only support one hash algorithm.

I know that we have long tried to avoid discussing the specific algorithm to use, in part because the last discussion generated more heat than light, and settled on referring to it as NewHash for the time being. However, I think it's time to pick this topic back up, since I can't really continue work in this direction without us picking a NewHash.

If people are interested, I've done some analysis on availability of implementations, performance, and other attributes described in the transition plan and can send that to the list.

-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204