On Mon, Jan 06, 2025 at 02:17:58PM -0500, Taylor Blau wrote:
> On Fri, Jan 03, 2025 at 02:08:01PM +0100, Patrick Steinhardt wrote:
> > On Mon, Dec 30, 2024 at 12:22:34PM -0500, Taylor Blau wrote:
> > > On Mon, Dec 30, 2024 at 03:24:02PM +0100, Patrick Steinhardt wrote:
> > > > diff --git a/builtin/fast-import.c b/builtin/fast-import.c
> > > > index 1fa2929a01b7dfee52b653248bba802884f6be6a..0f86392761abbe6acb217fef7f4fe7c3ff5ac1fa 100644
> > > > --- a/builtin/fast-import.c
> > > > +++ b/builtin/fast-import.c
> > > > @@ -1106,7 +1106,7 @@ static void stream_blob(uintmax_t len, struct object_id *oidout, uintmax_t mark)
> > > >  	    || (pack_size + PACK_SIZE_THRESHOLD + len) < pack_size)
> > > >  		cycle_packfile();
> > > >
> > > > -	the_hash_algo->init_fn(&checkpoint.ctx);
> > > > +	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
> > >
> > > This will obviously fix the issue at hand, but I don't think this is any
> > > less brittle than before. The hash function implementation here needs to
> > > agree with that used in the hashfile API. This change makes that
> > > happen, but only using side information that the hashfile API uses the
> > > unsafe variants.
> >
> > Yup, I only cared about fixing the segfault because we're close to the
> > v2.48 release. I agree that the overall state is still extremely brittle
> > right now.
> >
> > [snip]
> > > I think we should perhaps combine forces here. My ideal end-state is to
> > > have the unsafe_hash_algo() stuff land from my earlier series, then have
> > > these two fixes (adjusted to the new world order as above), and finally
> > > the Meson fixes after that.
> > >
> > > Does that seem like a plan to you? If so, I can put everything together
> > > and send it out (if you're OK with me forging your s-o-b).
> >
> > I think the ideal state would be if the hashing function used was stored
> > as part of `struct git_hash_ctx`.
> > So the flow basically becomes, for example:
> >
> > ```
> > struct git_hash_ctx ctx;
> > struct object_id oid;
> >
> > git_hash_sha1_init(&ctx);
> > git_hash_update(&ctx, data);
> > git_hash_final_oid(&oid, &ctx);
> > ```
> >
> > Note how the intermediate calls don't need to know which hash function
> > you used to initialize the `struct git_hash_ctx` -- the structure itself
> > should remember what it has been initialized with and do the right thing.
>
> I'm not sure I'm following you here. In the stream_blob() function
> within fast-import, the problem isn't that we're switching hash
> functions mid-stream, but that we're initializing the hashfile_checkpoint
> structure with the wrong hash function to begin with.

True, but it would have been a non-issue if the hash context itself knew
which hash function to use for updates. Sure, we would've used the slow
variant of SHA1 instead of the fast-but-unsafe one. But that feels like
the lesser evil compared to crashing.

> You snipped it out of your reply, but I think that my suggestion to do:
>
>     pack_file->algop->init_fn(&checkpoint.ctx);
>
> would harden us against the broken behavior we're seeing here.
>
> As a separate defense-in-depth measure, we could teach functions from
> the hashfile API which deal with the hashfile_checkpoint structure to
> ensure that the hashfile and its checkpoint both use the same algorithm
> (by adding a hash_algo field to the hashfile_checkpoint structure).

I would think it would be even harder to abuse if it wasn't the hashfile
API, but the hash API, that remembered the algorithm.

Patrick