> On Jun 20, 2023, at 3:32 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>
>> // out needs to be zeroed first
>> void unpack(struct uncompressed *out, const u64 *in, const struct
>> bitblock *blocks, int nblocks)
>> {
>>     u64 *out_as_words = (u64*)out;
>>     for (int i = 0; i < nblocks; i++) {
>>         const struct bitblock *b;
>>         out_as_words[b->target] |= (in[b->source] & b->mask) <<
>>             b->shift;
>>     }
>> }
>>
>> void apply_offsets(struct uncompressed *out, const struct uncompressed *offsets)
>> {
>>     out->a += offsets->a;
>>     out->b += offsets->b;
>>     out->c += offsets->c;
>>     out->d += offsets->d;
>>     out->e += offsets->e;
>>     out->f += offsets->f;
>> }
>>
>> Which generates nice code: https://godbolt.org/z/3fEq37hf5
>
> Thinking about this a bit more, I think the only real performance issue with my code is that it does 12 read-xor-write operations in memory, which all depend on each other in horrible ways.

If you compare the generated code, just notice that you forgot to initialize b in unpack() in this version. I presume you wanted it to say "b = &blocks[i]".
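
For reference, here is a minimal sketch of unpack() with that one-line fix applied. The definitions of u64, struct uncompressed and struct bitblock are my guesses from how the quoted code uses them (six 64-bit fields a..f, and a descriptor with target, source, mask and shift), not the real ones, so treat them as placeholders:

```c
#include <stdint.h>

typedef uint64_t u64;

/* Assumed layout: six contiguous 64-bit fields, as implied by apply_offsets(). */
struct uncompressed {
    u64 a, b, c, d, e, f;
};

/* Assumed descriptor: take the bits of in[source] selected by mask and
 * OR them, shifted left by shift, into 64-bit word number target of out. */
struct bitblock {
    int target;
    int source;
    u64 mask;
    int shift;
};

/* out needs to be zeroed first */
void unpack(struct uncompressed *out, const u64 *in,
            const struct bitblock *blocks, int nblocks)
{
    u64 *out_as_words = (u64 *)out;

    for (int i = 0; i < nblocks; i++) {
        /* The initialization that was missing in the quoted version. */
        const struct bitblock *b = &blocks[i];

        out_as_words[b->target] |= (in[b->source] & b->mask) << b->shift;
    }
}
```

Without the initialization, b is read while indeterminate, so the compiler is free to generate anything for the loop body, which is likely to make a comparison of the generated code meaningless. The sketch also relies on struct uncompressed having no padding between its fields, which holds for the layout assumed here.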