On Mon, Dec 12, 2016 at 10:44 PM, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
> #if defined(CONFIG_DCACHE_WORD_ACCESS) && BITS_PER_LONG == 64
>         switch (left) {
>         case 0: break;
>         case 1: b |= data[0]; break;
>         case 2: b |= get_unaligned_le16(data); break;
>         case 4: b |= get_unaligned_le32(data); break;
>         default:
>                 b |= le64_to_cpu(load_unaligned_zeropad(data) &
>                                  bytemask_from_count(left));
>                 break;
>         }
> #else
>         switch (left) {
>         case 7: b |= ((u64)data[6]) << 48;
>         case 6: b |= ((u64)data[5]) << 40;
>         case 5: b |= ((u64)data[4]) << 32;
>         case 4: b |= get_unaligned_le32(data); break;
>         case 3: b |= ((u64)data[2]) << 16;
>         case 2: b |= get_unaligned_le16(data); break;
>         case 1: b |= data[0];
>         }
> #endif

As it turns out, perhaps unsurprisingly, the code generation here is
really not nice, resulting in many branches instead of a computed jump.
I'll submit v3 with just a branch-less load_unaligned_zeropad for the
64-bit/dcache case and Duff's device for the other case.
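
Concretely, the shape I have in mind is roughly the following. This is
an untested sketch, wrapped in a hypothetical siphash_tail() helper
purely for illustration; data/left/b are the same variables as in the
quoted hunk above, and the actual v3 patch may differ in details:

#include <linux/types.h>          /* u64, u8 */
#include <linux/dcache.h>         /* bytemask_from_count() */
#include <asm/byteorder.h>        /* le64_to_cpu() */
#include <asm/unaligned.h>        /* get_unaligned_le16/32() */
#include <asm/word-at-a-time.h>   /* load_unaligned_zeropad() */

/* Hypothetical helper: fold the final 0..7 input bytes into b. */
static inline u64 siphash_tail(const u8 *data, unsigned int left, u64 b)
{
#if defined(CONFIG_DCACHE_WORD_ACCESS) && BITS_PER_LONG == 64
        /*
         * Single masked word load instead of a per-length switch; the
         * only branch left is the left == 0 check.
         */
        if (left)
                b |= le64_to_cpu(load_unaligned_zeropad(data) &
                                 bytemask_from_count(left));
#else
        /* Fall-through switch for the generic case, as quoted above. */
        switch (left) {
        case 7: b |= ((u64)data[6]) << 48;
        case 6: b |= ((u64)data[5]) << 40;
        case 5: b |= ((u64)data[4]) << 32;
        case 4: b |= get_unaligned_le32(data); break;
        case 3: b |= ((u64)data[2]) << 16;
        case 2: b |= get_unaligned_le16(data); break;
        case 1: b |= data[0];
        }
#endif
        return b;
}

The missing breaks in the generic path are intentional fall-throughs,
so each length only pays for the loads it actually needs.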