Hi,

On 10/11/2019 7:21 PM, René van Dorst wrote:
>
> ...
>
> I also wonder if we can also replace the "li $x, -4" and "and $x" with
> "sll $x"
> combination on other places like [0], also on line 1169?
>
> Replace this on line 1169, works on my device.
>
> -       li      $in0,-4
>         srl     $ctx,$tmp4,2
> -       and     $in0,$in0,$tmp4
>         andi    $tmp4,$tmp4,3
> +       sll     $in0, $ctx, 2
>         addu    $ctx,$ctx,$in0

The reason I chose to keep 'li $in0,-4' in poly1305_emit is that the
original sequence has higher instruction-level parallelism. Yes, it's
one extra instruction, but if all of them get paired, they will execute
faster. Granted, it doesn't help single-issue processors such as yours,
but the thing is that the next instruction depends on the last one, and
then *formally* it's more appropriate to aim for higher ILP as a general
rule. Just in case, poly1305_blocks is different, because the dependent
instruction does not immediately follow the one that computes the
residue.

>> As for multiply-by-1-n-add.
>>
>
> I wonder how many devices do exist with the "poor man" version.

Well, it's not just how many devices, but more specifically how many of
those will end up running the code in question. I would guess the
poor-man's unit would be found in ultra-low-power microcontrollers,
so... As implied, it's probably sufficient to keep this in mind just in
case :-)

Cheers.
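
For illustration only, the two instruction sequences from the quoted diff
can be modelled in C as below. This is a rough sketch, not code from the
patch: the function names fold_mask, fold_shift and check_equivalence and
the parameter name top (standing in for $tmp4) are made up for this
example, and 32-bit registers are assumed. Both variants return the same
value, 5 * (top >> 2) modulo 2^32. The difference is only in the
dependencies: in the first one the mask and the shift both read top
directly, while in the second one the left shift has to wait for the
right shift.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sequence kept in poly1305_emit: li / srl / and / addu.
 * The shift and the mask are independent of each other. */
static uint32_t fold_mask(uint32_t top)
{
    uint32_t mask = (uint32_t)-4;  /* li   $in0,-4          */
    uint32_t q    = top >> 2;      /* srl  $ctx,$tmp4,2     */
    uint32_t m    = mask & top;    /* and  $in0,$in0,$tmp4  */
    return q + m;                  /* addu $ctx,$ctx,$in0   */
}

/* Proposed replacement: srl / sll / addu.
 * One instruction fewer, but sll depends on the result of srl. */
static uint32_t fold_shift(uint32_t top)
{
    uint32_t q = top >> 2;         /* srl  $ctx,$tmp4,2     */
    uint32_t m = q << 2;           /* sll  $in0,$ctx,2      */
    return q + m;                  /* addu $ctx,$ctx,$in0   */
}

/* Compare both variants on a handful of sample values. */
static void check_equivalence(void)
{
    static const uint32_t samples[] = {
        0u, 1u, 3u, 4u, 7u, 0x12345678u, 0xfffffffcu, 0xffffffffu
    };
    size_t i;

    for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        assert(fold_mask(samples[i]) == fold_shift(samples[i]));
}
```

Calling check_equivalence() from a small test program should pass
silently. It says nothing about timing, which depends on how the
particular core pairs the instructions.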