Hi Matthew,
Am 22.07.2023 um 05:03 schrieb Matthew Wilcox:
I thought it a little odd to use an unsigned char when we're testing
to see if it's negative, so I went with this:
static inline bool clear_bit_unlock_is_negative_byte(unsigned int nr,
volatile unsigned long *p)
{
char result;
char mask = 1 << nr; /* nr guaranteed to be < 7 */
char *cp = (char *)p + 3; /* m68k is big-endian */
__asm__ __volatile__ ("eori.b %1, %2; smi %0"
: "=d" (result)
: "i" (mask), "o" (*cp)
: "memory");
return result;
}
I don't think signedness make a difference here, as we don't leave
anything about the test to the compiler.
I'm sure you can do all the casting to char and increment by 3 in the asm
argument...
I'd rather not. I looked at doing the offset by three inside the asm,
but it seems like gcc is smart enough to do that without help:
000006e0 <folio_unlock>:
6e0: 206f 0004 moveal %sp@(4),%a0
6e4: 0a28 0001 0003 eorib #1,%a0@(3)
6ea: 5bc0 smi %d0
6ec: 4a00 tstb %d0
6ee: 670a beqs 6fa <folio_unlock+0x1a>
6f0: 42a7 clrl %sp@-
6f2: 2f08 movel %a0,%sp@-
6f4: 4eba fcec jsr %pc@(3e2 <folio_wake_bit>)
6f8: 508f addql #8,%sp
6fa: 4e75 rts
You'll note the smi/tstb pair are unnecessary. It could simply BPL to
the RTS instruction, but we can't tell GCC that because we don't have
the __GCC_ASM_FLAG_OUTPUTS__ feature.
I see. __GCC_ASM_FLAG_OUTPUTS__ appears to be rare enough to assume m68k
won't get to use it this decade.
By the way, before this optimisation, it was this:
000006fc <folio_unlock>:
6fc: 206f 0004 moveal %sp@(4),%a0
700: 08a8 0000 0003 bclr #0,%a0@(3)
706: 2010 movel %a0@,%d0
708: 4a00 tstb %d0
70a: 6c0a bges 716 <folio_unlock+0x1a>
70c: 42a7 clrl %sp@-
70e: 2f08 movel %a0,%sp@-
710: 4eba fcd0 jsr %pc@(3e2 <folio_wake_bit>)
714: 508f addql #8,%sp
716: 4e75 rts
which is the same number of instructions, but one more memory reference.
Seeing as a0 was just loaded from memory above, and should be in the
cache, I doubt there's much performance difference from the additional
memory access here. Haven't looked at how many clock cycles each of
these instructions take.
It's a read-after-write hazard, but I don't know if that affects any
m68k implementation; my impression is that even on an '060 there aren't
Unless something modifies the memory location pointed to by a0 (during
preemption of any kind), I can't see a race here (a0 is saved/restored
during interrupts, syscalls and such).
We haven't seen any races here in the past with this code AFAICS.
any real performance implications. Kudos to gcc for figuring out that
testing bit 7 can be done with the tstb instruction.
If there's a simple way to exercise this code path using standard Unix tools
(or stress-ng which I ought to have somewhere), drop me a hint.
Oh, it's so common to have a waiter on a folio unlock that just making
it to the login prompt is enough to declare comfidently that this works.
CPU implementations with memory barriers and such fanciness are a little
harder to be confident in, but this looks good to me. I generally run
xfstests, but that's just because I have it all set up and ready to go.
Should be all good then.
I'll drop your Tested-by on this if that's OK? If you want a
Co-developed-by credit, that's fine with me too!
Tested-by is fine, thanks. I wouldn't want anyone to assume I'm familiar
with folios and the like...
Cheers,
Michael