Re: io_uring failure on parisc (32-bit userspace and 64-bit kernel)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2/12/23 3:31?PM, Helge Deller wrote:
> On 2/12/23 23:20, Helge Deller wrote:
>> On 2/12/23 22:48, Jens Axboe wrote:
>>> On 2/12/23 1:01?PM, Helge Deller wrote:
>>>> On 2/12/23 20:42, Jens Axboe wrote:
>>>>> On 2/12/23 12:35?PM, Helge Deller wrote:
>>>>>> On 2/12/23 15:03, Helge Deller wrote:
>>>>>>> On 2/12/23 14:35, Jens Axboe wrote:
>>>>>>>> On 2/12/23 6:28?AM, Helge Deller wrote:
>>>>>>>>> On 2/12/23 14:16, Jens Axboe wrote:
>>>>>>>>>> On 2/12/23 2:47?AM, Helge Deller wrote:
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> We see io-uring failures on the parisc architecture with this testcase:
>>>>>>>>>>> https://github.com/axboe/liburing/blob/master/examples/io_uring-test.c
>>>>>>>>>>>
>>>>>>>>>>> parisc is always big-endian 32-bit userspace, with either 32- or 64-bit kernel.
>>>>>>>>>>>
>>>>>>>>>>> On a 64-bit kernel (6.1.11):
>>>>>>>>>>> deller@parisc:~$ ./io_uring-test test.file
>>>>>>>>>>> ret=0, wanted 4096
>>>>>>>>>>> Submitted=4, completed=1, bytes=0
>>>>>>>>>>> -> failure
>>>>>>>>>>>
>>>>>>>>>>> strace shows:
>>>>>>>>>>> io_uring_setup(4, {flags=0, sq_thread_cpu=0, sq_thread_idle=0, sq_entries=4, cq_entries=8, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|0x1f80, sq_off={head=0, tail=16, ring_mask=64, ring_entries=72, flags=84, dropped=80, array=224}, cq_off={head=32, tail=48, ring_mask=68, ring_entries=76, overflow=92, cqes=96, flags=0x58 /* IORING_CQ_??? */}}) = 3
>>>>>>>>>>> mmap2(NULL, 240, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, 3, 0) = 0xf7522000
>>>>>>>>>>> mmap2(NULL, 256, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, 3, 0x10000000) = 0xf6922000
>>>>>>>>>>> openat(AT_FDCWD, "libell0-dbgsym_0.56-2_hppa.deb", O_RDONLY|O_DIRECT) = 4
>>>>>>>>>>> statx(4, "", AT_STATX_SYNC_AS_STAT|AT_NO_AUTOMOUNT|AT_EMPTY_PATH, STATX_BASIC_STATS, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=689308, ...}) = 0
>>>>>>>>>>> getrandom("\x5c\xcf\x38\x2d", 4, GRND_NONBLOCK) = 4
>>>>>>>>>>> brk(NULL)                               = 0x4ae000
>>>>>>>>>>> brk(0x4cf000)                           = 0x4cf000
>>>>>>>>>>> io_uring_enter(3, 4, 0, 0, NULL, 8)     = 0
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Running the same testcase on a 32-bit kernel (6.1.11) works:
>>>>>>>>>>> root@debian:~# ./io_uring-test test.file
>>>>>>>>>>> Submitted=4, completed=4, bytes=16384
>>>>>>>>>>> -> ok.
>>>>>>>>>>>
>>>>>>>>>>> strace:
>>>>>>>>>>> io_uring_setup(4, {flags=0, sq_thread_cpu=0, sq_thread_idle=0, sq_entries=4, cq_entries=8, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|0x1f80, sq_off={head=0, tail=16, ring_mask=64, ring_entries=72, flags=84, dropped=80, array=224}, cq_off={head=32, tail=48, ring_mask=68, ring_entries=76, overflow=92, cqes=96, flags=0x58 /* IORING_CQ_??? */}}) = 3
>>>>>>>>>>> mmap2(NULL, 240, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, 3, 0) = 0xf6d4c000
>>>>>>>>>>> mmap2(NULL, 256, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, 3, 0x10000000) = 0xf694c000
>>>>>>>>>>> openat(AT_FDCWD, "trace.dat", O_RDONLY|O_DIRECT) = 4
>>>>>>>>>>> statx(4, "", AT_STATX_SYNC_AS_STAT|AT_NO_AUTOMOUNT|AT_EMPTY_PATH, STATX_BASIC_STATS, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=1855488, ...}) = 0
>>>>>>>>>>> getrandom("\xb2\x3f\x0c\x65", 4, GRND_NONBLOCK) = 4
>>>>>>>>>>> brk(NULL)                               = 0x15000
>>>>>>>>>>> brk(0x36000)                            = 0x36000
>>>>>>>>>>> io_uring_enter(3, 4, 0, 0, NULL, 8)     = 4
>>>>>>>>>>>
>>>>>>>>>>> I'm happy to test any patch if someone has an idea....
>>>>>>>>>>
>>>>>>>>>> No idea what this could be, to be honest. I tried your qemu vm image,
>>>>>>>>>> and it does boot, but it's missing keys to be able to update apt and
>>>>>>>>>> install packages... After fiddling with this for 30 min I gave up, any
>>>>>>>>>> chance you can update the sid image? Given how slow this thing is
>>>>>>>>>> running, it'd take me all day to do a fresh install and I have to admit
>>>>>>>>>> I'm not THAT motivated about parisc to do that :)
>>>>>>>>>
>>>>>>>>> Yes, I will update that image, but qemu currently only supports a
>>>>>>>>> 32-bit PA-RISC CPU which can only run the 32-bit kernel. So even if I
>>>>>>>>> update it, you won't be able to reproduce it, as it only happens with
>>>>>>>>> the 64-bit kernel. I'm sure it's some kind of missing 32-to-64bit
>>>>>>>>> translation in the kernel, which triggers only big-endian machines.
>>>>>>>>
>>>>>>>> I built my own kernel for it, so that should be fine, correct?
>>>>>>>
>>>>>>> No, as qemu won't boot the 64-bit kernel.
>>>>>>>
>>>>>>>> We'll see soon enough, managed to disable enough checks on the
>>>>>>>> debian-10 image to actually make it install packages.
>>>>>>>>
>>>>>>>>> Does powerpc with a 64-bit ppc64 kernel work?
>>>>>>>>> I'd assume it will show the same issue.
>>>>>>>>
>>>>>>>> No idea... Only stuff I use and test on is x86-64/32 and arm64.
>>>>>>>
>>>>>>> Would be interesting if someone could test...
>>>>>>>
>>>>>>>>> I will try to add some printks and compare the output of 32- and
>>>>>>>>> 64-bit kernels. If you have some suggestion where to add such (which?)
>>>>>>>>> debug code, it would help me a lot.
>>>>>>>>
>>>>>>>> I'd just try:
>>>>>>>>
>>>>>>>> echo 1 > /sys/kernel/debug/tracing/events/io_uring
>>>>>>>
>>>>>>> I'll try, but will take some time...
>>>>>>>
>>>>>>
>>>>>> At entry of io_submit_sqes(), io_sqring_entries() returns 0, because
>>>>>> ctx->rings->sq.tail is 0 (wrongly on broken 64-bit, but ok value 4 on 32-bit), and
>>>>>> ctx->cached_sq_head is 0 in both cases.
>>>>>
>>>>> cached_sq_head will get updated as sqes are consumed, but since sq.tail
>>>>> is zero, there's nothing to submit as far as io_uring is concerned.
>>>>>
>>>>> Can you dump addresses/offsets of the sq and cq heads/tails in userspace
>>>>> and in the kernel? They are u32, so same size of 32 and 64-bit.
>>>>
>>>> For both kernels (32- and 64-bit) I get:
>>>> p->sq_off.head = 0  p->sq_off.tail = 16
>>>> p->cq_off.head = 32  p->cq_off.tail = 48
>>>
>>> So all that looks as expected. Is it perhaps some mmap thing on 64-bit
>>> kernels? The kernel isn't seeing the updates. You could add the below
>>> debugging, and keep your kernel side stuff. Sounds like they don't quite
>>> agree.
>>>
>>>
>>> diff --git a/examples/io_uring-test.c b/examples/io_uring-test.c
>>> index 1a685360bff6..f1cfda90c018 100644
>>> --- a/examples/io_uring-test.c
>>> +++ b/examples/io_uring-test.c
>>> @@ -73,7 +73,9 @@ int main(int argc, char *argv[])
>>>               break;
>>>       } while (1);
>>>
>>> +    printf("pre-submit sq head/tail %d/%d, %d/%d\n", *ring.sq.khead, *ring.sq.ktail, ring.sq.sqe_head, ring.sq.sqe_tail);
>>>       ret = io_uring_submit(&ring);
>>> +    printf("post-submit sq head/tail %d/%d, %d/%d\n", *ring.sq.khead, *ring.sq.ktail, ring.sq.sqe_head, ring.sq.sqe_tail);
>>>       if (ret < 0) {
>>>           fprintf(stderr, "io_uring_submit: %s\n", strerror(-ret));
>>>           return 1;
>>
>> Result is:
>> pre-submit sq head/tail 0/0, 0/4
>> ..
>> post-submit sq head/tail 0/4, 4/4
> 
> I have to correct myself!
> The problem exists on both, 32- and 64-bit kernels.
> My current testing with 32-bit kernel was on qemu, but after booting the same kernel
> on a physical box, I see the testcase failing on the 32-bit kernel too.
> 
> So, probably some cache-flushing / alias-handling is needed....
> This is a 1-CPU box, so SMP isn't involved.

Yep sounds like it. What's the caching architecture of parisc? Something
like this perhaps, but may not be complete as we'd need to do something
for the cqe writes too I think, not just the cq tail.

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index db623b3185c8..ab0d1297bb0e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2338,6 +2338,8 @@ static const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx)
 	unsigned head, mask = ctx->sq_entries - 1;
 	unsigned sq_idx = ctx->cached_sq_head++ & mask;
 
+	flush_dcache_page(virt_to_page(ctx->sq_array + sq_idx));
+
 	/*
 	 * The cached sq head (or cq tail) serves two purposes:
 	 *
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index ab4b2a1c3b7e..b132f44a9364 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -220,6 +220,7 @@ static inline void io_commit_cqring(struct io_ring_ctx *ctx)
 {
 	/* order cqe stores with ring update */
 	smp_store_release(&ctx->rings->cq.tail, ctx->cached_cq_tail);
+	flush_dcache_page(virt_to_page(ctx->rings));
 }
 
 /* requires smb_mb() prior, see wq_has_sleeper() */

-- 
Jens Axboe




[Index of Archives]     [Linux SoC]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux