io_uring (bdev_ioring) unstable on newer kernels ?

Hi all,

I have a problem with OSD getting thrown out from the cluster on very low activity, and then not being startable again without a reboot.

Context :
- single node home cluster, 15 OSDs (rotational 10/12 TB SAS drives), failure domain = OSD
- 8+3 EC CephFS data pool, 3/1 replicated CephFS metadata
- no separate WAL/RocksDB device
- Pacific 16.2.9 (all OSDs have been (re)created with Pacific some time ago due to hardware changes)
- Gentoo Linux, kernel 5.18.1
- I have some parameters set in ceph.conf :
   bluestore_default_buffered_write = true
   bluefs_buffered_io = true
   bdev_ioring = true
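
To double-check that a running OSD really picked these up, something like this can be used (osd.7 is just an example id here, and the second command has to run on that OSD's host) :
   ceph config show osd.7 bdev_ioring
   ceph daemon osd.7 config get bdev_ioring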

ceph status :
  cluster:
    id:     8c03b3b4-13cb-4bc1-9894-9ee71a3fe09a
    health: HEALTH_WARN
            noout,noscrub,nodeep-scrub flag(s) set
            Degraded data redundancy: 4581987/268329121 objects degraded (1.708%), 45 pgs degraded, 48 pgs undersized

  services:
    mon: 1 daemons, quorum r440 (age 13h)
    mgr: r440(active, since 13h)
    mds: 1/1 daemons up
    osd: 15 osds: 15 up (since 13h), 15 in (since 4w); 55 remapped pgs
         flags noout,noscrub,nodeep-scrub

  data:
    volumes: 1/1 healthy
    pools:   3 pools, 161 pgs
    objects: 24.67M objects, 74 TiB
    usage:   100 TiB used, 60 TiB / 160 TiB avail
    pgs:     4581987/268329121 objects degraded (1.708%)
             1403132/268329121 objects misplaced (0.523%)
             106 active+clean
             45  active+undersized+degraded+remapped+backfill_wait
             7   active+remapped+backfill_wait
             2   active+undersized+remapped+backfilling
             1   active+undersized+remapped+backfill_wait

  io:
    recovery: 95 MiB/s, 28 objects/s

Here are the steps to the problem :
- the cluster is rebuilding an OSD at moderate speed (2 backfills, 20 requests, 0.1 sleep), reading at 8-16 MB/s and writing at 16 MB/s on the rebuilding OSD. Disk utilisation is stable at around 50%, OSD latencies are approx 2 ms and stable too. I tried slowing down the backfilling (just 1 backfill, fewer requests) with no effect.
- the trigger is just using the storage (for example writing a single 700 MB file), or even doing nothing at all and letting the rebuild process work. It happens on writes, not on reads: reading several files at full speed does not trigger the problem.
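
For reference, the throttle values above correspond to something like the following (I am citing the option names from memory, in particular the one behind "requests", so treat this as a sketch) :
   ceph config set osd osd_max_backfills 2
   ceph config set osd osd_recovery_max_active 20
   ceph config set osd osd_recovery_sleep_hdd 0.1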

Suddenly, one of the OSDs (it can happen with any of them) throws out lots of :
heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fc1bfbec640' had timed out after 15.000000954s
and finally gets killed with :
heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fc1be3e9640' had suicide timed out after 150.000000000s
The ceph-osd process is killed and the OSD is marked down.

Cluster is set noout, so I try to restart the process (systemctl start ceph-osd@7.service), but the device is blocked :
2022-06-05T18:26:42.282+0200 7f22f4cf30c0  0 bdev(0x5565a4c2d000 /var/lib/ceph/osd/ceph-7/block) ioctl(F_SET_FILE_RW_HINT) on /var/lib/ceph/osd/ceph-7/block failed: (22) Invalid argument
2022-06-05T18:26:42.282+0200 7f22f4cf30c0  1 bdev(0x5565a4c2d000 /var/lib/ceph/osd/ceph-7/block) _lock flock busy on /var/lib/ceph/osd/ceph-7/block
2022-06-05T18:26:42.382+0200 7f22f4cf30c0  1 bdev(0x5565a4c2d000 /var/lib/ceph/osd/ceph-7/block) _lock flock busy on /var/lib/ceph/osd/ceph-7/block
2022-06-05T18:26:42.482+0200 7f22f4cf30c0  1 bdev(0x5565a4c2d000 /var/lib/ceph/osd/ceph-7/block) _lock flock busy on /var/lib/ceph/osd/ceph-7/block
2022-06-05T18:26:42.592+0200 7f22f4cf30c0  1 bdev(0x5565a4c2d000 /var/lib/ceph/osd/ceph-7/block) _lock flock busy on /var/lib/ceph/osd/ceph-7/block
2022-06-05T18:26:42.592+0200 7f22f4cf30c0 -1 bdev(0x5565a4c2d000 /var/lib/ceph/osd/ceph-7/block) open failed to lock /var/lib/ceph/osd/ceph-7/block: (11) Resource temporarily unavailable
2022-06-05T18:26:44.732+0200 7f22f4cf30c0 -1 osd.7 0 OSD:init: unable to mount object store
2022-06-05T18:26:44.732+0200 7f22f4cf30c0 -1 ** ERROR: osd init failed: (11) Resource temporarily unavailable
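
The obvious userspace checks for whatever still holds the flock can be run like this (they may well come back empty, since the reference seems to be held by the kernel io_uring context rather than by a process; the block symlink has to be resolved to the underlying LV first) :
   lsof $(readlink -f /var/lib/ceph/osd/ceph-7/block)
   fuser -v $(readlink -f /var/lib/ceph/osd/ceph-7/block)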

The complete OSD log is available at :
https://phandaal.net/nextcloud/s/cLpwykMmPtp6pJo

In kernel logs I get :
Jun 05 18:26:32 r440 kernel: libceph: osd7 down

and right after :
Jun 05 18:31:31 r440 kernel: ------------[ cut here ]------------
Jun 05 18:31:31 r440 kernel: WARNING: CPU: 25 PID: 58420 at fs/io_uring.c:10091 io_ring_exit_work+0xb9/0x533
Jun 05 18:31:31 r440 kernel: Modules linked in:
Jun 05 18:31:31 r440 kernel: CPU: 25 PID: 58420 Comm: kworker/u66:5 Tainted: G I 5.18.1-gentoo-r1 #1
Jun 05 18:31:31 r440 kernel: Hardware name: Dell Inc. PowerEdge R440/04JN2K, BIOS 2.13.3 12/17/2021
Jun 05 18:31:31 r440 kernel: Workqueue: events_unbound io_ring_exit_work
Jun 05 18:31:31 r440 kernel: RIP: 0010:io_ring_exit_work+0xb9/0x533
Jun 05 18:31:31 r440 kernel: Code: 89 e2 48 c7 c6 4a 6b c7 85 e8 91 44 21 ff 4c 89 ef e8 00 e3 ff ff 4c 89 e7 e8 99 fe ff ff 48 8b 05 ba f0 d8 00 49 39 c6 79 08 <0f> 0b 41 bf 70 17 00 00 48 8d 7b 30 4c 89 fe e8 31 64 09 00 48 85
Jun 05 18:31:31 r440 kernel: RSP: 0018:ffff9eaba017fe10 EFLAGS: 00010293
Jun 05 18:31:31 r440 kernel: RAX: 0000000100a6b928 RBX: ffff8c680b7dbbc0 RCX: 0000000000000000
Jun 05 18:31:31 r440 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8c680b7db840
Jun 05 18:31:31 r440 kernel: RBP: ffff9eaba017fe98 R08: 0000000000000000 R09: 0000000000000001
Jun 05 18:31:32 r440 kernel: R10: 0000000000000000 R11: 000000000000000e R12: ffff8c680b7db800
Jun 05 18:31:32 r440 kernel: R13: 0000000000000000 R14: 0000000100a6b926 R15: 0000000000000005
Jun 05 18:31:32 r440 kernel: FS: 0000000000000000(0000) GS:ffff8c777fd00000(0000) knlGS:0000000000000000
Jun 05 18:31:32 r440 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 05 18:31:32 r440 kernel: CR2: 000055d46e1ce000 CR3: 0000001084724001 CR4: 00000000007706e0
Jun 05 18:31:32 r440 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 05 18:31:32 r440 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jun 05 18:31:32 r440 kernel: PKRU: 55555554
Jun 05 18:31:32 r440 kernel: Call Trace:
Jun 05 18:31:32 r440 kernel:  <TASK>
Jun 05 18:31:32 r440 kernel:  ? __schedule+0x310/0x1230
Jun 05 18:31:32 r440 kernel:  ? flush_to_ldisc+0x26/0xf0
Jun 05 18:31:32 r440 kernel:  process_one_work+0x1d5/0x3a0
Jun 05 18:31:32 r440 kernel:  ? rescuer_thread+0x3c0/0x3c0
Jun 05 18:31:32 r440 kernel:  worker_thread+0x4d/0x3c0
Jun 05 18:31:32 r440 kernel:  ? rescuer_thread+0x3c0/0x3c0
Jun 05 18:31:32 r440 kernel:  kthread+0xe5/0x110
Jun 05 18:31:32 r440 kernel:  ? kthread_complete_and_exit+0x20/0x20
Jun 05 18:31:32 r440 kernel:  ret_from_fork+0x1f/0x30
Jun 05 18:31:32 r440 kernel:  </TASK>
Jun 05 18:31:32 r440 kernel: ---[ end trace 0000000000000000 ]---

The error seems related to io_uring usage (I have bdev_ioring set to true).

The kworker thread stays blocked in the kernel afterwards (state D) :
root 58420 0.0 0.0 0 0 ? D 18:23 0:00 \_ [kworker/u66:5+events_unbound]
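
To get more detail on where that worker is stuck, the kernel stack of the blocked task can be dumped (PID 58420 is the kworker from the trace above; sysrq has to be enabled for the second command) :
   cat /proc/58420/stack
   echo w > /proc/sysrq-trigger    # dumps all blocked (D state) tasks to the kernel log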

What is strange is that the device itself (/dev/sdo in this case) is perfectly available :
- I can smartctl -a /dev/sdo to look at smart logs, no hardware problems reported
- I can read from it (dd if=/dev/sdo works perfectly and gets correct data)
- I can unmount its tmpfs (umount /var/lib/ceph/osd/ceph-7)
- I can reactivate the OSD: ceph-volume lvm activate 7 95813126-23fd-471c-84eb-5c002c62471a reports "ceph-volume lvm activate successful for osd ID: 7" and the tmpfs is mounted with the correct content, but the OSD still refuses to start
- I can take the drive out of the bay and put it back in; it gets detected OK (as another drive, not on the same /dev/sd?), but then I get into LVM hell because the PV/VG/LVs are detected as duplicates and cannot be removed from the old (blocked) device. I can however destroy the old OSD and recreate it on the 'new' device without problem (this is why it is rebuilding currently, I did not try a reboot at first).
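
A possible workaround for the duplicate PV/VG/LV mess (untested here, just a sketch, the pattern has to match whatever device path is stuck) would be to hide the stale /dev/sdo path from LVM scans in /etc/lvm/lvm.conf :
   devices {
       # reject the stale path, accept everything else
       global_filter = [ "r|^/dev/sdo.*|", "a|.*|" ]
   }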

This seems to confirm that the problem comes from the use of two different kernel paths to access the device: io_uring for the BlueStore data, and the standard I/O path for everything else, with only the io_uring path being blocked.

After a reboot, the OSD starts again perfectly and is marked UP without problem, and the content is intact (or so it seems, I have yet to scrub all this once the rebuild is finished).

I had seen this error on a 5.14.7 kernel previously, but it happened very rarely (once in several months of usage, I swapped out/in the drive and rebuilt it without problem, no reboot).
It now happens several times _per day_ (kernel 5.18.0 and 5.18.1).

I have disabled io_uring (ceph config set osd bdev_ioring false) and everything seems back to normal (tested with several write loads without OSDs getting kicked out).
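
To spell out what I mean, and assuming bdev_ioring is only read when the OSD opens its block device (so each OSD needs a restart for the change to take effect) :
   ceph config set osd bdev_ioring false
   ceph config get osd bdev_ioring
   systemctl restart ceph-osd@7.service    # repeated for every OSD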

Is ceph with io_uring tested on recent kernels ?
Are there any other parameters I can try to keep io_uring working ? It is not really necessary on my home cluster, but at work, on SSD-based clusters, io_uring is showing great potential for better performance. I did not try bdev_ioring_hipri nor bdev_ioring_sqthread_poll; is it necessary to set them to true when using io_uring ?
Would Quincy make a difference ?
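
For reference, the io_uring related OSD options I am aware of are these three; treat this as a sketch of what I would test, as I have only ever enabled the first one (my understanding is that hipri selects polled I/O completions and sqthread_poll enables a kernel submission-polling thread) :
   bdev_ioring = true
   bdev_ioring_hipri = true
   bdev_ioring_sqthread_poll = true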

Regards,
Christian.

--
Christian Vilhelm : phandaal@xxxxxxxxxxxx
Reality is for people who lack imagination


