io_uring (bdev_ioring) unstable on newer kernels ?

Hi all,

I have a problem with OSD getting thrown out from the cluster on very low activity, and then not being startable again without a reboot.

Context :
- single node home cluster, 15 OSDs (rotational 10/12 TB SAS drives), failure domain = OSD
- 8+3 EC CephFS data pool, 3/1 replicated CephFS metadata
- no separate WAL/RocksDB device
- Pacific 16.2.9 (all OSDs have been (re)created with Pacific some time ago due to hardware changes)
- Gentoo Linux, kernel 5.18.1
- I have some parameters set in ceph.conf :
   bluestore_default_buffered_write = true
   bluefs_buffered_io = true
   bdev_ioring = true
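
To double-check that a running OSD really picked these up, something like this can be used (osd.7 is just an example id here, and the second command has to run on that OSD's host) :
   ceph config show osd.7 bdev_ioring
   ceph daemon osd.7 config get bdev_ioring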

ceph status :
  cluster:
    id:     8c03b3b4-13cb-4bc1-9894-9ee71a3fe09a
    health: HEALTH_WARN
            noout,noscrub,nodeep-scrub flag(s) set
            Degraded data redundancy: 4581987/268329121 objects degraded (1.708%), 45 pgs degraded, 48 pgs undersized

  services:
    mon: 1 daemons, quorum r440 (age 13h)
    mgr: r440(active, since 13h)
    mds: 1/1 daemons up
    osd: 15 osds: 15 up (since 13h), 15 in (since 4w); 55 remapped pgs
         flags noout,noscrub,nodeep-scrub

  data:
    volumes: 1/1 healthy
    pools:   3 pools, 161 pgs
    objects: 24.67M objects, 74 TiB
    usage:   100 TiB used, 60 TiB / 160 TiB avail
    pgs:     4581987/268329121 objects degraded (1.708%)
             1403132/268329121 objects misplaced (0.523%)
             106 active+clean
             45  active+undersized+degraded+remapped+backfill_wait
             7   active+remapped+backfill_wait
             2   active+undersized+remapped+backfilling
             1   active+undersized+remapped+backfill_wait

  io:
    recovery: 95 MiB/s, 28 objects/s

Here are the steps to the problem :
- the cluster is rebuilding an OSD at moderate speed (2 backfills, 20 requests, 0.1 sleep), reading at 8-16 MB/s and writing at 16 MB/s on the rebuilding OSD. Disk utilisation is stable at around 50%, OSD latencies are approx 2 ms and stable too. I tried slowing down the backfilling (just 1 backfill, fewer requests) with no effect.
- the trigger is just using the storage (for example writing a single 700 MB file), or even doing nothing at all and letting the rebuild process work. It happens on writes, not on reads: reading several files at full speed does not trigger the problem.
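
For reference, the throttle values above correspond to something like the following (I am citing the option names from memory, in particular the one behind "requests", so treat this as a sketch) :
   ceph config set osd osd_max_backfills 2
   ceph config set osd osd_recovery_max_active 20
   ceph config set osd osd_recovery_sleep_hdd 0.1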

Suddenly, one of the OSDs (it can happen with any of them) throws out lots of :
heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fc1bfbec640' had timed out after 15.000000954s
and finally gets killed with :
heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fc1be3e9640' had suicide timed out after 150.000000000s
The ceph-osd process is killed and the OSD is marked down.

Cluster is set noout, so I try to restart the process (systemctl start ceph-osd@7.service), but the device is blocked :
2022-06-05T18:26:42.282+0200 7f22f4cf30c0  0 bdev(0x5565a4c2d000 /var/lib/ceph/osd/ceph-7/block) ioctl(F_SET_FILE_RW_HINT) on /var/lib/ceph/osd/ceph-7/block failed: (22) Invalid argument
2022-06-05T18:26:42.282+0200 7f22f4cf30c0  1 bdev(0x5565a4c2d000 /var/lib/ceph/osd/ceph-7/block) _lock flock busy on /var/lib/ceph/osd/ceph-7/block
2022-06-05T18:26:42.382+0200 7f22f4cf30c0  1 bdev(0x5565a4c2d000 /var/lib/ceph/osd/ceph-7/block) _lock flock busy on /var/lib/ceph/osd/ceph-7/block
2022-06-05T18:26:42.482+0200 7f22f4cf30c0  1 bdev(0x5565a4c2d000 /var/lib/ceph/osd/ceph-7/block) _lock flock busy on /var/lib/ceph/osd/ceph-7/block
2022-06-05T18:26:42.592+0200 7f22f4cf30c0  1 bdev(0x5565a4c2d000 /var/lib/ceph/osd/ceph-7/block) _lock flock busy on /var/lib/ceph/osd/ceph-7/block
2022-06-05T18:26:42.592+0200 7f22f4cf30c0 -1 bdev(0x5565a4c2d000 /var/lib/ceph/osd/ceph-7/block) open failed to lock /var/lib/ceph/osd/ceph-7/block: (11) Resource temporarily unavailable
2022-06-05T18:26:44.732+0200 7f22f4cf30c0 -1 osd.7 0 OSD:init: unable to mount object store
2022-06-05T18:26:44.732+0200 7f22f4cf30c0 -1 ** ERROR: osd init failed: (11) Resource temporarily unavailable
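
The obvious userspace checks for whatever still holds the flock can be run like this (they may well come back empty, since the reference seems to be held by the kernel io_uring context rather than by a process; the block symlink has to be resolved to the underlying LV first) :
   lsof $(readlink -f /var/lib/ceph/osd/ceph-7/block)
   fuser -v $(readlink -f /var/lib/ceph/osd/ceph-7/block)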

The complete OSD log is available at :
https://phandaal.net/nextcloud/s/cLpwykMmPtp6pJo

In kernel logs I get :
Jun 05 18:26:32 r440 kernel: libceph: osd7 down

and right after :
Jun 05 18:31:31 r440 kernel: ------------[ cut here ]------------
Jun 05 18:31:31 r440 kernel: WARNING: CPU: 25 PID: 58420 at fs/io_uring.c:10091 io_ring_exit_work+0xb9/0x533
Jun 05 18:31:31 r440 kernel: Modules linked in:
Jun 05 18:31:31 r440 kernel: CPU: 25 PID: 58420 Comm: kworker/u66:5 Tainted: G I 5.18.1-gentoo-r1 #1
Jun 05 18:31:31 r440 kernel: Hardware name: Dell Inc. PowerEdge R440/04JN2K, BIOS 2.13.3 12/17/2021
Jun 05 18:31:31 r440 kernel: Workqueue: events_unbound io_ring_exit_work
Jun 05 18:31:31 r440 kernel: RIP: 0010:io_ring_exit_work+0xb9/0x533
Jun 05 18:31:31 r440 kernel: Code: 89 e2 48 c7 c6 4a 6b c7 85 e8 91 44 21 ff 4c 89 ef e8 00 e3 ff ff 4c 89 e7 e8 99 fe ff ff 48 8b 05 ba f0 d8 00 49 39 c6 79 08 <0f> 0b 41 bf 70 17 00 00 48 8d 7b 30 4c 89 fe e8 31 64 09 00 48 85
Jun 05 18:31:31 r440 kernel: RSP: 0018:ffff9eaba017fe10 EFLAGS: 00010293
Jun 05 18:31:31 r440 kernel: RAX: 0000000100a6b928 RBX: ffff8c680b7dbbc0 RCX: 0000000000000000
Jun 05 18:31:31 r440 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8c680b7db840
Jun 05 18:31:31 r440 kernel: RBP: ffff9eaba017fe98 R08: 0000000000000000 R09: 0000000000000001
Jun 05 18:31:32 r440 kernel: R10: 0000000000000000 R11: 000000000000000e R12: ffff8c680b7db800
Jun 05 18:31:32 r440 kernel: R13: 0000000000000000 R14: 0000000100a6b926 R15: 0000000000000005
Jun 05 18:31:32 r440 kernel: FS: 0000000000000000(0000) GS:ffff8c777fd00000(0000) knlGS:0000000000000000
Jun 05 18:31:32 r440 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 05 18:31:32 r440 kernel: CR2: 000055d46e1ce000 CR3: 0000001084724001 CR4: 00000000007706e0
Jun 05 18:31:32 r440 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 05 18:31:32 r440 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jun 05 18:31:32 r440 kernel: PKRU: 55555554
Jun 05 18:31:32 r440 kernel: Call Trace:
Jun 05 18:31:32 r440 kernel:  <TASK>
Jun 05 18:31:32 r440 kernel:  ? __schedule+0x310/0x1230
Jun 05 18:31:32 r440 kernel:  ? flush_to_ldisc+0x26/0xf0
Jun 05 18:31:32 r440 kernel:  process_one_work+0x1d5/0x3a0
Jun 05 18:31:32 r440 kernel:  ? rescuer_thread+0x3c0/0x3c0
Jun 05 18:31:32 r440 kernel:  worker_thread+0x4d/0x3c0
Jun 05 18:31:32 r440 kernel:  ? rescuer_thread+0x3c0/0x3c0
Jun 05 18:31:32 r440 kernel:  kthread+0xe5/0x110
Jun 05 18:31:32 r440 kernel:  ? kthread_complete_and_exit+0x20/0x20
Jun 05 18:31:32 r440 kernel:  ret_from_fork+0x1f/0x30
Jun 05 18:31:32 r440 kernel:  </TASK>
Jun 05 18:31:32 r440 kernel: ---[ end trace 0000000000000000 ]---

The error seems related to io_uring usage (I have bdev_ioring set to true).

The kworker thread stays blocked in the kernel afterwards (state D) :
root 58420 0.0 0.0 0 0 ? D 18:23 0:00 \_ [kworker/u66:5+events_unbound]
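
To get more detail on where that worker is stuck, the kernel stack of the blocked task can be dumped (PID 58420 is the kworker from the trace above; sysrq has to be enabled for the second command) :
   cat /proc/58420/stack
   echo w > /proc/sysrq-trigger    # dumps all blocked (D state) tasks to the kernel log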

What is strange is that the device itself (/dev/sdo in this case) is perfectly available :
- I can smartctl -a /dev/sdo to look at smart logs, no hardware problems reported
- I can read from it (dd if=/dev/sdo works perfectly and gets correct data)
- I can unmount its tmpfs (umount /var/lib/ceph/osd/ceph-7)
- I can reactivate the OSD: ceph-volume lvm activate 7 95813126-23fd-471c-84eb-5c002c62471a reports "ceph-volume lvm activate successful for osd ID: 7" and the tmpfs is mounted with the correct content, but the OSD still refuses to start
- I can take the drive out of the bay and put it back in; it gets detected OK (as another drive, not on the same /dev/sd?), but then I get into LVM hell because the PV/VG/LVs are detected as duplicates and cannot be removed from the old (blocked) device. I can however destroy the old OSD and recreate it on the 'new' device without problem (this is why it is rebuilding currently, I did not try a reboot at first).
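
A possible workaround for the duplicate PV/VG/LV mess (untested here, just a sketch, the pattern has to match whatever device path is stuck) would be to hide the stale /dev/sdo path from LVM scans in /etc/lvm/lvm.conf :
   devices {
       # reject the stale path, accept everything else
       global_filter = [ "r|^/dev/sdo.*|", "a|.*|" ]
   }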

This seems to confirm that the problem comes from the use of two different kernel paths to access the device: io_uring for the BlueStore data, and the standard I/O path for everything else, with only the io_uring path being blocked.

After a reboot, the OSD starts again perfectly and is marked UP without problem, and the content is intact (or so it seems, I have yet to scrub all this once the rebuild is finished).

I had seen this error on a 5.14.7 kernel previously, but it happened very rarely (once in several months of usage, I swapped out/in the drive and rebuilt it without problem, no reboot).
It now happens several times _per day_ (kernel 5.18.0 and 5.18.1).

I have disabled io_uring (ceph config set osd bdev_ioring false) and everything seems back to normal (tested with several write loads without OSDs getting kicked out).
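
To spell out what I mean, and assuming bdev_ioring is only read when the OSD opens its block device (so each OSD needs a restart for the change to take effect) :
   ceph config set osd bdev_ioring false
   ceph config get osd bdev_ioring
   systemctl restart ceph-osd@7.service    # repeated for every OSD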

Is ceph with io_uring tested on recent kernels ?
Are there any other parameters I can try to keep io_uring working ? It is not really necessary on my home cluster, but at work, on SSD-based clusters, io_uring is showing great potential for better performance. I did not try bdev_ioring_hipri nor bdev_ioring_sqthread_poll; is it necessary to set them to true when using io_uring ?
Would Quincy make a difference ?
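
For reference, the io_uring related OSD options I am aware of are these three; treat this as a sketch of what I would test, as I have only ever enabled the first one (my understanding is that hipri selects polled I/O completions and sqthread_poll enables a kernel submission-polling thread) :
   bdev_ioring = true
   bdev_ioring_hipri = true
   bdev_ioring_sqthread_poll = true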

Regards,
Christian.

--
Christian Vilhelm : phandaal@xxxxxxxxxxxx
Reality is for people who lack imagination


