Hi all,
I have a problem with OSDs getting thrown out of the cluster under very
low activity, and then not being startable again without a reboot.
Context:
- single node home cluster, 15 OSDs (rotational 10/12 TB SAS drives),
failure domain = OSD
- 8+3 EC CephFS data pool, 3/1 replicated CephFS metadata pool
- no separate WAL/RocksDB device
- Pacific 16.2.9 (all OSDs were (re)created under Pacific some time
ago due to hardware changes)
- Gentoo Linux, kernel 5.18.1
- I have some parameters set in ceph.conf:
bluestore_default_buffered_write = true
bluefs_buffered_io = true
bdev_ioring = true
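For reference, the values an OSD actually runs with can be checked on a
running daemon (osd.7 is just an example id):
# effective (running) value as reported by the daemon
ceph config show osd.7 bdev_ioring
# or ask the daemon directly through its admin socket
ceph daemon osd.7 config get bdev_ioring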
ceph status:
cluster:
id: 8c03b3b4-13cb-4bc1-9894-9ee71a3fe09a
health: HEALTH_WARN
noout,noscrub,nodeep-scrub flag(s) set
Degraded data redundancy: 4581987/268329121 objects degraded
(1.708%), 45 pgs degraded, 48 pgs undersized
services:
mon: 1 daemons, quorum r440 (age 13h)
mgr: r440(active, since 13h)
mds: 1/1 daemons up
osd: 15 osds: 15 up (since 13h), 15 in (since 4w); 55 remapped pgs
flags noout,noscrub,nodeep-scrub
data:
volumes: 1/1 healthy
pools: 3 pools, 161 pgs
objects: 24.67M objects, 74 TiB
usage: 100 TiB used, 60 TiB / 160 TiB avail
pgs: 4581987/268329121 objects degraded (1.708%)
1403132/268329121 objects misplaced (0.523%)
106 active+clean
45 active+undersized+degraded+remapped+backfill_wait
7 active+remapped+backfill_wait
2 active+undersized+remapped+backfilling
1 active+undersized+remapped+backfill_wait
io:
recovery: 95 MiB/s, 28 objects/s
Here are the steps leading to the problem:
- the cluster is rebuilding an OSD at moderate speed (2 backfills, 20
requests, 0.1 sleep; the corresponding settings are sketched after this
list), reading at 8-16 MB/s and writing at 16 MB/s on the rebuilding
OSD. Disk utilisation is stable around 50%, OSD latencies are around
2 ms and stable too. I tried slowing down the backfilling (just 1
backfill, fewer requests) with no effect.
- just using the storage (for example writing a single 700 MB file), or
doing nothing at all and simply letting the rebuild process work. It
happens on writes, not on reads; reading several files at full speed
does not trigger the problem.
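For completeness, "2 backfills, 20 requests, 0.1 sleep" above is my
shorthand for the usual recovery knobs, roughly (the mapping onto option
names is mine, not a copy/paste of my shell history):
ceph config set osd osd_max_backfills 2
ceph config set osd osd_recovery_max_active 20
ceph config set osd osd_recovery_sleep_hdd 0.1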
Suddenly, one of the OSDs (it can happen to any of them) throws out
lots of:
heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fc1bfbec640' had
timed out after 15.000000954s
and is finally killed with:
heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fc1be3e9640' had
suicide timed out after 150.000000000s
The ceph-osd process is killed and the OSD is marked down.
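For reference, the 15 s and 150 s values in those messages come from
osd_op_thread_timeout and osd_op_thread_suicide_timeout. They can be
raised, for example:
ceph config set osd osd_op_thread_suicide_timeout 600
but as far as I understand this only delays the suicide, it does not fix
whatever is stalling the op thread.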
The cluster is set noout, so I try to restart the process (systemctl
start ceph-osd@7.service), but the device is blocked:
2022-06-05T18:26:42.282+0200 7f22f4cf30c0 0 bdev(0x5565a4c2d000
/var/lib/ceph/osd/ceph-7/block) ioctl(F_SET_FILE_RW_HINT) on
/var/lib/ceph/osd/ceph-7/block failed: (22) Invalid argument
2022-06-05T18:26:42.282+0200 7f22f4cf30c0 1 bdev(0x5565a4c2d000
/var/lib/ceph/osd/ceph-7/block) _lock flock busy on
/var/lib/ceph/osd/ceph-7/block
2022-06-05T18:26:42.382+0200 7f22f4cf30c0 1 bdev(0x5565a4c2d000
/var/lib/ceph/osd/ceph-7/block) _lock flock busy on
/var/lib/ceph/osd/ceph-7/block
2022-06-05T18:26:42.482+0200 7f22f4cf30c0 1 bdev(0x5565a4c2d000
/var/lib/ceph/osd/ceph-7/block) _lock flock busy on
/var/lib/ceph/osd/ceph-7/block
2022-06-05T18:26:42.592+0200 7f22f4cf30c0 1 bdev(0x5565a4c2d000
/var/lib/ceph/osd/ceph-7/block) _lock flock busy on
/var/lib/ceph/osd/ceph-7/block
2022-06-05T18:26:42.592+0200 7f22f4cf30c0 -1 bdev(0x5565a4c2d000
/var/lib/ceph/osd/ceph-7/block) open failed to lock
/var/lib/ceph/osd/ceph-7/block: (11) Resource temporarily unavailable
2022-06-05T18:26:44.732+0200 7f22f4cf30c0 -1 osd.7 0 OSD:init: unable to
mount object store
2022-06-05T18:26:44.732+0200 7f22f4cf30c0 -1 ** ERROR: osd init failed:
(11) Resource temporarily unavailable
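For anyone hitting the same thing, a quick way to check whether some
user-space process still holds the block device open is something like:
# resolve the bluestore 'block' symlink to the real LV device first
lsof "$(readlink -f /var/lib/ceph/osd/ceph-7/block)"
fuser -v "$(readlink -f /var/lib/ceph/osd/ceph-7/block)"
which should list the stale holder, if there is one in user space.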
The complete OSD log is available at:
https://phandaal.net/nextcloud/s/cLpwykMmPtp6pJo
In the kernel logs I get:
Jun 05 18:26:32 r440 kernel: libceph: osd7 down
and right after:
Jun 05 18:31:31 r440 kernel: ------------[ cut here ]------------
Jun 05 18:31:31 r440 kernel: WARNING: CPU: 25 PID: 58420 at
fs/io_uring.c:10091 io_ring_exit_work+0xb9/0x533
Jun 05 18:31:31 r440 kernel: Modules linked in:
Jun 05 18:31:31 r440 kernel: CPU: 25 PID: 58420 Comm: kworker/u66:5
Tainted: G I 5.18.1-gentoo-r1 #1
Jun 05 18:31:31 r440 kernel: Hardware name: Dell Inc. PowerEdge
R440/04JN2K, BIOS 2.13.3 12/17/2021
Jun 05 18:31:31 r440 kernel: Workqueue: events_unbound io_ring_exit_work
Jun 05 18:31:31 r440 kernel: RIP: 0010:io_ring_exit_work+0xb9/0x533
Jun 05 18:31:31 r440 kernel: Code: 89 e2 48 c7 c6 4a 6b c7 85 e8 91 44
21 ff 4c 89 ef e8 00 e3 ff ff 4c 89 e7 e8 99 fe ff ff 48 8b 05 ba f0 d8
00 49 39 c6 79 08 <0f> 0b 41 bf 70 17 00 00 48 8d 7b 30 4c 89 fe e8 31
64 09 00 48 85
Jun 05 18:31:31 r440 kernel: RSP: 0018:ffff9eaba017fe10 EFLAGS: 00010293
Jun 05 18:31:31 r440 kernel: RAX: 0000000100a6b928 RBX: ffff8c680b7dbbc0
RCX: 0000000000000000
Jun 05 18:31:31 r440 kernel: RDX: 0000000000000000 RSI: 0000000000000000
RDI: ffff8c680b7db840
Jun 05 18:31:31 r440 kernel: RBP: ffff9eaba017fe98 R08: 0000000000000000
R09: 0000000000000001
Jun 05 18:31:32 r440 kernel: R10: 0000000000000000 R11: 000000000000000e
R12: ffff8c680b7db800
Jun 05 18:31:32 r440 kernel: R13: 0000000000000000 R14: 0000000100a6b926
R15: 0000000000000005
Jun 05 18:31:32 r440 kernel: FS: 0000000000000000(0000)
GS:ffff8c777fd00000(0000) knlGS:0000000000000000
Jun 05 18:31:32 r440 kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Jun 05 18:31:32 r440 kernel: CR2: 000055d46e1ce000 CR3: 0000001084724001
CR4: 00000000007706e0
Jun 05 18:31:32 r440 kernel: DR0: 0000000000000000 DR1: 0000000000000000
DR2: 0000000000000000
Jun 05 18:31:32 r440 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0
DR7: 0000000000000400
Jun 05 18:31:32 r440 kernel: PKRU: 55555554
Jun 05 18:31:32 r440 kernel: Call Trace:
Jun 05 18:31:32 r440 kernel: <TASK>
Jun 05 18:31:32 r440 kernel: ? __schedule+0x310/0x1230
Jun 05 18:31:32 r440 kernel: ? flush_to_ldisc+0x26/0xf0
Jun 05 18:31:32 r440 kernel: process_one_work+0x1d5/0x3a0
Jun 05 18:31:32 r440 kernel: ? rescuer_thread+0x3c0/0x3c0
Jun 05 18:31:32 r440 kernel: worker_thread+0x4d/0x3c0
Jun 05 18:31:32 r440 kernel: ? rescuer_thread+0x3c0/0x3c0
Jun 05 18:31:32 r440 kernel: kthread+0xe5/0x110
Jun 05 18:31:32 r440 kernel: ? kthread_complete_and_exit+0x20/0x20
Jun 05 18:31:32 r440 kernel: ret_from_fork+0x1f/0x30
Jun 05 18:31:32 r440 kernel: </TASK>
Jun 05 18:31:32 r440 kernel: ---[ end trace 0000000000000000 ]---
The error seems related to io_uring usage (I have bdev_ioring set to
true).
The kernel worker thread stays blocked in the kernel afterwards (state D):
root 58420 0.0 0.0 0 0 ? D 18:23 0:00 \_
[kworker/u66:5+events_unbound]
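For what it is worth, the kernel stack of that blocked worker can be
dumped as root, which may help see where io_ring_exit_work is stuck:
cat /proc/58420/stack
or, if sysrq is enabled, all blocked (D state) tasks can be dumped to
the kernel log with:
echo w > /proc/sysrq-trigger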
What is strange is that the device itself (/dev/sdo in this case) is
perfectly available:
- I can run smartctl -a /dev/sdo to look at the SMART logs; no hardware
problems are reported
- I can read from it (dd if=/dev/sdo works perfectly and returns correct
data)
- I can unmount its tmpfs (umount /var/lib/ceph/osd/ceph-7)
- I can reactivate the OSD (ceph-volume lvm activate 7
95813126-23fd-471c-84eb-5c002c62471a reports "ceph-volume lvm activate
successful for osd ID: 7", and the tmpfs is mounted with the correct
content, but the OSD still refuses to start)
- I can pull the drive from the bay and reinsert it; it is detected OK
(as another drive, not on the same /dev/sd?), but then I get into LVM
hell because the PV/VG/LVs are detected as duplicates and cannot be
removed from the old (blocked) device. I can, however, destroy the old
OSD and recreate it on the 'new' device without problems (this is why it
is currently rebuilding; I did not try a reboot at first).
This seems to confirm that the problem comes from using a different
kernel path to access the device, in this case io_uring for the
BlueStore data and standard I/O for the other accesses, with only one of
the paths being blocked.
After a reboot, the OSD starts again perfectly and is marked up without
problems, and the content is intact (or so it seems, I still have to
scrub everything once the rebuild is finished).
I had seen this error previously on a 5.14.7 kernel, but it happened
very rarely (once in several months of usage; I swapped the drive out
and back in and rebuilt it without problems, no reboot needed).
It now happens several times _per day_ (kernels 5.18.0 and 5.18.1).
I have disabled io_uring (ceph config set osd bdev_ioring false) and
everything seems back to normal (tested with several write loads without
OSDs getting kicked out).
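As far as I can tell bdev_ioring is only read when the OSD opens its
block device, so the OSDs need a restart for the change to apply, and a
value still present in the local ceph.conf should take precedence over
'ceph config set' (so the ceph.conf line is worth commenting out too).
Roughly:
systemctl restart ceph-osd.target
ceph config show osd.7 bdev_ioring   # confirm the effective value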
Is Ceph with io_uring tested on recent kernels?
Are there any other parameters I can try to keep io_uring working? It is
not really necessary on my home cluster, but at work io_uring is showing
great potential for better performance on SSD-based clusters.
I did not try bdev_ioring_hipri or bdev_ioring_sqthread_poll; is it
necessary to set them to true when using io_uring?
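For reference, my reading of the option descriptions (happy to be
corrected) is that both default to false and only select polling modes
on top of io_uring, so they should not be required just to enable it:
bdev_ioring = true
bdev_ioring_hipri = false          # polled IO completions, as I understand it
bdev_ioring_sqthread_poll = false  # kernel submission-queue polling, as I understand it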
Would Quincy make a difference?
Regards,
Christian.
--
Christian Vilhelm : phandaal@xxxxxxxxxxxx
Reality is for people who lack imagination