Hi All,
Tonight we had an incident in our hyperconverged Proxmox cluster running
Ceph v16.2.6.
At 04:56:08, OSD.12 on node1 started to report slow ops.
At 04:56:09, OSD.15 on node4 started to report slow ops.
Here is our OSD tree:
ID   CLASS  WEIGHT    TYPE NAME       STATUS  REWEIGHT  PRI-AFF
 -1         16.98720  root default
-13          3.56488      host node1
  2    hdd   0.90919          osd.2       up   1.00000  1.00000
  3    hdd   0.90919          osd.3       up   1.00000  1.00000
 12    ssd   1.74649          osd.12      up   1.00000  1.00000
-15          4.47406      host node2
  0    hdd   0.90919          osd.0       up   1.00000  1.00000
  4    hdd   0.90918          osd.4       up   1.00000  1.00000
  9    hdd   0.90919          osd.9       up   1.00000  1.00000
 13    ssd   1.74649          osd.13      up   1.00000  1.00000
 -6          4.47408      host node3
  8    hdd   0.90919          osd.8       up   1.00000  1.00000
 10    hdd   0.90919          osd.10      up   1.00000  1.00000
 11    hdd   0.90919          osd.11      up   1.00000  1.00000
 14    ssd   1.74649          osd.14      up   1.00000  1.00000
 -5          4.47418      host node4
  1    hdd   0.90919          osd.1       up   1.00000  1.00000
  6    hdd   0.90919          osd.6       up   1.00000  1.00000
  7    hdd   0.90919          osd.7       up   1.00000  1.00000
 15    ssd   1.74660          osd.15      up   1.00000  1.00000
The Ceph network is 10Gbps. We have one SSD replicated pool (size=3,
min_size=2) and one HDD replicated pool (size=3, min_size=1). PG
autoscaling is in warn mode. Both pools are about 25% occupied.
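(The exact pool settings and occupancy figures behind that summary can be
checked with the standard commands; we have not pasted the output here:

  ceph osd pool ls detail
  ceph osd pool autoscale-status
  ceph df
)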
We can see that node1's inbound network bandwidth peaked at a stable
~83 Mb/s around that time.
We can see that node4's outbound network bandwidth peaked at a stable
~85 Mb/s around that time.
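In case it helps anyone debugging something similar: an ad-hoc throughput
test between the two nodes can rule out the 10Gbps link itself. A minimal
sketch, where 10.10.10.1 stands in for node1's Ceph-network IP
(hypothetical address):

  # on node1:
  iperf3 -s
  # on node4, test for 30 seconds:
  iperf3 -c 10.10.10.1 -t 30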
The issue was resolved by restarting OSD.15. Within about a minute,
bandwidth went back to normal and the slow ops disappeared. Some VMs had
to be rebooted, but so far so good :)
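(For completeness: the restart is just the standard systemd unit restart;
Proxmox names the units ceph-osd@<id>:

  systemctl restart ceph-osd@15
)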
It looks as if OSD.15 was sending data to OSD.12 but it was being refused? :-)
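In hindsight, dumping the stuck ops via the admin socket before restarting
anything would have shown exactly which sub ops were pending and on which
peer (run on the respective nodes):

  ceph daemon osd.15 dump_ops_in_flight
  ceph daemon osd.12 dump_ops_in_flight
  ceph daemon osd.15 dump_historic_slow_ops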
Both OSDs have an SSD backing disk, with WAL/DB/data on the same disk.
All SSD OSDs are BlueStore, as are some of the HDD OSDs.
No disk issues are reported in syslog or dmesg for those two SSDs.
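(Beyond syslog/dmesg, drive health can also be cross-checked via SMART and
Ceph's device tracking; /dev/sdX and <devid> below are placeholders:

  smartctl -a /dev/sdX
  ceph device ls
  ceph device get-health-metrics <devid>
)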
Logs for OSDs:
--- OSD.15 node4
2022-01-12T04:46:40.737+0100 7fd28344a700 4 rocksdb:
[db_impl/db_impl.cc:849] ------- DUMPING STATS -------
2022-01-12T04:46:40.737+0100 7fd28344a700 4 rocksdb:
[db_impl/db_impl.cc:851]
[... stripped stats ...]
2022-01-12T04:56:09.491+0100 7fd28e460700 0 log_channel(cluster) log
[WRN] : slow request osd_op(client.612312938.0:57976492 7.3b
7:dcd73c34:::rbd_data.a001436b8b4567.0000000000000062:head [write
1855488~8192 in=8192b] snapc 0=[] ondisk+write+known_if_redirected
e44032) initiated 2022-01-12T04:55:39.232963+0100 currently waiting for
sub ops
2022-01-12T04:56:09.491+0100 7fd28e460700 -1 osd.15 44032
get_health_metrics reporting 1 slow ops, oldest is
osd_op(client.612312938.0:57976492 7.3b
7:dcd73c34:::rbd_data.a001436b8b4567.0000000000000062:head [write
1855488~8192 in=8192b] snapc 0=[] ondisk+write+known_if_redirected e44032)
2022-01-12T04:56:10.499+0100 7fd28e460700 0 log_channel(cluster) log
[WRN] : slow request osd_op(client.612312938.0:57976492 7.3b
7:dcd73c34:::rbd_data.a001436b8b4567.0000000000000062:head [write
1855488~8192 in=8192b] snapc 0=[] ondisk+write+known_if_redirected
e44032) initiated 2022-01-12T04:55:39.232963+0100 currently waiting for
sub ops
2022-01-12T04:56:10.499+0100 7fd28e460700 -1 osd.15 44032
get_health_metrics reporting 1 slow ops, oldest is
osd_op(client.612312938.0:57976492 7.3b
7:dcd73c34:::rbd_data.a001436b8b4567.0000000000000062:head [write
1855488~8192 in=8192b] snapc 0=[] ondisk+write+known_if_redirected e44032)
2022-01-12T04:56:11.519+0100 7fd28e460700 0 log_channel(cluster) log
[WRN] : slow request osd_op(client.612312938.0:57976492 7.3b
7:dcd73c34:::rbd_data.a001436b8b4567.0000000000000062:head [write
1855488~8192 in=8192b] snapc 0=[] ondisk+write+known_if_redirected
e44032) initiated 2022-01-12T04:55:39.232963+0100 currently waiting for
sub ops
2022-01-12T04:56:11.519+0100 7fd28e460700 0 log_channel(cluster) log
[WRN] : slow request osd_op(client.612312662.0:8092600 7.4d
7:b36b6fe0:::rbd_data.9fe7186b8b4567.0000000000001c22:head [write
2367488~16384 in=16384b] snapc 0=[] ondisk+write+known_if_redirected
e44032) initiated 2022-01-12T04:55:40.907230+0100 currently waiting for
sub ops
2022-01-12T04:56:11.519+0100 7fd28e460700 -1 osd.15 44032
get_health_metrics reporting 2 slow ops, oldest is
osd_op(client.612312938.0:57976492 7.3b
7:dcd73c34:::rbd_data.a001436b8b4567.0000000000000062:head [write
1855488~8192 in=8192b] snapc 0=[] ondisk+write+known_if_redirected e44032)
2022-01-12T04:56:12.515+0100 7fd28e460700 0 log_channel(cluster) log
[WRN] : slow request osd_op(client.612312938.0:57976492 7.3b
7:dcd73c34:::rbd_data.a001436b8b4567.0000000000000062:head [write
1855488~8192 in=8192b] snapc 0=[] ondisk+write+known_if_redirected
e44032) initiated 2022-01-12T04:55:39.232963+0100 currently waiting for
sub ops
2022-01-12T04:56:12.515+0100 7fd28e460700 0 log_channel(cluster) log
[WRN] : slow request osd_op(client.612312662.0:8092600 7.4d
7:b36b6fe0:::rbd_data.9fe7186b8b4567.0000000000001c22:head [write
2367488~16384 in=16384b] snapc 0=[] ondisk+write+known_if_redirected
e44032) initiated 2022-01-12T04:55:40.907230+0100 currently waiting for
sub ops
2022-01-12T04:56:12.515+0100 7fd28e460700 -1 osd.15 44032
get_health_metrics reporting 2 slow ops, oldest is
osd_op(client.612312938.0:57976492 7.3b
7:dcd73c34:::rbd_data.a001436b8b4567.0000000000000062:head [write
1855488~8192 in=8192b] snapc 0=[] ondisk+write+known_if_redirected e44032)
2022-01-12T04:56:13.491+0100 7fd28e460700 0 log_channel(cluster) log
[WRN] : slow request osd_op(client.612312938.0:57976492 7.3b
7:dcd73c34:::rbd_data.a001436b8b4567.0000000000000062:head [write
1855488~8192 in=8192b] snapc 0=[] ondisk+write+known_if_redirected
e44032) initiated 2022-01-12T04:55:39.232963+0100 currently waiting for
sub ops
2022-01-12T04:56:13.491+0100 7fd28e460700 0 log_channel(cluster) log
[WRN] : slow request osd_op(client.612312662.0:8092600 7.4d
7:b36b6fe0:::rbd_data.9fe7186b8b4567.0000000000001c22:head [write
2367488~16384 in=16384b] snapc 0=[] ondisk+write+known_if_redirected
e44032) initiated 2022-01-12T04:55:40.907230+0100 currently waiting for
sub ops
2022-01-12T04:56:13.491+0100 7fd28e460700 0 log_channel(cluster) log
[WRN] : slow request osd_op(client.612312785.0:96260005 7.45
7:a37c8881:::rbd_data.a002336b8b4567.0000000000000475:head [write
3395584~53248 in=53248b] snapc 0=[] ondisk+write+known_if_redirected
e44032) initiated 2022-01-12T04:55:43.436038+0100 currently waiting for
sub ops
[...]
--- OSD.12 node1
2022-01-12T04:53:55.440+0100 7f1f79d13700 0 log_channel(cluster) log
[DBG] : 7.71 scrub ok
2022-01-12T04:56:08.999+0100 7f1f8f53e700 0 log_channel(cluster) log
[WRN] : slow request osd_op(client.612313472.0:156390670 7.56
7:6ad61629:::rbd_data.6cd97d6b8b4567.00000000000001a0:head [write
1015808~16384 in=16384b] snapc 0=[] ondisk+write+known_if_redirected
e44032) initiated 2022-01-12T04:55:38.959449+0100 currently waiting for
sub ops
2022-01-12T04:56:08.999+0100 7f1f8f53e700 -1 osd.12 44032
get_health_metrics reporting 1 slow ops, oldest is
osd_op(client.612313472.0:156390670 7.56
7:6ad61629:::rbd_data.6cd97d6b8b4567.00000000000001a0:head [write
1015808~16384 in=16384b] snapc 0=[] ondisk+write+known_if_redirected e44032)
2022-01-12T04:56:09.999+0100 7f1f8f53e700 0 log_channel(cluster) log
[WRN] : slow request osd_op(client.612313472.0:156390670 7.56
7:6ad61629:::rbd_data.6cd97d6b8b4567.00000000000001a0:head [write
1015808~16384 in=16384b] snapc 0=[] ondisk+write+known_if_redirected
e44032) initiated 2022-01-12T04:55:38.959449+0100 currently waiting for
sub ops
2022-01-12T04:56:09.999+0100 7f1f8f53e700 0 log_channel(cluster) log
[WRN] : slow request osd_op(client.612312938.0:57976478 7.73
7:ce752f82:::rbd_data.a001436b8b4567.0000000000000064:head [write
1855488~16384 in=16384b] snapc 0=[] ondisk+write+known_if_redirected
e44032) initiated 2022-01-12T04:55:39.231786+0100 currently waiting for
sub ops
2022-01-12T04:56:09.999+0100 7f1f8f53e700 0 log_channel(cluster) log
[WRN] : slow request osd_op(client.612312938.0:57976558 7.73
7:ce752f82:::rbd_data.a001436b8b4567.0000000000000064:head [write
425984~16384 in=16384b] snapc 0=[] ondisk+write+known_if_redirected
e44032) initiated 2022-01-12T04:55:39.233680+0100 currently waiting for
sub ops
2022-01-12T04:56:09.999+0100 7f1f8f53e700 0 log_channel(cluster) log
[WRN] : slow request osd_op(client.612312938.0:57976590 7.73
7:ce752f82:::rbd_data.a001436b8b4567.0000000000000064:head [write
3506176~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected
e44032) initiated 2022-01-12T04:55:39.234269+0100 currently waiting for
sub ops
2022-01-12T04:56:09.999+0100 7f1f8f53e700 -1 osd.12 44032
get_health_metrics reporting 4 slow ops, oldest is
osd_op(client.612313472.0:156390670 7.56
7:6ad61629:::rbd_data.6cd97d6b8b4567.00000000000001a0:head [write
1015808~16384 in=16384b] snapc 0=[] ondisk+write+known_if_redirected e44032)
2022-01-12T04:56:10.951+0100 7f1f8f53e700 0 log_channel(cluster) log
[WRN] : slow request osd_op(client.612313472.0:156390670 7.56
7:6ad61629:::rbd_data.6cd97d6b8b4567.00000000000001a0:head [write
1015808~16384 in=16384b] snapc 0=[] ondisk+write+known_if_redirected
e44032) initiated 2022-01-12T04:55:38.959449+0100 currently waiting for
sub ops
2022-01-12T04:56:10.951+0100 7f1f8f53e700 0 log_channel(cluster) log
[WRN] : slow request osd_op(client.612312938.0:57976478 7.73
7:ce752f82:::rbd_data.a001436b8b4567.0000000000000064:head [write
1855488~16384 in=16384b] snapc 0=[] ondisk+write+known_if_redirected
e44032) initiated 2022-01-12T04:55:39.231786+0100 currently waiting for
sub ops
2022-01-12T04:56:10.951+0100 7f1f8f53e700 0 log_channel(cluster) log
[WRN] : slow request osd_op(client.612312938.0:57976558 7.73
7:ce752f82:::rbd_data.a001436b8b4567.0000000000000064:head [write
425984~16384 in=16384b] snapc 0=[] ondisk+write+known_if_redirected
e44032) initiated 2022-01-12T04:55:39.233680+0100 currently waiting for
sub ops
2022-01-12T04:56:10.951+0100 7f1f8f53e700 0 log_channel(cluster) log
[WRN] : slow request osd_op(client.612312938.0:57976590 7.73
7:ce752f82:::rbd_data.a001436b8b4567.0000000000000064:head [write
3506176~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected
e44032) initiated 2022-01-12T04:55:39.234269+0100 currently waiting for
sub ops
2022-01-12T04:56:10.951+0100 7f1f8f53e700 0 log_channel(cluster) log
[WRN] : slow request osd_op(client.612313472.0:156390838 7.77
7:efe2ab8b:::rbd_data.6cd97d6b8b4567.0000000000000299:head [write
3948544~16384 in=16384b] snapc 0=[] ondisk+write+known_if_redirected
e44032) initiated 2022-01-12T04:55:40.502724+0100 currently waiting for
sub ops
[...]
Now everything is fine and stable, but this incident affected our service.
Does this look like a known issue, and/or can we provide additional info?
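For instance, we can easily attach the output of the usual status commands:

  ceph -s
  ceph health detail
  ceph versions
  ceph config dump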
OSD.15 was recreated about 40 days ago because of another issue (DB
corruption; see an earlier thread by me and Igor).
Thanks
Eneko Lacunza
Technical Director
Binovo IT Human Project
Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/