Hello Cephers,
I am trying to find the cause of multiple slow ops that happened on my small cluster. It has 3 nodes with 9 OSDs each:
Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
128 GB RAM
Each OSD is an Intel DC S3710 800GB SSD.
The cluster runs Mimic 13.2.2 in containers.
The cluster had been operating normally for 4 months, and then recently I had an outage with multiple VMs (RBD-backed) showing:
Mar 8 07:59:42 sbc12n2-chi.siptalk.com kernel: [140206.243812] INFO: task xfsaild/vda1:404 blocked for more than 120 seconds.
Mar 8 07:59:42 sbc12n2-chi.siptalk.com kernel: [140206.243957] Not tainted 4.19.5-1.el7.elrepo.x86_64 #1
Mar 8 07:59:42 sbc12n2-chi.siptalk.com kernel: [140206.244063] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 8 07:59:42 sbc12n2-chi.siptalk.com kernel: [140206.244181] xfsaild/vda1 D 0 404 2 0x80000000
After examining the Ceph logs, I found the following entries on multiple OSDs:
Mar 8 07:38:52 storage1n2-chi ceph-osd-run.sh[20939]: 2019-03-08 07:38:52.299 7fe0bdb8f700 -1 osd.13 502 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.148553.0:5996289 7.fe 7:7f0ebfe2:::rbd_data.17bab2eb141f2.000000000000023d:head [stat,write 2588672~16384] snapc 0=[] ondisk+write+known_if_redirected e502)
Mar 8 07:38:53 storage1n2-chi ceph-osd-run.sh[20939]: 2019-03-08 07:38:53.347 7fe0bdb8f700 -1 osd.13 502 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.148553.0:5996289 7.fe 7:7f0ebfe2:::rbd_data.17bab2eb141f2.00000000
Mar 8 07:43:05 storage1n2-chi ceph-osd-run.sh[28089]: 2019-03-08 07:43:05.360 7f32536bd700 -1 osd.7 502 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.152215.0:7037343 7.1e 7:78d776e4:::rbd_data.27e662eb141f2.0000000000000436:head [stat,write 393216~16384] snapc 0=[] ondisk+write+known_if_redirected e502)
Mar 8 07:43:06 storage1n2-chi ceph-osd-run.sh[28089]: 2019-03-08 07:43:06.332 7f32536bd700 -1 osd.7 502 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.152215.0:7037343 7.1e 7:78d776e4:::rbd_data.27e662eb141f2.0000000000000436:head [stat,write 393216~16384] snapc 0=[] ondisk+write+known_if_redirected e502)
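If I understand the admin socket commands correctly, my next step would be to pull the op history off the affected OSDs and see which phase the slow ops were stuck in (waiting for subops, waiting on the device, etc.). Something along these lines, run on the node hosting the OSD; in my containerized setup I would have to exec into the OSD container first, so the exact invocation is a guess:

  ceph daemon osd.13 dump_ops_in_flight        # ops currently being processed
  ceph daemon osd.13 dump_historic_ops         # recently completed ops with per-phase timestamps
  ceph daemon osd.13 dump_historic_slow_ops    # recent ops that crossed the slow-op threshold

Please correct me if that is not the right place to look.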
The messages were showing up on all nodes and affected several OSDs on each node. The trouble started at approximately 07:30 am and ended about 30 minutes later. I have not seen any slow ops since then, and no VM has shown a kernel hang-up since. I also want to note that the load on the cluster was minimal at the time. Here is my ceph status. Please let me know where I should start looking, as the cluster cannot go into production with these failures.
  cluster:
    id:     054890af-aef7-46cf-a179-adc9170e3958
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum storage1n1-chi,storage1n2-chi,storage1n3-chi
    mgr: storage1n3-chi(active), standbys: storage1n1-chi, storage1n2-chi
    mds: cephfs-1/1/1 up {0=storage1n2-chi=up:active}, 2 up:standby
    osd: 27 osds: 27 up, 27 in
    rgw: 3 daemons active

  data:
    pools:   7 pools, 608 pgs
    objects: 1.46 M objects, 697 GiB
    usage:   3.0 TiB used, 17 TiB / 20 TiB avail
    pgs:     608 active+clean

  io:
    client:  0 B/s rd, 91 KiB/s wr, 6 op/s rd, 10 op/s wr
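In case it helps frame an answer: I assume the usual suspects are a drive that briefly stalled or a network hiccup between the nodes, so my plan is to start with per-OSD latency and drive health, roughly like this (device names below are placeholders for my actual OSD disks):

  ceph osd perf                                # commit/apply latency per OSD, to spot an outlier
  ceph health detail                           # anything lingering beyond HEALTH_OK
  dmesg -T | grep -iE 'ata|scsi|error|reset'   # controller/drive errors on each OSD host
  smartctl -a /dev/sdX                         # SMART data for each DC S3710 (wear, reallocated sectors)
  iostat -x 5                                  # per-device await/util while the cluster has load

If there are better starting points, I would appreciate pointers.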
Thank you in advance,