Hello,

We have a cluster of 10 Ceph servers. On that cluster there is an EC pool with a replicated SSD cache tier, used by OpenStack Cinder as volume storage for our production environment.
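The tier was created with the usual cache-tiering commands, along these lines (the pool names and the writeback mode here are illustrative, not copied from our configuration):

    ceph osd tier add cinder-ec cinder-cache
    ceph osd tier cache-mode cinder-cache writeback
    ceph osd tier set-overlay cinder-ec cinder-cache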
For the past 2 days we have been observing messages like this in the logs, for example:

2017-07-05 10:50:13.451987 osd.114 [WRN] slow request
1165.927215 seconds old, received at 2017-07-05 10:30:47.104746:
osd_op(osd.130.50779:43441 11.57a05c54
rbd_data.5bc14d3135d111a.0000000000000084 [copy-get max 8388608]
snapc 0=[]
ack+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
e50881) currently waiting for rw locks
We've analyzed the logs and found that the RBD image listed above (rbd_data.5bc14d3135d111a) has been causing the problem from the very beginning; a rough sketch of how we pulled this out of the logs is below. The virtual machine using that volume (OpenStack uses the Ceph cluster as backend storage for Cinder) is DOWN/STOPPED, so our conclusion is that the problem lies on the cluster side, not the client side.

This unfortunately results in a huge number of blocked requests and growing RAM consumption, until the system restarts the OSD daemon and the situation starts to repeat. We've tried to temporarily mark the problematic OSDs down, but the problem just propagates to a different OSD pair.
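This is roughly how we correlated the slow-request warnings with that image, grepping the cluster log on a monitor node (the log path is the default one; the pipeline is only a sketch of the analysis we did):

    # count "waiting for rw locks" slow requests per RBD image prefix
    grep 'slow request' /var/log/ceph/ceph.log \
      | grep 'waiting for rw locks' \
      | grep -o 'rbd_data\.[0-9a-f]*' \
      | sort | uniq -c | sort -rn | head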
Running "ceph daemon osd.<ID> dump_ops_in_flight" against a problematic OSD causes that OSD to hang, and within a few minutes it is marked down by the cluster, with no output from the command.
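For reference, this is roughly what we have been running on the hosts of the affected OSDs (osd.114 is one of the reporters from the log above; the timeout wrapper is only there to illustrate that the command never returns):

    # temporarily mark one of the problematic OSDs down
    ceph osd down 114

    # query in-flight ops over the OSD's admin socket; on the problematic OSDs
    # this hangs with no output until the OSD is marked down
    timeout 30 ceph daemon osd.114 dump_ops_in_flight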
Could anyone tell us what those log messages mean? Has anyone had such a problem and could help us diagnose/repair it?

Thanks for any help