One degraded object causes all Ceph requests to hang - Jewel 10.2.6 (rbd + radosgw)

Vincent Godin <vince.mlist@xxxxxxxxx> · Thu, 4 Jan 2018 13:17:13 +0100

Yesterday we just encountered this bug. One OSD was looping on
"2018-01-03 16:20:59.148121 7f011a6a1700  0 log_channel(cluster) log
[WRN] : slow request 30.254269 seconds old, received at 2018-01-03
16:20:28.883837: osd_op(client.48285929.0:14601958 35.8abfc02e
.dir.0a3e5369-ff79-4f7d-b0b6-79c5a75b1759.29113876.1 [call
rgw.bucket_prepare_op] snapc 0=[] ondisk+write+known_if_redirected
e359833) currently waiting for degraded object".

The requests on this OSD (OSD.150) quickly ended up in a blocked state:

2018-01-03 16:25:56.241064 7f011a6a1700  0 log_channel(cluster) log
[WRN] : 20 slow requests, 1 included below; oldest blocked for >
327.357139 secs
2018-01-03 16:30:19.299288 7f011a6a1700  0 log_channel(cluster) log
[WRN] : 45 slow requests, 1 included below; oldest blocked for >
590.415387 secs
...
...
2018-01-03 16:46:04.900204 7f011a6a1700  0 log_channel(cluster) log
[WRN] : 100 slow requests, 2 included below; oldest blocked for >
1204.060056 secs

while the OSD was still looping on:

2018-01-03 16:46:04.900220 7f011a6a1700  0 log_channel(cluster) log
[WRN] : slow request 123.294762 seconds old, received at 2018-01-03
16:44:01.605320: osd_op(client.48285929.0:14605228 35.8abfc02e
.dir.0a3e5369-ff79-4f7d-b0b6-79c5a75b1759.29113876.1 [call
rgw.bucket_complete_op] snapc 0=[] ack+ondisk+write+known_if_redirected
e359833) currently waiting for degraded object
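
For anyone who wants to check whether their own slow requests all point at a single object, here is a minimal Python sketch. It assumes each warning has first been joined back onto one line (the excerpts above are wrapped by the mail client), and the field layout in the regex is inferred from these excerpts only, not from any documented log format. The names SLOW_OP, summarize and SAMPLE are made up for this example.

```python
import re
from collections import Counter

# Pattern inferred from the log excerpts in this post (an assumption, not a
# documented format). Expects one warning per line, already un-wrapped.
SLOW_OP = re.compile(
    r"slow request (?P<age>[\d.]+) seconds old.*?"
    r"osd_op\((?P<client>\S+) (?P<pg_hash>\S+) (?P<object>\S+) "
    r"\[(?P<op>[^\]]*)\].*?\) currently (?P<state>.+)$"
)


def summarize(lines):
    """Count slow requests per (object, state) and keep the largest age seen."""
    counts = Counter()
    oldest = {}
    for line in lines:
        m = SLOW_OP.search(line.strip())
        if not m:
            continue  # not a per-request slow request line
        key = (m["object"], m["state"])
        counts[key] += 1
        oldest[key] = max(oldest.get(key, 0.0), float(m["age"]))
    return counts, oldest


# The two per-request warnings quoted in this post, joined onto single lines.
SAMPLE = [
    "2018-01-03 16:20:59.148121 7f011a6a1700  0 log_channel(cluster) log "
    "[WRN] : slow request 30.254269 seconds old, received at 2018-01-03 "
    "16:20:28.883837: osd_op(client.48285929.0:14601958 35.8abfc02e "
    ".dir.0a3e5369-ff79-4f7d-b0b6-79c5a75b1759.29113876.1 [call "
    "rgw.bucket_prepare_op] snapc 0=[] ondisk+write+known_if_redirected "
    "e359833) currently waiting for degraded object",
    "2018-01-03 16:46:04.900220 7f011a6a1700  0 log_channel(cluster) log "
    "[WRN] : slow request 123.294762 seconds old, received at 2018-01-03 "
    "16:44:01.605320: osd_op(client.48285929.0:14605228 35.8abfc02e "
    ".dir.0a3e5369-ff79-4f7d-b0b6-79c5a75b1759.29113876.1 [call "
    "rgw.bucket_complete_op] snapc 0=[] ack+ondisk+write+known_if_redirected "
    "e359833) currently waiting for degraded object",
]

if __name__ == "__main__":
    counts, oldest = summarize(SAMPLE)
    for (obj, state), n in counts.most_common():
        print(f"{n} slow request(s) on {obj}: {state} "
              f"(max age {oldest[(obj, state)]:.1f}s)")
```

If one (object, state) pair dominates the output, as it would here, that object is probably the one to investigate first.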

All these requests were blocked on OSD.150.
A lot of VMs attached to Ceph were hanging.

The degraded object was
.dir.0a3e5369-ff79-4f7d-b0b6-79c5a75b1759.29113876.1 in the pg 35.2e.
This PG was located on 4 OSDs, and the object had a size of 0 on all 4 of them.
Running ceph pg 35.2e query never returned a response.
Killing OSD.150 only led to the requests being blocked on the new primary.
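
Since the PG query hung instead of answering, it may help to wrap such diagnostic commands in a timeout so a script does not hang with them. Below is a minimal Python sketch under that assumption. run_with_timeout and pg_query_cmd are names made up for this example. The only Ceph command it builds is the standard `ceph pg <pgid> query` form, and the 30-second limit is an arbitrary choice.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=30.0):
    """Run cmd and return its stdout, or None if it did not finish in time."""
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s, check=False
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left behind.
        return None
    return done.stdout


def pg_query_cmd(pgid):
    """Build the argument list for `ceph pg <pgid> query`."""
    return ["ceph", "pg", pgid, "query"]


if __name__ == "__main__":
    # Requires the ceph CLI and a reachable cluster; 35.2e is the PG from this post.
    out = run_with_timeout(pg_query_cmd("35.2e"))
    print("no answer within the timeout" if out is None else out)
```

A None result only says that the command did not answer in time. It does not say why the PG is stuck.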

I found the relatively recent bug #22072, which looks like mine, but there
had been no response from the Ceph team. I finally tried the same solution:
rados rm -p pool/degraded_object, but the command never returned. I
stopped it after 15 minutes. A few minutes later, the 4 OSDs holding
pg 35.2e suddenly rebooted and the problem was solved. The object had
been deleted on the 4 OSDs.

Anyway, this caused a production outage, and I have no idea what
produced the "degraded object". I'm also not sure whether the fix came
from my command or from an internal process. At this time we are still
trying to repair some filesystems of the VMs attached to Ceph, and I
have to explain that this whole production outage came from one empty
object ... The real question is: why was Ceph unable to handle this
"degraded object", looping on it and blocking all the requests on
OSD.150?
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com