Yesterday we encountered this bug. One OSD was looping on:

2018-01-03 16:20:59.148121 7f011a6a1700 0 log_channel(cluster) log [WRN] : slow request 30.254269 seconds old, received at 2018-01-03 16:20:28.883837: osd_op(client.48285929.0:14601958 35.8abfc02e .dir.0a3e5369-ff79-4f7d-b0b6-79c5a75b1759.29113876.1 [call rgw.bucket_prepare_op] snapc 0=[] ondisk+write+known_if_redirected e359833) currently waiting for degraded object

The requests on this OSD.150 quickly became blocked:

2018-01-03 16:25:56.241064 7f011a6a1700 0 log_channel(cluster) log [WRN] : 20 slow requests, 1 included below; oldest blocked for > 327.357139 secs
2018-01-03 16:30:19.299288 7f011a6a1700 0 log_channel(cluster) log [WRN] : 45 slow requests, 1 included below; oldest blocked for > 590.415387 secs
...
2018-01-03 16:46:04.900204 7f011a6a1700 0 log_channel(cluster) log [WRN] : 100 slow requests, 2 included below; oldest blocked for > 1204.060056 secs

while it kept looping on:

2018-01-03 16:46:04.900220 7f011a6a1700 0 log_channel(cluster) log [WRN] : slow request 123.294762 seconds old, received at 2018-01-03 16:44:01.605320: osd_op(client.48285929.0:14605228 35.8abfc02e .dir.0a3e5369-ff79-4f7d-b0b6-79c5a75b1759.29113876.1 [call rgw.bucket_complete_op] snapc 0=[] ack+ondisk+write+known_if_redirected e359833) currently waiting for degraded object

All these requests were blocked on OSD.150, and a lot of VMs attached to Ceph were hanging. The degraded object was .dir.0a3e5369-ff79-4f7d-b0b6-79c5a75b1759.29113876.1 in PG 35.2e. This PG was located on 4 OSDs, and the object had a size of 0 on all 4 of them. It was not possible to get a response from "ceph pg 35.2e query". Killing OSD.150 only led to the requests becoming blocked on the new primary. I found the relatively new bug #22072, which looks like mine, but there was no response from the Ceph team. I finally tried the same solution:

rados rm -p pool/degraded_object

but the command never returned either; I stopped it after 15 minutes.
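For anyone hitting something similar, here is a sketch of the diagnostic commands I mean (the pool name "default.rgw.buckets.index" is an assumption here, substitute your RGW bucket index pool; the last command is destructive, use it only as a last resort):

```shell
# Map the object to its PG and the OSDs serving it
# (pool name is an assumption; use your bucket index pool)
ceph osd map default.rgw.buckets.index .dir.0a3e5369-ff79-4f7d-b0b6-79c5a75b1759.29113876.1

# Check the object's size and mtime as seen by the cluster
rados -p default.rgw.buckets.index stat .dir.0a3e5369-ff79-4f7d-b0b6-79c5a75b1759.29113876.1

# Query the PG's state (in our case this hung and never returned)
ceph pg 35.2e query

# Dump blocked requests via the admin socket (run on the OSD's host)
ceph daemon osd.150 dump_ops_in_flight

# Last resort, destructive: remove the stuck object
rados -p default.rgw.buckets.index rm .dir.0a3e5369-ff79-4f7d-b0b6-79c5a75b1759.29113876.1
```

All of these except the admin-socket dump can hang if the primary OSD is itself stuck, which is exactly what happened here.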
A few minutes later, the 4 OSDs holding PG 35.2e suddenly rebooted and the problem was solved: the object was deleted on all 4 OSDs. Still, this led to a production outage, and I have no idea what produced the "degraded object" in the first place; I am not even sure whether the fix came from my command or from some internal process. At this time we are still trying to repair some filesystems of the VMs attached to Ceph, and I have to explain that this whole production outage came from one empty object...

The real question is: why was Ceph unable to handle this "degraded object", looping on it and blocking all the requests on OSD.150?

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com