As no response was given, I will explain what I found; maybe it can help other people.

The .dirXXXXXXX object is an index marker with a data size of 0. The metadata associated with this object (located in the LevelDB of the OSDs currently holding the marker) is the index of the bucket corresponding to it. My problem came from the number of objects stored in this bucket: more than 50 million. As each entry in the index takes between 200 and 250 bytes, the index was around 12 GB. That is why it is recommended to add one shard to the index for every 100,000 objects.

During a Ceph rebuild, some PGs move from some OSDs to others. While an index is moving, all write requests to the bucket are blocked until the operation completes. During this move, the user had launched an upload batch on the bucket, so a lot of requests were blocked, which ended up blocking all requests on the primary PGs held by the OSD. So the loop I saw was in fact just normal, but moving a 12 GB object from one SATA disk to another takes several minutes, too long for a Ceph cluster with a lot of clients to survive.

The lesson of this story is: don't forget to shard your bucket!!!

---------------------------------------------------------------------------------------------------------------

Yesterday we encountered this bug. One OSD was looping on:

2018-01-03 16:20:59.148121 7f011a6a1700 0 log_channel(cluster) log [WRN] : slow request 30.254269 seconds old, received at 2018-01-03 16:20:28.883837: osd_op(client.48285929.0:14601958 35.8abfc02e .dir.0a3e5369-ff79-4f7d-b0b6-79c5a75b1759.29113876.1 [call rgw.bucket_prepare_op] snapc 0=[] ondisk+write+known_if_redirected e359833) currently waiting for degraded object
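The .dir object in the log excerpt above is exactly such an index marker. As a side note, the sizing estimate from the explanation can be checked with a quick back-of-envelope calculation (a sketch only; the 50 million objects, ~250 bytes per entry, and 100,000 objects per shard are the figures quoted above):

```shell
objects=50000000          # objects in the bucket (figure from this incident)
bytes_per_entry=250       # upper bound on one index entry (figure from this incident)
per_shard=100000          # recommended objects per index shard

index_bytes=$(( objects * bytes_per_entry ))
echo "index size: $(( index_bytes / 1000 / 1000 / 1000 )) GB"
echo "recommended shards: $(( objects / per_shard ))"
```

That gives roughly 12 GB for the index and about 500 shards. Depending on the release, such a bucket can be resharded offline with radosgw-admin bucket reshard --bucket=... --num-shards=... (the bucket name and shard count being whatever fits your cluster).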
The requests on this OSD.150 quickly became blocked:

2018-01-03 16:25:56.241064 7f011a6a1700 0 log_channel(cluster) log [WRN] : 20 slow requests, 1 included below; oldest blocked for > 327.357139 secs
2018-01-03 16:30:19.299288 7f011a6a1700 0 log_channel(cluster) log [WRN] : 45 slow requests, 1 included below; oldest blocked for > 590.415387 secs
...
2018-01-03 16:46:04.900204 7f011a6a1700 0 log_channel(cluster) log [WRN] : 100 slow requests, 2 included below; oldest blocked for > 1204.060056 secs

while it was still looping:

2018-01-03 16:46:04.900220 7f011a6a1700 0 log_channel(cluster) log [WRN] : slow request 123.294762 seconds old, received at 2018-01-03 16:44:01.605320 : osd_op(client.48285929.0:14605228 35.8abfc02e .dir.0a3e5369-ff79-4f7d-b0b6-79c5a75b1759.29113876.1 [call rgw.bucket_complete_op] snapc 0=[] ack+ondisk+write+known_if_redirected e359833) currently waiting for degraded object

All these requests were blocked on OSD.150, and a lot of VMs attached to Ceph were hanging. The degraded object was .dir.0a3e5369-ff79-4f7d-b0b6-79c5a75b1759.29113876.1 in PG 35.2e. This PG was located on 4 OSDs, and the object had a size of 0 on all 4 of them. A "ceph pg 35.2e query" never returned a response. Killing OSD.150 only moved the problem: the requests became blocked on the new primary.

I found the relatively new bug #22072, which looks like mine, but there was no response from the Ceph team. I finally tried the same solution, "rados rm -p pool/degraded_object", but the command did not return either, and I stopped it after 15 minutes. A few minutes later, the 4 OSDs holding PG 35.2e suddenly restarted and the problem was solved: the object had been deleted on all 4 OSDs. In any case it led to a production outage, I have no idea what produced the "degraded object", and I am not sure whether the fix came from my command or from some internal process.
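For anyone hitting the same loop, the investigation above boils down to a few commands (a sketch only: these need a live cluster, the PG and OSD IDs are the ones from this incident, and <pool> stands in for the pool name, which the original message did not spell out):

```shell
# Map the PG holding the degraded object to the OSDs that serve it
ceph pg map 35.2e

# Query the PG state (in our case this command never returned)
ceph pg 35.2e query

# Inspect what the primary is blocked on (run on the node hosting osd.150)
ceph daemon osd.150 dump_blocked_ops

# Last resort, as tried above: remove the degraded index marker directly
rados -p <pool> rm .dir.0a3e5369-ff79-4f7d-b0b6-79c5a75b1759.29113876.1
```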
At this time we are still trying to repair the filesystems of some of the VMs attached to Ceph, and I have to explain that this whole production outage came from one empty object... The real problem is: why was Ceph unable to handle this "degraded object", looping on it instead and blocking all the requests on OSD.150?

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com