rgw resharding operation seemingly won't end

Hi all, 

We recently upgraded from Ceph 12.2.0 to 12.2.1 (Luminous) and are now seeing issues running radosgw. Specifically, an automatically triggered resharding operation won't end, despite the jobs having been cancelled (radosgw-admin reshard cancel). I have also disabled dynamic resharding in ceph.conf for the time being.
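
For reference, the cancel command and the ceph.conf change look roughly like this (bucket name redacted; the section name below is just an example for one of our gateways, and I'm going from memory on the exact option name):

[root@objproxy02 ~]# radosgw-admin reshard cancel --bucket=$REDACTED_BUCKET_NAME$

# ceph.conf on each rgw host
[client.rgw.objproxy02]
rgw_dynamic_resharding = false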


[root@objproxy02 ~]# radosgw-admin reshard list
[]

Two buckets were still reported by `radosgw-admin reshard list` before our RGW frontends paused recently (and only came back after a service restart). Neither of these buckets can be written to at the moment.

2017-10-06 22:41:19.547260 7f90506e9700 0 block_while_resharding ERROR: bucket is still resharding, please retry 
2017-10-06 22:41:19.547411 7f90506e9700 0 WARNING: set_req_state_err err_no=2300 resorting to 500 
2017-10-06 22:41:19.547729 7f90506e9700 0 ERROR: RESTFUL_IO(s)->complete_header() returned err=Input/output error 
2017-10-06 22:41:19.548570 7f90506e9700 1 ====== req done req=0x7f90506e3180 op status=-2300 http_status=500 ====== 
2017-10-06 22:41:19.548790 7f90506e9700 1 civetweb: 0x55766d111000: $MY_IP_HERE$ - - [06/Oct/2017:22:33:47 -0400] "PUT /$REDACTED_BUCKET_NAME$/$REDACTED_KEY_NAME$ HTTP/1.1" 1 0 - Boto3/1.4.7 Python/2.7.12 Linux/4.9.43-17.39.amzn1.x86_64 exec-env/AWS_Lambda_python2.7 Botocore/1.7.2 Resource 
[.. slightly later in the logs..]
2017-10-06 22:41:53.516272 7f90406c9700 1 rgw realm reloader: Frontends paused 
2017-10-06 22:41:53.528703 7f907893f700 0 ERROR: failed to clone shard, completion_mgr.get_next() returned ret=-125 
2017-10-06 22:44:32.049564 7f9074136700 0 ERROR: keystone revocation processing returned error r=-22 
2017-10-06 22:59:32.059222 7f9074136700 0 ERROR: keystone revocation processing returned error r=-22 
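
In case it helps with diagnosis, I assume the per-bucket state can be inspected with something like the following (bucket name redacted again); happy to post the output if it's useful:

[root@objproxy02 ~]# radosgw-admin reshard status --bucket=$REDACTED_BUCKET_NAME$
[root@objproxy02 ~]# radosgw-admin bucket stats --bucket=$REDACTED_BUCKET_NAME$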

Can anyone advise on the best way to clear the current resharding state on these buckets and avoid this situation in the future?


Some other details:
 - 3 rgw instances
 - Ceph Luminous 12.2.1
 - 584 active OSDs, rgw bucket index is on Intel NVMe OSDs
 

Thanks,
Ryan Leimenstoll
rleimens@xxxxxxxxxxxxxx
University of Maryland Institute for Advanced Computer Studies


