Hi everyone,

I found myself in a situation where dynamic resharding kicking in while data was being written to a bucket containing a little more than 5M objects corrupted the data and rendered the entire bucket unusable. I tried several solutions to fix this bucket and ended up ditching it.

What I tried before going the hardcore way:

* radosgw-admin reshard list -> didn't list any reshard process going on at the time, but
* radosgw-admin reshard cancel --bucket $bucket -> cancelled the reshard process that was going on in the background; the overall load on the cluster dropped after a few minutes

At this point I decided to start from scratch, since a lot of the data was corrupted anyway by a broken application version writing to this bucket.

* aws s3 rm --recursive s3://$bucket -> deleted most objects, but 13k objects consuming around 500G in total weren't deleted, and re-running the same command didn't change that
* aws s3 rb s3://$bucket -> that obviously didn't work since the bucket isn't empty
* radosgw-admin bucket rm --bucket $bucket -> "ERROR: could not remove non-empty bucket $bucket" and "ERROR: unable to remove bucket(39) Directory not empty"
* radosgw-admin bucket rm --bucket $bucket --purge-objects -> "No such file or directory"

After some days of helpless Googling and trying various combinations of radosgw-admin bucket, bi, reshard and other commands that all did pretty much nothing, I ran

* rados -p $pool ls | tr '\t' '\n' | fgrep $bucket_marker_id | tr '\n' '\0' | xargs -0 -n 128 -P 32 rados -p $pool rm

which deleted the orphaned objects from the rados pool and freed the ~500G of used data (a scripted version of this, including the marker lookup, is sketched further down). Then:

* radosgw-admin bucket check --bucket $bucket -> listed some objects in an array, probably the lost ones that weren't deleted
* radosgw-admin bucket check --bucket $bucket --fix (also with --check-objects) -> didn't do anything
* radosgw-admin bi purge --bucket=$bucket --yes-i-really-mean-it -> this deleted the bucket index
* radosgw-admin bucket list -> bucket still appeared in the list
* aws s3 ls -> bucket still appeared in the list
* aws s3 rb $bucket -> "NoSuchBucket"
* aws s3 rm --recursive s3://$bucket -> no error or output
* aws s3 rb $bucket -> no error
* aws s3 ls -> bucket is no longer in the list

At this point, I decided to restart all RGW frontend instances to make sure nothing was being cached. To confirm that it's really gone now, let's check everything (these checks are also written out as a script further down):

* aws s3 ls -> Check.
* radosgw-admin bucket list -> Check.
* radosgw-admin metadata get bucket:$bucket -> Check.
* radosgw-admin bucket stats --bucket $bucket -> Check.

But:

* radosgw-admin reshard list -> it's doing a reshard; I stopped that for now.
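For reference, this is roughly how the orphan cleanup above could be scripted (only a rough sketch, not exactly what I ran: it assumes jq is available, the pool name is a placeholder for whatever "radosgw-admin zone get" reports for your zone, and the JSON path to the marker is what I see on Luminous and may differ on other releases):

    #!/usr/bin/env bash
    # Rough sketch of the orphan cleanup above -- not a polished tool.
    set -euo pipefail

    bucket="..."       # the bucket name ($bucket above)
    data_pool="..."    # the zone's data pool ($pool above), see "radosgw-admin zone get"

    # Every RADOS object belonging to the bucket carries its marker id
    # ($bucket_marker_id above) in the object name. JSON path as seen on
    # Luminous; it may differ on other releases.
    marker=$(radosgw-admin metadata get "bucket:${bucket}" | jq -r '.data.bucket.marker')

    # List the data pool, keep only objects carrying that marker and delete
    # them in parallel: 128 objects per rados invocation, 32 invocations at a time.
    rados -p "${data_pool}" ls \
        | grep -F "${marker}" \
        | tr '\n' '\0' \
        | xargs -0 -n 128 -P 32 rados -p "${data_pool}" rm

The -n 128 -P 32 batching is the same as in the one-liner above; tune it to whatever your cluster tolerates.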
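And the final "is it really gone" checks, plus a look for stale bucket instances and leftover index shard objects, written out the same way (again only a sketch and partly an assumption on my side: the index pool name is a placeholder, and the marker id has to be the one noted down before the metadata was purged; everything here should come back empty or report that the bucket does not exist):

    #!/usr/bin/env bash
    # Sketch: confirm no trace of the old bucket is left behind.
    bucket="..."         # the old bucket name ($bucket above)
    marker="..."         # the old $bucket_marker_id, saved before purging the metadata
    index_pool="..."     # the zone's index pool, see "radosgw-admin zone get"

    # S3 view and RGW metadata -- neither should know the bucket anymore.
    aws s3 ls | grep -F "$bucket" || echo "ok: not in aws s3 ls"
    radosgw-admin bucket list | grep -F "$bucket" || echo "ok: not in bucket list"
    radosgw-admin metadata get "bucket:${bucket}" || echo "ok: no bucket metadata"
    radosgw-admin bucket stats --bucket "$bucket" || echo "ok: no bucket stats"

    # Stale bucket instances and leftover index shard objects, matched by name/marker.
    radosgw-admin metadata list bucket.instance | grep -F "$bucket" || echo "ok: no bucket instances"
    rados -p "$index_pool" ls | grep -F "$marker" || echo "ok: no index shard objects"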
However, even after stopping that reshard, all RGW frontend instances kept logging this repeatedly for some minutes:

> block_while_resharding ERROR: bucket is still resharding, please retry
> RGWWatcher::handle_error cookie xxx err (107) Transport endpoint is not connected
> NOTICE: resharding operation on bucket index detected, blocking
> NOTICE: resharding operation on bucket index detected, blocking
> block_while_resharding ERROR: bucket is still resharding, please retry
> block_while_resharding ERROR: bucket is still resharding, please retry
> NOTICE: resharding operation on bucket index detected, blocking
> NOTICE: resharding operation on bucket index detected, blocking
> RGWWatcher::handle_error cookie xxx err (107) Transport endpoint is not connected
> RGWWatcher::handle_error cookie xxx err (107) Transport endpoint is not connected
> block_while_resharding ERROR: bucket is still resharding, please retry
> RGWWatcher::handle_error cookie xxx err (107) Transport endpoint is not connected
> NOTICE: resharding operation on bucket index detected, blocking
> block_while_resharding ERROR: bucket is still resharding, please retry
> NOTICE: resharding operation on bucket index detected, blocking

One of the RGW frontend instances crashed during this; all others seem to be running fine at the moment:

> 2018-04-13 23:19:41.599307 7f35c6e00700 0 ERROR: flush_read_list(): d->client_cb->handle_data() returned -5
> terminate called after throwing an instance of 'ceph::buffer::bad_alloc'
>   what(): buffer::bad_alloc
> *** Caught signal (Aborted) **
>  in thread 7f35f341d700 thread_name:msgr-worker-0

Creating and writing to a bucket with the same name works again, though:

* aws s3 mb s3://$bucket -> this command succeeded
* aws s3 cp $file s3://$bucket/$file -> this command succeeded as well

My question at this point would be: how much have I damaged this cluster from an RGW point of view, and is it possible to undo that damage? If I want to proceed with cleaning up the old bucket data, where should I continue, and how would I verify that everything that might further damage the cluster at a later point is really gone?

Thanks in advance for any help regarding this, and yes, I know that I should have asked on the mailing list first before doing anything stupid. Please let me know if I missed any information and I'll add it asap.

--
Best regards
Katie Holly