Hi everyone,

I found myself in a situation where dynamic resharding kicking in while data was being written to a bucket containing a little more than 5M objects corrupted the data and rendered the entire bucket unusable. I tried several solutions to fix this bucket and ended up ditching it.

What I tried before going the hardcore way:

* radosgw-admin reshard list -> didn't list any reshard process going on at the time, but
* radosgw-admin reshard cancel --bucket $bucket -> cancelled the reshard process that was going on in the background; the overall load on the cluster dropped after a few minutes

At this point I decided to start from scratch, since a lot of the data was corrupted anyway by a broken application version writing to this bucket.

* aws s3 rm --recursive s3://$bucket -> deleted most objects, but 13k objects consuming around 500G in total weren't deleted, and re-running the same command didn't change that
* aws s3 rb s3://$bucket -> that obviously didn't work since the bucket isn't empty
* radosgw-admin bucket rm --bucket $bucket -> "ERROR: could not remove non-empty bucket $bucket" and "ERROR: unable to remove bucket(39) Directory not empty"
* radosgw-admin bucket rm --bucket $bucket --purge-objects -> "No such file or directory"

After some days of helpless Googling and trying various combinations of radosgw-admin bucket, bi, reshard and other commands that all did pretty much nothing, I ran

* rados -p $pool ls | tr '\t' '\n' | fgrep $bucket_marker_id | tr '\n' '\0' | xargs -0 -n 128 -P 32 rados -p $pool rm

which deleted the orphaned objects from the rados pool and freed the ~500G of used data (a scripted version of this, including the marker lookup, is sketched further down). Then:

* radosgw-admin bucket check --bucket $bucket -> listed some objects in an array, probably the lost ones that weren't deleted
* radosgw-admin bucket check --bucket $bucket --fix (also with --check-objects) -> didn't do anything
* radosgw-admin bi purge --bucket=$bucket --yes-i-really-mean-it -> this deleted the bucket index
* radosgw-admin bucket list -> bucket still appeared in the list
* aws s3 ls -> bucket still appeared in the list
* aws s3 rb $bucket -> "NoSuchBucket"
* aws s3 rm --recursive s3://$bucket -> no error or output
* aws s3 rb $bucket -> no error
* aws s3 ls -> bucket is no longer in the list

At this point, I decided to restart all RGW frontend instances to make sure nothing was being cached. To confirm that it's really gone now, let's check everything (these checks are also written out as a script further down):

* aws s3 ls -> Check.
* radosgw-admin bucket list -> Check.
* radosgw-admin metadata get bucket:$bucket -> Check.
* radosgw-admin bucket stats --bucket $bucket -> Check.

But:

* radosgw-admin reshard list -> it's doing a reshard; I stopped that for now.
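For reference, this is roughly how the orphan cleanup above could be scripted (only a rough sketch, not exactly what I ran: it assumes jq is available, the pool name is a placeholder for whatever "radosgw-admin zone get" reports for your zone, and the JSON path to the marker is what I see on Luminous and may differ on other releases):

    #!/usr/bin/env bash
    # Rough sketch of the orphan cleanup above -- not a polished tool.
    set -euo pipefail

    bucket="..."       # the bucket name ($bucket above)
    data_pool="..."    # the zone's data pool ($pool above), see "radosgw-admin zone get"

    # Every RADOS object belonging to the bucket carries its marker id
    # ($bucket_marker_id above) in the object name. JSON path as seen on
    # Luminous; it may differ on other releases.
    marker=$(radosgw-admin metadata get "bucket:${bucket}" | jq -r '.data.bucket.marker')

    # List the data pool, keep only objects carrying that marker and delete
    # them in parallel: 128 objects per rados invocation, 32 invocations at a time.
    rados -p "${data_pool}" ls \
        | grep -F "${marker}" \
        | tr '\n' '\0' \
        | xargs -0 -n 128 -P 32 rados -p "${data_pool}" rm

The -n 128 -P 32 batching is the same as in the one-liner above; tune it to whatever your cluster tolerates.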
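And the final "is it really gone" checks, plus a look for stale bucket instances and leftover index shard objects, written out the same way (again only a sketch and partly an assumption on my side: the index pool name is a placeholder, and the marker id has to be the one noted down before the metadata was purged; everything here should come back empty or report that the bucket does not exist):

    #!/usr/bin/env bash
    # Sketch: confirm no trace of the old bucket is left behind.
    bucket="..."         # the old bucket name ($bucket above)
    marker="..."         # the old $bucket_marker_id, saved before purging the metadata
    index_pool="..."     # the zone's index pool, see "radosgw-admin zone get"

    # S3 view and RGW metadata -- neither should know the bucket anymore.
    aws s3 ls | grep -F "$bucket" || echo "ok: not in aws s3 ls"
    radosgw-admin bucket list | grep -F "$bucket" || echo "ok: not in bucket list"
    radosgw-admin metadata get "bucket:${bucket}" || echo "ok: no bucket metadata"
    radosgw-admin bucket stats --bucket "$bucket" || echo "ok: no bucket stats"

    # Stale bucket instances and leftover index shard objects, matched by name/marker.
    radosgw-admin metadata list bucket.instance | grep -F "$bucket" || echo "ok: no bucket instances"
    rados -p "$index_pool" ls | grep -F "$marker" || echo "ok: no index shard objects"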
However, even after stopping that reshard, all RGW frontend instances kept logging this repeatedly for some minutes:

> block_while_resharding ERROR: bucket is still resharding, please retry
> RGWWatcher::handle_error cookie xxx err (107) Transport endpoint is not connected
> NOTICE: resharding operation on bucket index detected, blocking
> NOTICE: resharding operation on bucket index detected, blocking
> block_while_resharding ERROR: bucket is still resharding, please retry
> block_while_resharding ERROR: bucket is still resharding, please retry
> NOTICE: resharding operation on bucket index detected, blocking
> NOTICE: resharding operation on bucket index detected, blocking
> RGWWatcher::handle_error cookie xxx err (107) Transport endpoint is not connected
> RGWWatcher::handle_error cookie xxx err (107) Transport endpoint is not connected
> block_while_resharding ERROR: bucket is still resharding, please retry
> RGWWatcher::handle_error cookie xxx err (107) Transport endpoint is not connected
> NOTICE: resharding operation on bucket index detected, blocking
> block_while_resharding ERROR: bucket is still resharding, please retry
> NOTICE: resharding operation on bucket index detected, blocking

One of the RGW frontend instances crashed during this; all others seem to be running fine at the moment:

> 2018-04-13 23:19:41.599307 7f35c6e00700 0 ERROR: flush_read_list(): d->client_cb->handle_data() returned -5
> terminate called after throwing an instance of 'ceph::buffer::bad_alloc'
>   what(): buffer::bad_alloc
> *** Caught signal (Aborted) **
>  in thread 7f35f341d700 thread_name:msgr-worker-0

Creating and writing to a bucket with the same name works again, though:

* aws s3 mb s3://$bucket -> this command succeeded
* aws s3 cp $file s3://$bucket/$file -> this command succeeded as well

My question at this point would be: how much have I damaged this cluster from an RGW point of view, and is it possible to undo that damage? If I want to proceed with cleaning up the old bucket data, where should I continue, and how would I verify that everything that might further damage the cluster at a later point is really gone?

Thanks in advance for any help regarding this, and yes, I know that I should have asked on the mailing list first before doing anything stupid. Please let me know if I missed any information and I'll add it asap.

--
Best regards
Katie Holly