Re: Cleaning Up Failed Multipart Uploads

Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> · Wed, 3 Aug 2016 10:46:42 -0700

On Wed, Aug 3, 2016 at 10:10 AM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
This may just be me having a conversation with myself, but maybe this will be helpful to someone else.

Having dug and dug and dug through the code, I've come to the following realizations:
1. When a multipart upload is completed, the function list_multipart_parts in rgw_op.cc is called.  This seems to be the start of the problems, as it will only return those parts in the 'multipart' namespace that include the upload id in the name, irrespective of how many copies of parts exist on the system with non-upload id prefixes.
2. In the course of writing to the OSDs, a list (remove_objs) is processed in cls_rgw.cc:unaccount_entry(), causing bucket stats to be decremented.
3. These decremented stats are written to the bucket's index entry/entries in .rgw.buckets.index via the CEPH_OSD_OP_OMAPSETHEADER case in ReplicatedPG::do_osd_ops.
So this explains why manually removing the multipart entries from .rgw.buckets and cleaning the shadow entries in .rgw.buckets.index does not cause the bucket's stats to be updated.  What I don't know how to do is force an update of the bucket's stats from the CLI.  I can retrieve the omap header from each of the bucket's shards in .rgw.buckets.index, but I don't have the first clue how to read the data or rebuild it into something valid.  I've searched the docs and mailing list archives, but I didn't find any solution to this problem.  For what it's worth, I've tried 'bucket check' with all combinations of '--check-objects' and '--fix' after cleaning up .rgw.buckets and .rgw.buckets.index.
From a long-term perspective, it seems there are two possible fixes here:
1. Update the logic in list_multipart_parts to return all the parts for a multipart object, so that *all* parts in the 'multipart' namespace can be properly removed.
2. Update the logic in RGWPutObj::execute() to not restart a write if the put_data_and_throttle() call returns -EEXIST but instead put the data in the original file(s).
While I think 2 would involve the least amount of yak shaving with the multipart logic since the MP logic already assumes a happy path where all objects have a prefix of the multipart upload id, I'm all but certain this is going to horribly break many other parts of the system that I don't fully understand.

#2 is dangerous. That was the original behavior, and it is racy and *will* lead to data corruption.  OTOH, I don't think #1 is an easy option. We only keep a single entry per part, so we don't really have a good way to see all the uploaded pieces. We could extend the meta object to keep a record of all the uploaded parts and, at the end, when assembling everything, remove the parts that aren't part of the final assembly.
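
The bookkeeping suggested above could look roughly like the following. This is a minimal Python sketch of the idea only, not RGW code: the class UploadMeta, its methods, and the notion of a per-upload "prefix" string are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class UploadMeta:
    """Hypothetical stand-in for the multipart meta object (not RGW code)."""

    def __init__(self):
        # part number -> every prefix under which that part was uploaded
        self.uploads = defaultdict(list)

    def record_part(self, part_num, prefix):
        """Remember each upload of a part instead of only the latest one."""
        self.uploads[part_num].append(prefix)

    def leftovers(self, final_assembly):
        """Return (part_num, prefix) pairs that are not in the final object.

        final_assembly maps part number -> prefix used by the completed
        object; pass an empty dict for an aborted upload.
        """
        stale = []
        for part_num, prefixes in self.uploads.items():
            keep = final_assembly.get(part_num)
            stale.extend((part_num, p) for p in prefixes if p != keep)
        return sorted(stale)


meta = UploadMeta()
meta.record_part(1, "upload-id")    # first attempt at part 1
meta.record_part(1, "retry-a")      # client re-uploads part 1
meta.record_part(2, "upload-id")
print(meta.leftovers({1: "retry-a", 2: "upload-id"}))
```

With every upload recorded, completing or aborting the upload has a full list of pieces to remove, rather than only the most recent entry per part.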
The good news is that the assembly of the multipart object is being done correctly; what I can't figure out is how it knows about the non-upload id prefixes when creating the metadata on the multipart object in .rgw.buckets.  My best guess is that it's copying the metadata from the 'meta' object in .rgw.buckets.extra (which is correctly updated with the new part prefixes after each successful upload), but I haven't absolutely confirmed that.

Yeah, something along these lines.

If one of the developer folk that are more familiar with this could weigh in, I would be greatly appreciative. 

btw, did you try running the radosgw-admin orphans find tool?

Yehuda

Brian

On Tue, Aug 2, 2016 at 8:59 AM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
I am actively working through the code and debugging everything.  I figure the issue is with how RGW is listing the parts of a multipart upload when it completes or aborts the upload (read: it's not getting *all* the parts, just those that are either most recent or tagged with the upload id).  As soon as I can figure out a patch, or, more importantly, how to manually address the problem, I will respond with instructions.

The reported bug contains detailed instructions on reproducing the problem, so it's trivial to reproduce and test on a small and/or new cluster.

Brian

On Tue, Aug 2, 2016 at 8:53 AM, Tyler Bishop <tyler.bishop@xxxxxxxxxxxxxxxxx> wrote:
We're having the same issues.  I have a 1200TB pool at 90% utilization; however, disk utilization is only 40%.

  Tyler Bishop
 Chief Technical Officer
 513-299-7108 x10
Tyler.Bishop@xxxxxxxxxxxxxxxxx

From: "Brian Felton" <bjfelton@xxxxxxxxx>
To: "ceph-users" <ceph-users@xxxxxxxx>
Sent: Wednesday, July 27, 2016 9:24:30 AM
Subject: [ceph-users] Cleaning Up Failed Multipart Uploads

Greetings,

Background: If an object storage client re-uploads parts to a multipart object, RadosGW does not clean up all of the parts properly when the multipart upload is aborted or completed.  You can read all of the gory details (including reproduction steps) in this bug report: http://tracker.ceph.com/issues/16767.

My setup: Hammer 0.94.6 cluster only used for S3-compatible object storage.  RGW stripe size is 4MiB.

My problem: I have buckets that are reporting TB more utilization (and, in one case, 200k more objects) than they should report.  I am trying to remove the detritus from the multipart uploads, but removing the leftover parts directly from the .rgw.buckets pool is having no effect on bucket utilization (i.e. neither the object count nor the space used is declining).

To give an example, I have a client that uploaded a very large multipart object (8000 15MiB parts).  Due to a bug in the client, it uploaded each of the 8000 parts 6 times.  After the sixth attempt, it gave up and aborted the upload, at which point RGW removed the 8000 parts from the sixth attempt.  When I list the bucket's contents with radosgw-admin (radosgw-admin bucket list --bucket=<bucket> --max-entries=<size of bucket>), I see all of the object's 8000 parts five separate times, each under a namespace of 'multipart'.  
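
As a rough check on the scale of the leftovers in this example, the arithmetic can be sketched as below. It assumes exactly five leftover copies of each of the 8000 parts at 15 MiB apiece, as described above; the variable names are illustrative only.

```python
parts = 8000             # parts in the multipart object
part_size_mib = 15       # size of each part, in MiB
leftover_attempts = 5    # six uploads, of which RGW removed only the last

leftover_parts = parts * leftover_attempts
leftover_mib = leftover_parts * part_size_mib
leftover_gib = leftover_mib / 1024

print(leftover_parts)          # leftover 'multipart' entries in the listing
print(leftover_mib)            # leftover data in MiB
print(round(leftover_gib, 1))  # leftover data in GiB
```

That is 40000 stale part entries and a little under 600 GiB of data from this one object, before replication.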

Since the multipart upload was aborted, I can't remove the object by name via the S3 interface.  Since my RGW stripe size is 4MiB, I know that each part of the object will be stored across 4 entries in the .rgw.buckets pool -- 4 MiB in a 'multipart' file, and 4, 4, and 3 MiB in three successive 'shadow' files.  I've created a script to remove these parts (rados -p .rgw.buckets rm <bucket_id>__multipart_<object+prefix>.<part> and rados -p .rgw.buckets rm <bucket_id>__shadow_<object+prefix>.<part>.[1-3]).  The removes are completing successfully (in that additional attempts to remove the object result in a failure), but I'm not seeing any decrease in the bucket's space used, nor am I seeing a decrease in the bucket's object count.  In fact, if I do another 'bucket list', all of the removed parts are still included.
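
The layout described above can be sketched as follows, which is roughly what a removal script would iterate over. The function name and the example bucket id and prefix are hypothetical; the naming pattern is taken from the rados commands quoted above and may not hold for other RGW versions or configurations.

```python
def rados_objects_for_part(bucket_id, obj_prefix, part, part_size_mib,
                           stripe_mib=4):
    """Return (rados_name, size_mib) pairs for one uploaded part.

    Assumes the first stripe is stored in the 'multipart' entry and the
    remainder in numbered 'shadow' entries, as described in this message.
    """
    remaining = part_size_mib
    head = min(stripe_mib, remaining)
    objects = [(f"{bucket_id}__multipart_{obj_prefix}.{part}", head)]
    remaining -= head

    index = 1
    while remaining > 0:
        size = min(stripe_mib, remaining)
        objects.append((f"{bucket_id}__shadow_{obj_prefix}.{part}.{index}", size))
        remaining -= size
        index += 1
    return objects


# Placeholder bucket id and object prefix, for illustration only.
for name, size in rados_objects_for_part("BUCKET_ID", "OBJECT.PREFIX", 1, 15):
    print(name, size)
```

For a 15 MiB part and a 4 MiB stripe this yields one 'multipart' entry and three 'shadow' entries of 4, 4, and 3 MiB.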

I've looked at the output of 'gc list --include-all', and the removed parts are never showing up for garbage collection.  Garbage collection is otherwise functioning normally and will successfully remove data for any object properly removed via the S3 interface.

I've also gone so far as to write a script to list the contents of bucket shards in the .rgw.buckets.index pool, check for the existence of the entry in .rgw.buckets, and remove entries that cannot be found, but that is also failing to decrement the size/object count counters.
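
The core of that check can be sketched as below. This is an illustration rather than the script itself: object_exists and to_rados_name are hypothetical callables standing in for a lookup against .rgw.buckets and for the mapping from an index key to a RADOS object name, neither of which is spelled out here.

```python
def missing_index_entries(index_keys, object_exists, to_rados_name):
    """Return index keys whose backing object can no longer be found."""
    return [key for key in index_keys
            if not object_exists(to_rados_name(key))]


# In-memory stand-ins for the two pools, for illustration only.
data_pool = {"BUCKET_ID__multipart_obj.1"}
index_keys = ["_multipart_obj.1", "_multipart_obj.2"]

stale = missing_index_entries(
    index_keys,
    object_exists=lambda name: name in data_pool,
    to_rados_name=lambda key: "BUCKET_ID_" + key,   # assumed mapping
)
print(stale)
```

Note that removing such entries only changes the index keys; as this thread goes on to establish, the bucket stats live in the omap header of each index shard, which is why the counters do not move.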

What am I missing here?  Where, aside from .rgw.buckets and .rgw.buckets.index, is RGW looking to determine object count and space used for a bucket?

Many thanks to any and all who can assist.

Brian Felton

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
