Re: RGW: Truncated objects and bad error handling

2017-06-01 18:52 GMT+00:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:
>
>
> On Thu, Jun 1, 2017 at 2:03 AM Jens Rosenboom <j.rosenboom@xxxxxxxx> wrote:
>>
>> On a large Hammer-based cluster (> 1 Gobjects) we are seeing a small
>> number of objects being truncated. All of these objects are between
>> 512kB and 4MB in size and they were not uploaded as multipart, so the
>> first 512kB are stored in the head object and the following chunks
>> should be in tail objects named <bucket_id>__shadow_<tag>_N, but the
>> latter seem to go missing sometimes. The PUT operations for these
>> objects were logged as successful (HTTP code 200), so I currently
>> have two hypotheses as to what might be happening:
>>
>> 1. The object is received by the radosgw process, the head object is
>> written successfully, then the write for the tail object somehow
>> fails. So the question is whether this is possible or whether radosgw
>> will always wait until all operations have completed successfully
>> before returning the 200. This blog [1] at least mentions some
>> asynchronous operations.
>>
>> 2. The full object is written correctly, but the tail objects are
>> getting deleted somehow afterwards. This might happen during garbage
>> collection if there was a collision between the tail object names for
>> two objects, but again I'm not sure whether this is possible.
>>
>> So the question is whether anyone else has seen this issue, and also
>> whether it may have been fixed in Jewel or later.
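
(As a side note, for anyone who wants to check individual objects on
their own cluster: below is a minimal sketch using the librados Python
bindings. The pool name, bucket ID, tag, and chunk count are
hypothetical; in practice they come from the manifest that
`radosgw-admin object stat --bucket=... --object=...` prints.)

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('.rgw.buckets')  # hypothetical (Hammer-era) data pool

    bucket_id = 'default.12345.6'   # hypothetical, from the manifest
    tag = 'abcdefghijklm'           # hypothetical, from the manifest
    num_chunks = 1                  # ceil((size - 512KiB) / 4MiB) for this layout

    # Stat each expected tail object; a missing one means the S3 object
    # will come back truncated.
    for n in range(1, num_chunks + 1):
        name = '%s__shadow_%s_%d' % (bucket_id, tag, n)
        try:
            size, mtime = ioctx.stat(name)
            print('%s: %d bytes' % (name, size))
        except rados.ObjectNotFound:
            print('%s: MISSING (object is truncated)' % name)

    ioctx.close()
    cluster.shutdown()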

So I think I found out what is happening, and it seems to be a pretty
severe bug: when an object is copied, the copy is apparently created
with the same prefix for its shadow objects. So when the copied object
is deleted afterwards, garbage collection will delete the shared
shadow objects, leaving the original object truncated.
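
If that is right, the following should reproduce it end-to-end. This
is a minimal sketch assuming boto3 and a hypothetical RGW endpoint
with hypothetical credentials; the GC timing depends on your rgw gc
settings:

    import boto3

    # Hypothetical endpoint and credentials; any S3 client works the same way.
    s3 = boto3.client(
        's3',
        endpoint_url='http://rgw.example.com:7480',
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
    )

    bucket = 'repro-bucket'
    s3.create_bucket(Bucket=bucket)

    # Non-multipart upload larger than 512 KiB, so RGW stores a head
    # object plus at least one <bucket_id>__shadow_<tag>_N tail object.
    s3.put_object(Bucket=bucket, Key='original', Body=b'x' * (600 * 1024))

    # Server-side copy; per the bug, the copy's manifest apparently
    # points at the same shadow objects as the original.
    s3.copy_object(Bucket=bucket, Key='copy',
                   CopySource={'Bucket': bucket, 'Key': 'original'})

    # Deleting the copy schedules its tail objects for garbage
    # collection, which then removes the shadow objects that the
    # original still references.
    s3.delete_object(Bucket=bucket, Key='copy')

    # After the next GC cycle (or a manual `radosgw-admin gc process`),
    # reading the original should show the truncation.
    body = s3.get_object(Bucket=bucket, Key='original')['Body'].read()
    assert len(body) == 600 * 1024, 'object came back truncated!'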

>> The second issue is what happens when a client tries to access such
>> a truncated object. The radosgw first answers with the full headers
>> and a content-length of e.g. 600000, then sends the first chunk of
>> data (524288 bytes) from the head object. After that it tries to
>> read the first tail object, but receives an error -2 (file not
>> found). radosgw now tries to send a 404 status with a NoSuchKey
>> error in the XML body, but of course this is too late, and the
>> client sees it as part of the object data. After that, the
>> connection stays open, and the client waits for the rest of the
>> object to be sent, eventually timing out with an error. Or, if the
>> original object was only slightly larger than 512k, the client will
>> treat the appended 404 response as the remainder of the data and
>> continue with corrupted content, hopefully checking the MD5 sum and
>> noticing the issue. This behaviour is still unchanged at least in
>> Jewel, and you can easily reproduce it by manually deleting the
>> shadow object from the bucket pool after uploading an object of the
>> appropriate size.
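
For what it's worth, a client can at least detect this corruption by
comparing the received byte count and MD5 against the response
headers. Here is a minimal sketch, assuming Python's `requests`
library, a hypothetical URL, and a non-multipart upload so that the
ETag is the plain MD5 of the body:

    import hashlib
    import requests

    url = 'http://rgw.example.com:7480/repro-bucket/original'  # hypothetical
    resp = requests.get(url, stream=True, timeout=30)

    expected_len = int(resp.headers['Content-Length'])
    etag = resp.headers.get('ETag', '').strip('"')

    md5 = hashlib.md5()
    received = 0
    try:
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            md5.update(chunk)
            received += len(chunk)
    except requests.exceptions.RequestException:
        pass  # a stalled or aborted transfer also indicates truncation

    # A truncated object either falls short of Content-Length or, if the
    # 404 XML got appended into the body, fails the MD5/ETag check.
    if received != expected_len or (etag and md5.hexdigest() != etag):
        print('truncated/corrupted: got %d of %d bytes' % (received, expected_len))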
>>
>> I have created a bug report for the first issue [2]; please let me
>> know whether you would like a separate ticket for the second one.
>
>
>
> No idea what's going on here but they definitely warrant separate issues.
> The second one is about handling error states; the first is about inducing
> them. :)

http://tracker.ceph.com/issues/20166