Re: RGW: Truncated objects and bad error handling

On Thu, Jun 1, 2017 at 2:03 AM Jens Rosenboom <j.rosenboom@xxxxxxxx> wrote:
On a large Hammer-based cluster (> 1 billion objects) we are seeing a
small number of objects being truncated. All of these objects are
between 512kB and 4MB in size and were not uploaded as multipart, so
the first 512kB is stored in the head object and the following chunks
should be in tail objects named <bucket_id>__shadow_<tag>_N, but the
latter seem to go missing sometimes. The PUT operation for these
objects is logged as successful (HTTP code 200), so I currently have
two hypotheses as to what might be happening:

1. The object is received by the radosgw process, the head object is
written successfully, and then the write for the tail object somehow
fails. The question here is whether this is possible, or whether
radosgw always waits until all operations have completed successfully
before returning the 200. This blog post [1] at least mentions some
asynchronous operations.

2. The full object is written correctly, but the tail objects are
somehow deleted afterwards. This might happen during garbage
collection if there was a collision between the tail object names of
two objects, but again I'm not sure whether this is possible.

So my questions are whether anyone else has seen this issue, and
whether it may have been fixed in Jewel or later.
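
To narrow this down, it can help to look at the backing RADOS objects
directly. Below is a minimal inspection sketch, assuming the
python-rados bindings and the Hammer-era default data pool name
.rgw.buckets (on Jewel+ it is typically default.rgw.buckets.data);
the bucket ID and object name are placeholders:

#!/usr/bin/env python
# Sketch: list the head object and candidate __shadow_ tail objects
# backing an S3 object, to check whether the tails still exist.
# Assumptions: python-rados is installed, the RGW data pool is named
# '.rgw.buckets' (Hammer default; adjust for your cluster), and
# BUCKET_ID is the bucket marker as shown by 'radosgw-admin bucket stats'.
import rados

POOL = '.rgw.buckets'          # RGW data pool (assumption)
BUCKET_ID = 'default.12345.6'  # placeholder bucket ID
OBJ_NAME = 'myobject'          # S3 object name to inspect

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx(POOL)
    try:
        head = '%s_%s' % (BUCKET_ID, OBJ_NAME)
        for obj in ioctx.list_objects():
            # Head objects are '<bucket_id>_<name>', tail objects
            # '<bucket_id>__shadow_<tag>_N'. Iterating the whole pool
            # is slow on a large cluster; this is only an illustration.
            if obj.key == head or obj.key.startswith(BUCKET_ID + '__shadow_'):
                size, mtime = ioctx.stat(obj.key)
                print('%s %d bytes' % (obj.key, size))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()

Note that the <tag> in the shadow object names comes from the head
object's manifest, so a plain listing can only show candidate tails;
it cannot by itself tie a given tail to a given head object.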

The second issue is what happens when a client tries to access such a
truncated object. radosgw first answers with the full headers,
including a Content-Length of e.g. 600000, then sends the first chunk
of data (524288 bytes) from the head object. After that it tries to
read the first tail object, but receives error -2 (ENOENT, file not
found). radosgw now tries to send a 404 status with a NoSuchKey error
in the XML body, but of course this is too late: the client sees it
as part of the object data. The connection then stays open, the
client waits for the rest of the object to be sent, and in the end
times out with an error. Or, if the original object was only slightly
larger than 512kB, the 404 response fills out the remaining
Content-Length, so the client completes the download with corrupted
data, hopefully checking the MD5 sum and noticing the issue. This
behaviour is still unchanged at least in Jewel, and you can easily
reproduce it by manually deleting the shadow object from the bucket
pool after uploading an object of the appropriate size; a sketch
follows below.
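
For completeness, here is a minimal sketch of that reproduction using
python-rados; the pool name and shadow object name are placeholders.
Equivalently you can run 'rados -p <pool> rm <key>' from the command
line.

#!/usr/bin/env python
# Sketch: reproduce the truncated-GET behaviour by deleting a tail
# (shadow) object out from under radosgw.
# Assumptions: python-rados is installed, the RGW data pool is
# '.rgw.buckets' (Hammer default), and SHADOW_KEY names a shadow
# object of a previously uploaded > 512kB object, found e.g. with
# the listing sketch above.
import rados

POOL = '.rgw.buckets'  # RGW data pool (assumption; adjust as needed)
SHADOW_KEY = 'default.12345.6__shadow_SOMETAG_1'  # placeholder name

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx(POOL)
    try:
        # After this, a GET on the S3 object through radosgw should
        # show the behaviour described above: full headers, 512kB of
        # data, then the 404/NoSuchKey body mixed into the stream.
        ioctx.remove_object(SHADOW_KEY)
    finally:
        ioctx.close()
finally:
    cluster.shutdown()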

I have created a bug report for the first issue [2]; please let me
know whether you would like a separate ticket for the second one.


No idea what's going on here, but they definitely warrant separate issues. The second one is about handling error states; the first is about inducing them. :)
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
