On a large Hammer-based cluster (> 1 Gobjects) we are seeing a small number of objects being truncated. All of these objects are between 512kB and 4MB in size and were not uploaded as multipart, so the first 512kB is stored in the head object and the following chunks should be in tail objects named <bucket_id>__shadow_<tag>_N, but the latter sometimes seem to go missing. The PUT operation for these objects is logged as successful (HTTP code 200), so I currently have two hypotheses as to what might be happening:

1. The object is received by the radosgw process, the head object is written successfully, and then the write for the tail object somehow fails. The question here is whether this is possible at all, or whether radosgw always waits until all operations have completed successfully before returning the 200. This blog post [1] at least mentions some asynchronous operations.

2. The full object is written correctly, but the tail objects are deleted afterwards. This might happen during garbage collection if there was a collision between the tail object names of two objects, but again I'm not sure whether this is possible.

So my questions are whether anyone else has seen this issue, and whether it may have been fixed in Jewel or later.

The second issue is what happens when a client tries to access such a truncated object. radosgw first answers with the full headers and a content-length of e.g. 600000, then sends the first chunk of data (524288 bytes) from the head object. After that it tries to read the first tail object, but receives error -2 (file not found). radosgw then tries to send a 404 status with a NoSuchKey error as the XML body, but of course this is too late: the client sees it as part of the object data. The connection stays open, the client waits for the rest of the object to be sent, and eventually times out with an error. Or, if the original object was only slightly larger than 512kB, the client will take the appended 404 response as the final part of the object and continue with corrupted data, hopefully checking the MD5 sum and noticing the issue.

This behaviour is still unchanged at least in Jewel, and you can easily reproduce it by manually deleting the shadow object from the bucket pool after uploading an object of a suitable size; see the sketch below.

I have created a bug report for the first issue [2]; please let me know whether you would like a separate ticket for the second one.

[1] http://www.ksingh.co.in/blog/2017/01/15/ceph-object-storage-performance-improvement-using-indexless-buckets/
[2] http://tracker.ceph.com/issues/20107
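For anyone who wants to reproduce the client-side behaviour, here is roughly what I mean. Treat it as a sketch: the bucket, pool and host names are placeholders, and the data pool is often .rgw.buckets with Hammer/Jewel defaults but may differ on your deployment.

  # Upload a non-multipart object slightly larger than the 512kB head chunk.
  dd if=/dev/urandom of=obj.bin bs=1000 count=600
  s3cmd put --acl-public obj.bin s3://testbucket/obj.bin

  # Locate the tail object in the bucket data pool.
  rados -p .rgw.buckets ls | grep __shadow_

  # Delete the shadow object to simulate the missing tail.
  rados -p .rgw.buckets rm '<bucket_id>__shadow_<tag>_1'

  # Fetch the object again. Expect a 200 with Content-Length: 600000, the
  # first 524288 bytes of data, then the NoSuchKey XML, then a stalled
  # connection until the timeout fires (curl should exit with code 28).
  curl -sv --max-time 30 http://<rgw-host>/testbucket/obj.bin -o body.bin

  # The received body is short and carries the XML error at offset 524288.
  ls -l body.bin
  dd if=body.bin bs=1 skip=524288 count=200 2>/dev/null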
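As a follow-up to the sketch above: for non-multipart objects the S3 ETag is simply the MD5 of the content, so comparing it against what was actually received makes the corruption obvious even in the "slightly larger than 512k" case where the client ends up with a seemingly complete body.

  # ETag as stored by radosgw vs. MD5 of the bytes the client received.
  s3cmd info s3://testbucket/obj.bin | grep -i md5
  md5sum body.bin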
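And for finding affected objects server-side without fetching them, something along these lines should work (again hedged, since the exact JSON layout of the manifest differs between releases): dump the object's manifest with radosgw-admin, then check that every tail object it references still exists in the pool.

  # List the rados objects the tail is striped over.
  radosgw-admin object stat --bucket=testbucket --object=obj.bin

  # Verify each referenced tail object; a missing one returns error -2.
  rados -p .rgw.buckets stat '<bucket_id>__shadow_<tag>_1'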