Re: Civet RadosGW S3 not storing complete objects; civetweb logs stop after rotation

Awesome! Thanks very much for looking into this. I also added a couple of quick responses to your comments.

> On May 12, 2015, at 7:02 PM, Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote:
> 
> I opened issue #11604 and have a fix for it. I also updated our test suite to cover the specific case you were hitting. We'll backport the fix to both hammer and firefly soon.
> 
> Thanks!
> Yehuda
> 
> ----- Original Message -----
>> From: "Yehuda Sadeh-Weinraub" <yehuda@xxxxxxxxxx>
>> To: "Mark Murphy" <murphymarkw@xxxxxxxxxxxx>
>> Cc: ceph-users@xxxxxxxxxxxxxx, "Sean Sullivan" <seapasulli@xxxxxxxxxxxx>
>> Sent: Tuesday, May 12, 2015 12:59:48 PM
>> Subject: Re:  Civet RadosGW S3 not storing complete objects; civetweb logs stop after rotation
>> 
>> Hi,
>> 
>> Thank you for a very thorough investigation. See my comments below:
>> 
>> ----- Original Message -----
>>> From: "Mark Murphy" <murphymarkw@xxxxxxxxxxxx>
>>> To: "Yehuda Sadeh-Weinraub" <yehuda@xxxxxxxxxx>
>>> Cc: "Sean Sullivan" <seapasulli@xxxxxxxxxxxx>, ceph-users@xxxxxxxxxxxxxx
>>> Sent: Tuesday, May 12, 2015 10:50:49 AM
>>> Subject: Re:  Civet RadosGW S3 not storing complete objects;
>>> civetweb logs stop after rotation
>>> 
>>> Hey Yehuda,
>>> 
>>> I work with Sean on the dev side. We thought we should put together a short
>>> report on what we’ve been seeing in the hopes that the behavior might make
>>> some sense to you.
>>> 
>>> We had originally noticed these issues a while ago with our first iteration
>>> of this particular Ceph deployment. The issues we had seen were
>>> characterized by two different behaviors:
>>> 
>>> 	• Some objects would appear truncated, returning different sizes for each
>>> 	request. Repeated attempts would eventually result in a successful
>>> 	retrieval, provided the second behavior didn't apply.
>> 
>> This really sounds like some kind of networking issue, maybe a load balancer
>> somewhere in the path that is clobbering things?

Unfortunately, we don't have any systems in between the clients and the gateway server, aside from a switch. That said...

>> 
>>> 	• Some objects would always appear truncated, missing an integer multiple
>>> 	of
>>> 	512KB.
>>> 
>>> This is where the report that we are encountering 'truncation' came from,
>>> which is slightly misleading. We recently verified that we are indeed
>>> encountering the first behavior, for which I believe Sean has supplied or
>>> will be supplying Ceph logs showcasing the server-side errors; that one is
>>> true truncation. However, the second behavior is not really truncation, but
>>> missing 512KB chunks, as Sean has brought up.
>>> 
>>> We've had some luck identifying some of the patterns that seem to be
>>> related to this issue. Without going into too much detail, we've found that
>>> the following appear to hold true for all objects affected by the second
>>> behavior:
>>> 
>>> 	• The amount of data missing is always in integer multiples of 512KB.
>>> 	• The expected file size is always found via the bucket index.
>>> 	• Ceph objects do not appear to be missing chunks or have holes in them.
>>> 	• The missing 512KB chunks are always at the beginning of multipart
>>> 	segments
>>> 	(1GB in our case).
>> 
>> This matches some of my original suspicions. Here's some basic background
>> that might help clarify things:
>> 
>> This looks like some kind of rgw bug. A radosgw object is usually composed of
>> two different parts: the object head and the object tail. The head is
>> usually composed of the first 512k of data of the object (and never more
>> than that), and the tail has the rest of the object's data. However, the
>> head data part is optional, and its size can be zero. For example, in the
>> case of multipart upload, after combining the parts, the head will not have
>> any data, and the tail will be compiled out of the different parts' data.
>> However, when dealing with multipart parts, the parts do not really have a
>> head (due to their immutability), so the part objects are expected to be
>> 4MB in size. So it seems that for some reason these specific parts were
>> treated as if they had a head, although they shouldn't have. Now, that
>> brings me to the issue, where I noticed that some of the parts were retried.
>> When this happens, the part name is different from the default part name, so
>> there's a note in the manifest, and special handling that starts at
>> specific offsets. It might be that this is related, and that the code that
>> handles the retries generates bad object parts.
>> 

We have seen some networking hiccups, some of which, if I recall correctly, are related to our switches. And we do have some multipart part-centric retry code in place. Indeed, none of our testing covered retrying individual parts, which doesn't match what our client would actually do after a connection drop or other networking issue. Sean and I can try giving this a whirl and see what happens; a rough sketch of the test we have in mind is below.
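
Something along these lines, where each part is deliberately uploaded twice with the same part number before completing the upload. Purely illustrative: boto3 is used here for brevity, the endpoint, credentials, bucket and key are made up, and our real client streams parts via mmap with its own retry wrapper.

    import boto3
    from botocore.config import Config

    # Illustrative only: endpoint, credentials, bucket and key are made up.
    s3 = boto3.client(
        "s3",
        endpoint_url="http://rgw.example.com:7480",
        aws_access_key_id="ACCESS",
        aws_secret_access_key="SECRET",
        config=Config(s3={"addressing_style": "path"}),
    )

    bucket, key = "retry-test", "retry-test-object"
    part_size = 5 * 1024 * 1024        # small parts for the test; our real
    data = b"x" * (2 * part_size)      # uploads use 1GB parts

    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

    parts = []
    for part_no in (1, 2):
        body = data[(part_no - 1) * part_size : part_no * part_size]
        # Upload the part once, then immediately send it again with the same
        # part number, mimicking a client retry after a dropped connection.
        s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_no,
                       UploadId=upload_id, Body=body)
        resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_no,
                              UploadId=upload_id, Body=body)
        parts.append({"PartNumber": part_no, "ETag": resp["ETag"]})

    s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id,
                                 MultipartUpload={"Parts": parts})

    # Finally, read the object back and make sure nothing went missing.
    returned = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    assert len(returned) == len(data) and returned == data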

And thank you for the information on how the multipart parts are handled under the hood. I’m not as well versed in Ceph or the gateway as Sean is, so this is very useful to know. Thanks!
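
Just to check that I've understood the layout correctly, here is the mental model I took away from your description, written out as a toy function. Purely illustrative: the 512KB head limit and 4MB tail objects come from your explanation and from what we see in .rgw.buckets, but the naming and details here are almost certainly not exactly what RGW does internally.

    HEAD_MAX = 512 * 1024        # head holds at most the first 512KB
    STRIPE = 4 * 1024 * 1024     # tail rados objects are 4MB

    def layout(total_size, has_head):
        """Toy model of how an S3 object's bytes map onto rados objects.

        has_head=False models a multipart part (or the combined multipart
        object), where the head carries no data and everything lives in
        4MB tail objects.  has_head=True models a plain upload, where the
        first 512KB sit in the head object."""
        pieces = []
        offset = 0
        if has_head:
            head_len = min(HEAD_MAX, total_size)
            pieces.append(("head", 0, head_len))
            offset = head_len
        idx = 1
        while offset < total_size:
            length = min(STRIPE, total_size - offset)
            pieces.append(("tail_%d" % idx, offset, length))
            offset += length
            idx += 1
        return pieces

    # A 1GB multipart part should be exactly 256 x 4MB tail objects...
    assert all(l == STRIPE for _, _, l in layout(1 << 30, has_head=False))
    # ...whereas if RGW wrongly treats the part as having a head, the data
    # after the first 512KB lands in the tail and the head can go missing.
    print(layout(1 << 30, has_head=True)[:2])

If that model is anywhere near right, a part being wrongly treated as if it had a head would explain a hole of exactly 512KB at the start of a 1GB segment, which is precisely what we see.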

>> 
>> 
>>> 	• For large files missing multiple chunks, the segments affected appear to
>>> 	be clustered and contiguous.
>>> 
>> 
>> That would point at a cluster of retries, maybe due to networking issues
>> around the time these were created.
>> 
>>> The first pattern was identified when we noticed that the bucket index and
>>> the object manifest differed in reported size. This is useful as a quick
>>> method of identifying affected objects. We’ve used this to avoid having to
>>> pull down and check each object individually. In total, we have 108
>>> affected
>>> objects, which translates to approximately 0.25% of our S3 objects.
>>> 
>>> We noticed that the bucket index always reports the object size that would
>>> be
>>> expected had the upload gone correctly. Since we only ever report the
>>> segment sizes to the gateway, this would suggest that the segment sizes
>>> were
>>> reported accurately and aggregated correctly on the server side.
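
(Adding a little detail to the above for anyone following along: the mismatch check we've been running is essentially the script below. The radosgw-admin field names are from memory and may not match your version exactly, so treat it as a sketch rather than a recipe.)

    import json
    import subprocess

    def admin(*args):
        # Thin wrapper: run radosgw-admin and parse its JSON output.
        out = subprocess.check_output(("radosgw-admin",) + args)
        return json.loads(out.decode("utf-8"))

    def mismatched_objects(bucket):
        """Yield keys whose size in the bucket index differs from the size
        reported by 'object stat' (i.e. the manifest).  The 'meta'/'size'
        and 'size' field names are from memory and may vary by version."""
        for entry in admin("bucket", "list", "--bucket=" + bucket):
            key = entry["name"]
            index_size = entry["meta"]["size"]
            stat = admin("object", "stat",
                         "--bucket=" + bucket, "--object=" + key)
            if stat["size"] != index_size:
                yield key, index_size, stat["size"]

    if __name__ == "__main__":
        for key, idx_size, man_size in mismatched_objects("tcga_cghub_protected"):
            print("%s  index=%d  manifest=%d" % (key, idx_size, man_size))
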
>>> 
>>> Sean identified the Ceph objects that compose one of our affected S3
>>> objects.
>>> We thought we might see the first Ceph object missing some data, but found
>>> it to be a full 4MB. Retrieving the first Ceph object and comparing it to
>>> the bytes in the corresponding file, it appears that the Ceph object
>>> matches
>>> the 4MB of the file after the first 512KB. We took this as evidence that
>>> the
>>> data was never getting to Ceph in the first place. However, in our testing,
>>> we were unable to get the gateway to accept segments with less data than
>>> reported.
>>> 
>>> Dissecting some of the affected objects, we were able to identify the
>>> positions of the missing 512KB chunks. The positions align with the
>>> starting positions of 1GB segments, matching our multipart segment length.
>>> Additionally, for files where we had a large number of missing chunks, the
>>> affected segments appeared to cluster (e.g. segments 161-163 and 170-189 in
>>> a ~220-segment file). However, there did not appear to be a pattern as to
>>> which segments were affected.
>>> 
>>> 
>>> Some of our testing:
>>> 
>>> One of our first thoughts was that it might be only specific files, perhaps
>>> particular file sizes, that caused the behavior to manifest. We generated
>>> fresh copies of the files in question and re-uploaded them without error.
>>> We
>>> repeatedly uploaded the same object, and always received positive results.
>>> 
>>> We next asked whether this could be connection related. We simulated a
>>> connection drop by killing the upload processes early. This resulted in
>>> uncompleted multipart uploads, as expected. Completing these resulted in
>>> truncated objects with a mismatch between the bucket index and the object
>>> manifest. This is the only test that resulted in this size mismatch
>>> behavior; however, those objects were missing data from the end of the
>>> file, specifically the last segment - actual truncation vs. the missing
>>> 512KB at the beginning of segments.
>>> 
>>> We then attempted to forge false multipart requests in an attempt to
>>> replicate the behavior of a misbehaving client. Providing data different
>>> from what was used to produce the md5 digest always resulted in a BadDigest
>>> error. Forging the content-length header but providing the correct md5
>>> digest, to simulate a portion of a file not being sent, would cause a short
>>> hang and then a PermissionDenied error. In no case were we able to get a
>>> positive response from the gateway.
>>> 
>>> 
>>> What is confusing about all of this is that the bucket index is correct.
>>> That is, the index has the size expected had the upload completed
>>> correctly. This would suggest that the gateway correctly inferred the full
>>> size from the segment sizes, which means that our upload must have
>>> correctly reported the full file size via the segment sizes. But every time
>>> we've attempted to send less data than is reported to the gateway, as would
>>> be the case if the first 512KB were never sent, we encounter an error.
>>> 
>>> We’re still working to track down where the issue is. We’re hoping it’s
>>> just
>>> something in the client that hiccuped, but any insight into these behaviors
>>> from a gateway perspective would be greatly appreciated.
>>> 
>>> Thanks,
>>> Mark
>> 
>> Thank you,
>> 
>> I'll also try to check things on my end, considering the new info.
>> 
>> Yehuda
>> 
>>> 
>>> 
>>>> On May 7, 2015, at 12:09 AM, Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx>
>>>> wrote:
>>>> 
>>>> 
>>>> 
>>>> ----- Original Message -----
>>>>> From: "Sean" <seapasulli@xxxxxxxxxxxx>
>>>>> To: "Yehuda Sadeh-Weinraub" <yehuda@xxxxxxxxxx>
>>>>> Cc: ceph-users@xxxxxxxxxxxxxx
>>>>> Sent: Tuesday, May 5, 2015 12:14:19 PM
>>>>> Subject: Re:  Civet RadosGW S3 not storing complete objects;
>>>>> civetweb logs stop after rotation
>>>>> 
>>>>> 
>>>>> 
>>>>> Hello Yehuda and the rest of the mailing list.
>>>>> 
>>>>> 
>>>>> My main question currently is: why are the bucket index and the object
>>>>> manifest ever different? Based on how we are uploading data, I do not
>>>>> think that the rados gateway should ever know the full file size without
>>>>> having all of the objects within ceph at one point in time. So after the
>>>>> multipart upload is marked as completed, the rados gateway should cat
>>>>> through all of the objects and assemble the complete object, correct?
>>>> 
>>>> That's what *should* happen, but obviously there's some bug there.
>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Secondly,
>>>>> 
>>>>> I think I am not understanding the process for grabbing all of the parts
>>>>> correctly. To continue with my example file
>>>>> "86b6fad8-3c53-465f-8758-2009d6df01e9/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam"
>>>>> in bucket tcga_cghub_protected, I would use the following to grab the
>>>>> prefix:
>>>>> 
>>>>> 
>>>>> prefix=$(radosgw-admin object stat --bucket=tcga_cghub_protected
>>>>> --object=86b6fad8-3c53-465f-8758-2009d6df01e9/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam
>>>>> | grep -iE '"prefix"' | awk -F"\"" '{print $4}')
>>>>> 
>>>>> 
>>>>> which should take everything between the quotes for the prefix key and
>>>>> give me the value.
>>>>> 
>>>>> 
>>>>> In this case:
>>>>> 
>>>>> "prefix":
>>>>> "86b6fad8-3c53-465f-8758-2009d6df01e9\/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam.2\/YAROhWaAm9LPwCHeP55cD4CKlLC0B4S",
>>>>> 
>>>>> 
>>>>> So
>>>>> 
>>>>> lacadmin@kh10-9:~$ echo ${prefix}
>>>>> 
>>>>> 86b6fad8-3c53-465f-8758-2009d6df01e9\/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam.2\/YAROhWaAm9LPwCHeP55cD4CKlLC0B4S
>>>>> 
>>>>> 
>>>>> From here I list all of the objects in the .rgw.buckets pool and grep for
>>>>> that prefix, which yields 1335 objects. If I then cat all of these
>>>>> objects together I only end up with a 5468160 byte file, which is 2G
>>>>> short of what the object manifest says it should be. If I grab the file
>>>>> and tail the Rados gateway log I end up with 1849 objects, and when I sum
>>>>> them all up I
>>>> 
>>>> How are these objects named?
>>>> 
>>>>> end up with 7744771642, which is the same size that the manifest reports.
>>>>> I understand that this does nothing other than verify the manifest's
>>>>> accuracy, but I still find it interesting. The missing chunks may still
>>>>> exist in ceph outside of the object manifest and tagged with the same
>>>>> prefix, correct? Or am I misunderstanding something?
>>>> 
>>>> Either it's missing a chunk, or one of the objects is truncated. Can you
>>>> stat all the parts? I expect most of the objects to have two different
>>>> sizes (e.g., 4MB, 1MB), but it is likely that the last part is smaller,
>>>> and maybe there is another object that is missing 512k.
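
For reference, the stat-and-sum pass Sean describes above amounts to roughly the following. This is a sketch: the pool name matches our setup, the prefix is the one pulled out of 'object stat' as Sean shows above, and the parsing of the 'rados stat' output is an assumption about its format rather than anything guaranteed.

    import re
    import subprocess

    POOL = ".rgw.buckets"

    def rados(*args):
        # Thin wrapper around the rados CLI for our .rgw.buckets pool.
        return subprocess.check_output(("rados", "-p", POOL) + args).decode()

    def sum_parts(prefix):
        """List every rados object carrying the manifest prefix, stat each
        one, and add up the sizes.  The regex assumes 'rados stat' prints a
        line ending in 'size <bytes>'."""
        total, count = 0, 0
        for name in rados("ls").splitlines():
            if prefix not in name:
                continue
            stat_line = rados("stat", name)
            match = re.search(r"size\s+(\d+)", stat_line)
            if match is None:
                raise RuntimeError("could not parse: " + stat_line)
            total += int(match.group(1))
            count += 1
        return count, total

    # NB: the prefix as printed by 'object stat' contains JSON-escaped
    # slashes ("\/"); turn those back into plain "/" before matching, e.g.
    #   count, total = sum_parts(prefix.replace("\\/", "/"))
    # and then compare 'total' against the size in the object manifest.
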
>>>> 
>>>>> 
>>>>> 
>>>>> We have over 40384 files in the tcga_cghub_protected bucket and only 66
>>>>> of these files are suffering from this truncation issue. What I need to
>>>>> know is: is this happening on the gateway side or on the client side?
>>>>> Next, I need to know what possible actions could result in the bucket
>>>>> index and the object manifest being mismatched like this, given that
>>>>> 40318 out of 40384 files are working without issue.
>>>>> 
>>>>> 
>>>>> The truncated files are of all different sizes (5 megabytes to 980
>>>>> gigabytes) and the truncation seems to be all over. By "all over" I mean
>>>>> some files are missing the first few bytes, which should read "bam", and
>>>>> some are missing parts in the middle.
>>>> 
>>>> Can you give an example of an object manifest for a broken object, and
>>>> all the rados objects that build it (e.g., the output of 'rados stat' on
>>>> these objects)? A smaller object might be easier.
>>>> 
>>>>> 
>>>>> 
>>>>> So our upload code uses mmap to stream chunks of the file to the Rados
>>>>> gateway via a multipart upload, but nowhere on the client side do we have
>>>>> a direct reference to the files we are using, nor do we specify the size
>>>>> in any way. So where is the gateway getting the correct complete file
>>>>> size from, and how is the bucket index showing the intended file size?
>>>>> 
>>>>> 
>>>>> This implies that, at some point in time, ceph was able to see all of
>>>>> the parts of the file and calculate the correct total size. This, to me,
>>>>> seems like a rados gateway bug regardless of how the file is being
>>>>> uploaded. I think that the RGW should be able to be fuzzed and still
>>>>> store the data correctly.
>>>>> 
>>>>> 
>>>>> Why is the bucket list not matching the bucket index, and how can I
>>>>> verify that the data is not being corrupted by the RGW or, worse, after
>>>>> it is committed to ceph?
>>>> 
>>>> That's what we're trying to find out.
>>>> 
>>>> Thanks,
>>>> Yehuda
>>> 
>>> 
>> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




