Re: Civet RadosGW S3 not storing complete objects; civetweb logs stop after rotation

I opened issue #11604 and have a fix for it. I also updated our test suite to cover the specific case you were hitting. We'll backport the fix to both hammer and firefly soon.

Thanks!
Yehuda

----- Original Message -----
> From: "Yehuda Sadeh-Weinraub" <yehuda@xxxxxxxxxx>
> To: "Mark Murphy" <murphymarkw@xxxxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx, "Sean Sullivan" <seapasulli@xxxxxxxxxxxx>
> Sent: Tuesday, May 12, 2015 12:59:48 PM
> Subject: Re:  Civet RadosGW S3 not storing complete objects; civetweb logs stop after rotation
> 
> Hi,
> 
> Thank you for a very thorough investigation. See my comments below:
> 
> ----- Original Message -----
> > From: "Mark Murphy" <murphymarkw@xxxxxxxxxxxx>
> > To: "Yehuda Sadeh-Weinraub" <yehuda@xxxxxxxxxx>
> > Cc: "Sean Sullivan" <seapasulli@xxxxxxxxxxxx>, ceph-users@xxxxxxxxxxxxxx
> > Sent: Tuesday, May 12, 2015 10:50:49 AM
> > Subject: Re:  Civet RadosGW S3 not storing complete objects;
> > civetweb logs stop after rotation
> > 
> > Hey Yehuda,
> > 
> > I work with Sean on the dev side. We thought we should put together a short
> > report on what we’ve been seeing in the hopes that the behavior might make
> > some sense to you.
> > 
> > We had originally noticed these issues a while ago with our first iteration
> > of this particular Ceph deployment. The issues we had seen were
> > characterized by two different behaviors:
> > 
> > 	• Some objects would appear truncated, returning a different size for
> > 	each request. Repeated attempts would eventually result in a successful
> > 	retrieval if the second behavior didn’t apply (a quick check for this is
> > 	sketched below).
> 
> This really sounds like some kind of networking issue, maybe a load balancer
> in the path that clobbers things?
> 
> > 	• Some objects would always appear truncated, missing an integer multiple
> > 	of 512KB.
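> > 
> > (The quick check mentioned above is nothing fancy: we just fetch the same
> > key a few times and compare sizes. Bucket, key, and s3cmd below are only
> > placeholders for whatever client you point at the gateway.
> > 
> > for i in 1 2 3 4 5; do
> >   s3cmd get --force s3://some-bucket/some-key /tmp/try.$i >/dev/null
> >   stat -c%s /tmp/try.$i
> > done
> > 
> > If the printed sizes differ between runs, the object falls under the first
> > behavior.)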
> > 
> > This is where the report that we are encountering ‘truncation’ came from,
> > which is slightly misleading. We recently verified that we are indeed
> > encountering the first behavior, which is true truncation; I believe Sean
> > has supplied, or will be supplying, Ceph logs showcasing the server-side
> > errors. The second behavior, however, is not really truncation but missing
> > 512KB chunks, as Sean has brought up.
> > 
> > We’ve had some luck identifying patterns that seem to be related to this
> > issue. Without going into too much detail, we’ve found that the following
> > appear to hold true for all objects affected by the second behavior:
> > 
> > 	• The amount of data missing is always in integer multiples of 512KB.
> > 	• The expected file size is always found via the bucket index.
> > 	• Ceph objects do not appear to be missing chunks or have holes in them.
> > 	• The missing 512KB chunks are always at the beginning of multipart
> > 	segments (1GB in our case).
> 
> This matches some of my original suspicions. Here's some basic background
> that might help clarify things:
> 
> This looks like some kind of rgw bug. A radosgw object is usually composed of
> two different parts: the object head and the object tail. The head usually
> holds the first 512k of the object's data (and never more than that), and the
> tail has the rest of the object's data. However, the head data part is
> optional, and it can be zero. For example, in the case of a multipart upload,
> after combining the parts the head will not have any data, and the tail will
> be compiled out of the different parts' data. Moreover, the multipart parts
> themselves do not really have a head (due to their immutability), so the part
> objects are expected to be 4MB. So it seems that for some reason these
> specific parts were treated as if they had a head, although they shouldn't
> have. Now, that brings me to the issue: I noticed that some of the parts were
> retried. When that happens, the part name is different from the default part
> name, so there's a note in the manifest and special handling that starts at
> specific offsets. It might be that this is related, and that the code
> handling the retries generates bad object parts.
> 
> 
> 
> > 	• For large files missing multiple chunks, the segments affected appear to
> > 	be clustered and contiguous.
> > 
> 
> That would point at a cluster of retries, maybe due to networking issues
> around the time these were created.
> 
> > The first pattern was identified when we noticed that the bucket index and
> > the object manifest differed in reported size. This is useful as a quick
> > method of identifying affected objects; we’ve used it to avoid having to
> > pull down and check each object individually. In total, we have 108 affected
> > objects, which translates to approximately 0.25% of our S3 objects.
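> > 
> > For anyone who wants to run the same check: it boils down to comparing the
> > size an S3 listing reports (which comes from the bucket index) against the
> > size recorded in the manifest that radosgw-admin prints. Roughly like the
> > following, where keys.txt is just a list of keys to check, s3cmd stands in
> > for whatever listing client you use, and "obj_size" is the manifest size
> > field in our output (grep for whatever yours calls it):
> > 
> > bucket=tcga_cghub_protected
> > while read key; do
> >   idx=$(s3cmd ls "s3://${bucket}/${key}" | awk '{print $3}')
> >   man=$(radosgw-admin object stat --bucket=${bucket} --object="${key}" |
> >         grep -m1 '"obj_size"' | tr -dc '0-9')
> >   [ "${idx}" != "${man}" ] && echo "MISMATCH ${key} index=${idx} manifest=${man}"
> > done < keys.txt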
> > 
> > We noticed that the bucket index always reports the object size that would
> > be expected had the upload gone correctly. Since we only ever report the
> > segment sizes to the gateway, this would suggest that the segment sizes were
> > reported accurately and aggregated correctly server side.
> > 
> > Sean identified the Ceph objects that compose one of our affected S3
> > objects. We thought we might see the first Ceph object missing some data,
> > but found it to be a full 4MB. Retrieving the first Ceph object and
> > comparing it to the bytes in the corresponding file, it appears that the
> > Ceph object matches the 4MB of the file after the first 512KB. We took this
> > as evidence that the data was never getting to Ceph in the first place.
> > However, in our testing, we were unable to get the gateway to accept
> > segments with less data than reported.
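> > 
> > For what it's worth, the comparison was nothing more elaborate than
> > something like this, where first_obj is a placeholder for whichever rados
> > object the manifest lists first for the affected part, local_copy.bam is
> > our local copy of the file, and skip=1 assumes that part is the first one
> > (for part N it would be skip=$((N*2048 + 1))):
> > 
> > rados -p .rgw.buckets get "${first_obj}" /tmp/first_obj
> > cmp /tmp/first_obj <(dd if=local_copy.bam bs=512K skip=1 count=8 2>/dev/null) &&
> >   echo "rados object matches the file data after the first 512KB"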
> > 
> > Dissecting some of the affected objects, we were able to identify the
> > positions of the missing 512KB segments. The positions align with the
> > starting positions of 1GB segments, matching our multipart segment length.
> > Additionally, for files where we had a large number of missing chunks, the
> > segments appeared to cluster (e.g. segments 161-163 and 170-189 in a ~220
> > segment file). However, there did not appear to be a pattern as to which
> > segments were affected.
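> > 
> > Roughly, the walk that finds them looks like this (a simplified bash
> > sketch; the file names are placeholders, and it assumes the only damage is
> > at most one missing leading 512KB chunk per 1GB part):
> > 
> > orig=original.bam; down=downloaded.bam
> > chunk=$((512*1024)); part=$((1024*1024*1024))
> > nparts=$(( ( $(stat -c%s "$orig") + part - 1 ) / part ))
> > d=0   # read position in the download, counted in 512KB blocks
> > for (( i=0; i<nparts; i++ )); do
> >   o=$(( i * part / chunk ))          # part start in the original, in 512KB blocks
> >   if cmp -s <(dd if="$orig" bs=$chunk skip=$o count=1 2>/dev/null) \
> >             <(dd if="$down" bs=$chunk skip=$d count=1 2>/dev/null); then
> >     d=$(( d + part / chunk ))        # part intact: advance a full 1GB
> >   else
> >     echo "part $i is missing its first 512KB"
> >     d=$(( d + part / chunk - 1 ))    # short part: advance 1GB minus one chunk
> >   fi
> > done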
> > 
> > 
> > Some of our testing:
> > 
> > One of our first thoughts was that it might be only specific files, perhaps
> > particular file sizes, that caused the behavior to manifest. We generated
> > fresh copies of the files in question and re-uploaded them without error.
> > We
> > repeatedly uploaded the same object, and always received positive results.
> > 
> > We next asked whether this could be connection related. We simulated a
> > connection drop by killing the upload processes early. This resulted in
> > uncompleted multipart uploads, as expected. Completing these resulted in
> > truncated objects with a mismatch between the bucket index and the object
> > manifest. This is the only test that produced this size mismatch behavior;
> > however, the objects were missing data from the end of the file,
> > specifically the last segment: actual truncation rather than the missing
> > 512KB at the beginning of segments.
> > 
> > We then tried to forge false multipart requests to replicate the behavior
> > of a bad client. Providing data different from what was used to produce the
> > md5 digest always resulted in a BadDigest error. Forging the content-length
> > header but providing the correct md5 digest, to simulate a portion of a file
> > not being sent, caused a short hang and then a PermissionDenied error. In no
> > case were we able to get a positive response from the gateway.
> > 
> > 
> > What is confusing about all of this is that the bucket index is correct.
> > That is, the index has the size expected had the upload completed correctly.
> > This would suggest that the gateway correctly inferred the full size from
> > the segment sizes, which means that our upload reported the full file size
> > correctly via the segment sizes. But every time we’ve attempted to send less
> > data than is reported to the gateway, as would be the case if the first
> > 512KB were never sent, we encounter an error.
> > 
> > We’re still working to track down where the issue is. We’re hoping it’s
> > just
> > something in the client that hiccuped, but any insight into these behaviors
> > from a gateway perspective would be greatly appreciated.
> > 
> > Thanks,
> > Mark
> 
> Thank you,
> 
> I'll also try to check things on my end, considering the new info.
> 
> Yehuda
> 
> > 
> > 
> > > On May 7, 2015, at 12:09 AM, Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx>
> > > wrote:
> > > 
> > > 
> > > 
> > > ----- Original Message -----
> > >> From: "Sean" <seapasulli@xxxxxxxxxxxx>
> > >> To: "Yehuda Sadeh-Weinraub" <yehuda@xxxxxxxxxx>
> > >> Cc: ceph-users@xxxxxxxxxxxxxx
> > >> Sent: Tuesday, May 5, 2015 12:14:19 PM
> > >> Subject: Re:  Civet RadosGW S3 not storing complete objects;
> > >> civetweb logs stop after rotation
> > >> 
> > >> 
> > >> 
> > >> Hello Yehuda and the rest of the mailing list.
> > >> 
> > >> 
> > >> My main question currently is: why are the bucket index and the object
> > >> manifest ever different? Based on how we are uploading data, I do not
> > >> think the rados gateway should ever know the full file size without
> > >> having all of the objects within ceph at one point in time. So after the
> > >> multipart upload is marked as completed, the rados gateway should cat
> > >> through all of the objects and make the complete object, correct?
> > > 
> > > That's what *should* happen, but obviously there's some bug there.
> > > 
> > >> 
> > >> 
> > >> 
> > >> Secondly,
> > >> 
> > >> I think I am not understanding the process to grab all of the parts
> > >> correctly. To continue with my example file
> > >> "86b6fad8-3c53-465f-8758-2009d6df01e9/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam"
> > >> in bucket tcga_cghub_protected, I would use the following to grab the
> > >> prefix:
> > >> 
> > >> 
> > >> prefix=$(radosgw-admin object stat --bucket=tcga_cghub_protected
> > >> --object=86b6fad8-3c53-465f-8758-2009d6df01e9/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam
> > >> | grep -iE '"prefix"' | awk -F"\"" '{print $4}')
> > >> 
> > >> 
> > >> Which should take everything between quotes for the prefix key and give
> > >> me
> > >> the value.
> > >> 
> > >> 
> > >> In this case::
> > >> 
> > >> "prefix":
> > >> "86b6fad8-3c53-465f-8758-2009d6df01e9\/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam.2\/YAROhWaAm9LPwCHeP55cD4CKlLC0B4S",
> > >> 
> > >> 
> > >> So
> > >> 
> > >> lacadmin@kh10-9:~$ echo ${prefix}
> > >> 
> > >> 86b6fad8-3c53-465f-8758-2009d6df01e9\/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam.2\/YAROhWaAm9LPwCHeP55cD4CKlLC0B4S
> > >> 
> > >> 
> > >> From here I list all of the objects in the .rgw.buckets pool and grep
> > >> for said prefix, which yields 1335 objects. If I cat all of these objects
> > >> together I only end up with a 5468160 byte file, which is 2G short of
> > >> what the object manifest says it should be. If I grab the file and tail
> > >> the Rados gateway log I end up with 1849 objects, and when I sum them all
> > >> up I
> > > 
> > > How are these objects named?
> > > 
> > >> end up with 7744771642, which is the same size that the manifest reports.
> > >> I understand that this does nothing other than verify the manifest's
> > >> accuracy, but I still find it interesting. The missing chunks may still
> > >> exist in ceph outside of the object manifest, tagged with the same
> > >> prefix, correct? Or am I misunderstanding something?
> > > 
> > > Either it's missing a chunk, or one of the objects is truncated. Can you
> > > stat all the parts? I expect most of the objects to have two different
> > > sizes (e.g., 4MB, 1MB), but it is likely that the last part is smaller,
> > > and maybe there is another object that is missing 512k.
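> > > 
> > > Something along these lines should do it, reusing the prefix you already
> > > extracted (pool name taken from your listing; the awk split assumes the
> > > stat lines end in ", size N", so adjust if your rados prints them
> > > differently):
> > > 
> > > rados -p .rgw.buckets ls | grep "${prefix}" | while read part; do
> > >   rados -p .rgw.buckets stat "${part}"
> > > done | tee part_stats.txt |
> > >   awk -F'size ' '{s+=$2} END {print s, "bytes total"}'
> > > 
> > > Any part that is neither 4MB nor the expected tail size would be the one
> > > to look at.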
> > > 
> > >> 
> > >> 
> > >> We have over 40384 files in the tcga_cghub_protected bucket and only 66
> > >> of these files are suffering from this truncation issue. What I need to
> > >> know is: is this happening on the gateway side or on the client side?
> > >> Next I need to know what possible actions could leave the bucket index
> > >> and the object manifest mismatched like this, as 40318 out of 40384 are
> > >> working without issue.
> > >> 
> > >> 
> > >> The truncated files are of all different sizes (5 megabytes - 980
> > >> gigabytes)
> > >> and the truncation seems to be all over. By "all over" I mean some files
> > >> are
> > >> missing the first few bytes that should read "bam" and some are missing
> > >> parts in the middle.
> > > 
> > > Can you give an example of an object manifest for a broken object, and
> > > all the rados objects that build it (e.g., the output of 'rados stat' on
> > > these objects)? A smaller object might be easier.
> > > 
> > >> 
> > >> 
> > >> So our upload code is using mmap to stream chunks of the file to the
> > >> Rados gateway via a multipart upload, but nowhere on the client side do
> > >> we have a direct reference to the files we are using, nor do we specify
> > >> the size in any way. So where is the gateway getting the correct complete
> > >> file size from, and how is the bucket index showing the intended file
> > >> size?
> > >> 
> > >> 
> > >> This implies that, at some point in time, ceph was able to see all of
> > >> the
> > >> parts of the file and calculate the correct total size. This to me seems
> > >> like a rados gateway bug regardless of how the file is being uploaded. I
> > >> think that the RGW should be able to be fuzzed and still store the data
> > >> correctly.
> > >> 
> > >> 
> > >> Why is the bucket list not matching the bucket index, and how can I
> > >> verify that the data is not being corrupted by the RGW or, worse, after
> > >> it is committed to ceph?
> > > 
> > > That's what we're trying to find out.
> > > 
> > > Thanks,
> > > Yehuda
> > 
> > 
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




