I opened issue #11604, and have a fix for the issue. I updated our test suite to cover the specific issue that you were hitting. We'll backport the fix to both hammer and firefly soon. Thanks! Yehuda ----- Original Message ----- > From: "Yehuda Sadeh-Weinraub" <yehuda@xxxxxxxxxx> > To: "Mark Murphy" <murphymarkw@xxxxxxxxxxxx> > Cc: ceph-users@xxxxxxxxxxxxxx, "Sean Sullivan" <seapasulli@xxxxxxxxxxxx> > Sent: Tuesday, May 12, 2015 12:59:48 PM > Subject: Re: Civet RadosGW S3 not storing complete obects; civetweb logs stop after rotation > > Hi, > > Thank you for a very thorough investigation. See my comments below: > > ----- Original Message ----- > > From: "Mark Murphy" <murphymarkw@xxxxxxxxxxxx> > > To: "Yehuda Sadeh-Weinraub" <yehuda@xxxxxxxxxx> > > Cc: "Sean Sullivan" <seapasulli@xxxxxxxxxxxx>, ceph-users@xxxxxxxxxxxxxx > > Sent: Tuesday, May 12, 2015 10:50:49 AM > > Subject: Re: Civet RadosGW S3 not storing complete obects; > > civetweb logs stop after rotation > > > > Hey Yehuda, > > > > I work with Sean on the dev side. We thought we should put together a short > > report on what we’ve been seeing in the hopes that the behavior might make > > some sense to you. > > > > We had originally noticed these issues a while ago with our first iteration > > of this particular Ceph deployment. The issues we had seen were > > characterized by two different behaviors: > > > > • Some objects would appear truncated, returning different sizes for each > > request. Repeated attempts would eventually result in a successful > > retrieval if the second behavior doesn’t apply. > > This really sound like some kind of networking issue, maybe a load balancer > that is on the way that clobbers things? > > > • Some objects would always appear truncated, missing an integer multiple > > of > > 512KB. > > > > This is where the report that we are encountering ‘truncation’ came from, > > which is slightly misleading. We recently verified that we are indeed > > encountering the first behavior, for which I believe Sean has supplied or > > will be supplying Ceph logs showcasing the server-side errors, and is true > > truncation. However, the second behavior is not really truncation, but > > missing 512KB chunks, as Sean has brought up. > > > > We’ve had some luck with identifying some of the patterns that are > > seemingly > > related to this issue. Without going into too great of detail, we’ve found > > the following appear to hold true for all objects affected by the second > > behavior: > > > > • The amount of data missing is always in integer multiples of 512KB. > > • The expected file size is always found via the bucket index. > > • Ceph objects do not appear to be missing chunks or have holes in them. > > • The missing 512KB chunks are always at the beginning of multipart > > segments > > (1GB in our case). > > This matches some of my original suspicions. Here's some basic background > that might help clarify things: > > This looks like some kind of rgw bug. A radosgw object is usually composed of > two different parts: the object head, and the object tail. The head is > usually composed of the first 512k of data of the object (and never more > than that), and the tail has the rest of the object's data. However, the > head data part is optional, and it can be zero. For example, in the case of > multipart upload, after combining the parts, the head will not have any > data, and the tail will be compiled out of the different parts data. > However, when dealing with multipart parts, the parts do not really have a > head (due to their immutability), so it is expected that the part object > sizes to be 4MB. So it seems that for some reason these specific parts were > treated as if they had a head, although they shouldn't have. Now, that > brings me to the issue, where I noticed that some of the parts were retried. > When this happens, the part name is different than the default part name, so > there's a note in the manifest, and a special handling that start at > specific offsets. It might be that this is related, and the code that > handles the retries generate bad object parts. > > > > > • For large files missing multiple chunks, the segments affected appear to > > be clustered and contiguous. > > > > That would point at a cluster of retries, maybe due to networking issues > around the time these were created. > > > The first pattern was identified when we noticed that the bucket index and > > the object manifest differed in reported size. This is useful as an quick > > method of identifying affected objects. We’ve used this to avoid having to > > pull down and check each object individually. In total, we have 108 > > affected > > objects, which translates to approximately 0.25% of our S3 objects. > > > > We noticed that the bucket index always reports the object size that would > > be > > expected had the upload gone correctly. Since we only ever report the > > segment sizes to the gateway, this would suggest that the segment sizes > > were > > reported accurately and aggregated correctly server side. > > > > Sean identified the Ceph objects that compose one of our affected S3 > > objects. > > We thought we might see the first Ceph object missing some data, but found > > it to be a full 4MB. Retrieving the first Ceph object and comparing it to > > the bytes in the corresponding file, it appears that the Ceph object > > matches > > the 4MB of the file after the first 512KB. We took this as evidence that > > the > > data was never getting to Ceph in the first place. However, in our testing, > > we were unable to get the gateway to accept segments with less data than > > reported. > > > > Dissecting some of the affected objects, we were able to identify the > > positions of the missing 512KB segments. The positions align with the > > starting positions of 1GB segments, matching our multipart segment length. > > Additionally, for files where we had a large number of missing chunks, the > > segments appeared to cluster (e.g. segments 161-163 and 170-189 in a ~220 > > segment file). However, there did not appear to be a pattern as to which > > segments were affected. > > > > > > Some of our testing: > > > > One of our first thoughts was that it might be only specific files, perhaps > > particular file sizes, that caused the behavior to manifest. We generated > > fresh copies of the files in question and re-uploaded them without error. > > We > > repeatedly uploaded the same object, and always received positive results. > > > > We next asked whether this could be connection related. We simulated a > > connection drop by killing the upload processes early. This resulted in > > uncompleted multipart uploads, as expected. Completing these resulted in > > truncated objects with a mismatch between the bucket index and the object > > manifest. This is the only test that resulted in this size mismatch > > behavior, however the objects were missing data from the end of the file, > > specifically the last segment - actual truncation vs the missing 512KB at > > the beginning of segments. > > > > We then attempted to forge false multipart requests in an attempt to > > replicate the behavior assuming a bad client. Providing data differing than > > what was used to produce the md5 digest always resulted in a BadDigest > > error. Forging the content-length header but providing the correct md5 > > digest, to simulate a portion of a file not being sent, would cause a short > > hang and then a PermissionDenied error. In no cases were we able to get a > > positive response from the gateway. > > > > > > What is confusing about all of this is that the bucket index is correct. > > That > > is, the index has the size expected had the upload completed correctly. > > This > > would suggest that the gateway correctly inferred the full size from the > > segment sizes, which means that our upload had reported correctly the full > > file size via segment sizes. But every time we’ve attempted to send less > > data than is reported to the gateway, as would be the case if the first > > 512KB were never sent, we encounter an error. > > > > We’re still working to track down where the issue is. We’re hoping it’s > > just > > something in the client that hiccuped, but any insight into these behaviors > > from a gateway perspective would be greatly appreciated. > > > > Thanks, > > Mark > > Thank you, > > I'll also try to check things on my end, considering the new info. > > Yehuda > > > > > > > > On May 7, 2015, at 12:09 AM, Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> > > > wrote: > > > > > > > > > > > > ----- Original Message ----- > > >> From: "Sean" <seapasulli@xxxxxxxxxxxx> > > >> To: "Yehuda Sadeh-Weinraub" <yehuda@xxxxxxxxxx> > > >> Cc: ceph-users@xxxxxxxxxxxxxx > > >> Sent: Tuesday, May 5, 2015 12:14:19 PM > > >> Subject: Re: Civet RadosGW S3 not storing complete obects; > > >> civetweb logs stop after rotation > > >> > > >> > > >> > > >> Hello Yehuda and the rest of the mailing list. > > >> > > >> > > >> My main question currently is why are the bucket index and the object > > >> manifest ever different? Based on how we are uploading data I do not > > >> think > > >> that the rados gateway should ever know the full file size without > > >> having > > >> all of the objects within ceph at one point in time. So after the > > >> multipart > > >> is marked as completed Rados gateway should cat through all of the > > >> objects > > >> and make a complete part, correct? > > > > > > That's what *should* happen, but obviously there's some bug there. > > > > > >> > > >> > > >> > > >> Secondly, > > >> > > >> I think I am not understanding the process to grab all of the parts > > >> correctly. To continue to use my example file > > >> "86b6fad8-3c53-465f-8758-2009d6df01e9/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam" > > >> in bucket tcga_cghub_protected. I would be using the following to grab > > >> the > > >> prefix: > > >> > > >> > > >> prefix=$(radosgw-admin object stat --bucket=tcga_cghub_protected > > >> --object=86b6fad8-3c53-465f-8758-2009d6df01e9/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam > > >> | grep -iE '"prefix"' | awk -F"\"" '{print $4}') > > >> > > >> > > >> Which should take everything between quotes for the prefix key and give > > >> me > > >> the value. > > >> > > >> > > >> In this case:: > > >> > > >> "prefix": > > >> "86b6fad8-3c53-465f-8758-2009d6df01e9\/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam.2\/YAROhWaAm9LPwCHeP55cD4CKlLC0B4S", > > >> > > >> > > >> So > > >> > > >> lacadmin@kh10-9:~$ echo ${prefix} > > >> > > >> 86b6fad8-3c53-465f-8758-2009d6df01e9\/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam.2\/YAROhWaAm9LPwCHeP55cD4CKlLC0B4S > > >> > > >> > > >> From here I list all of the objects in the .rgw.buckets pool and grep > > >> for > > >> that said prefix which yields 1335 objects. From here if I cat all of > > >> these > > >> objects together I only end up with a 5468160 byte file which is 2G > > >> short > > >> of > > >> what the object manifest says it should be. If I grab the file and tail > > >> the > > >> Rados gateway log I end up with 1849 objects and when I sum them all up > > >> I > > > > > > How are these objects named? > > > > > >> end up with 7744771642 which is the same size that the manifest reports. > > >> I > > >> understand that this does nothing other than verify the manifests > > >> accuracy > > >> but I still find it interesting. The missing chunks may still exist in > > >> ceph > > >> outside of the object manifest and tagged with the same prefix, correct? > > >> Or > > >> am I misunderstanding something? > > > > > > Either it's missing a chunk, or one of the objects is truncated. Can you > > > stat all the parts? I expect most of the objects to have two different > > > sizes (e.g., 4MB, 1MB), but at it is likely that the last part is > > > smaller, > > > and maybe another object that is missing 512k. > > > > > >> > > >> > > >> We have over 40384 files in the tcga_cghub_protected bucket and only 66 > > >> of > > >> these files are suffering from this truncation issue. What I need to > > >> know > > >> is: is this happening on the gateway side or on the client side? Next I > > >> need > > >> to know what possible actions can occur where the bucket index and the > > >> object manifest would be mismatched like this as 40318 out of 40384 are > > >> working without issue. > > >> > > >> > > >> The truncated files are of all different sizes (5 megabytes - 980 > > >> gigabytes) > > >> and the truncation seems to be all over. By "all over" I mean some files > > >> are > > >> missing the first few bytes that should read "bam" and some are missing > > >> parts in the middle. > > > > > > Can you give an example of an object manifest for a broken object, and > > > all > > > the rados objects that build it (e.g., the output of 'rados stat' on > > > these > > > objects). A smaller object might be easier. > > > > > >> > > >> > > >> So our upload code is using mmap to stream chunks of the file to the > > >> Rados > > >> gateway via a multipart upload but no where on the client side do we > > >> have > > >> a > > >> direct reference to the files we are using nor do we specify the size in > > >> anyway. So where is the gateway getting the correct complete filesize > > >> from > > >> and how is the bucket index showing the intended file size? > > >> > > >> > > >> This implies that, at some point in time, ceph was able to see all of > > >> the > > >> parts of the file and calculate the correct total size. This to me seems > > >> like a rados gateway bug regardless of how the file is being uploaded. I > > >> think that the RGW should be able to be fuzzed and still store the data > > >> correctly. > > >> > > >> > > >> Why is the bucket list not matching the bucket index and how can I > > >> verify > > >> that the data is not being corrupted by the RGW or worse, after it is > > >> committed to ceph ? > > > > > > That's what we're trying to find out. > > > > > > Thanks, > > > Yehuda > > > _______________________________________________ > > > ceph-users mailing list > > > ceph-users@xxxxxxxxxxxxxx > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > _______________________________________________ > ceph-users mailing list > ceph-users@xxxxxxxxxxxxxx > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com