Hey Yehuda,

I work with Sean on the dev side. We thought we should put together a short report on what we've been seeing, in the hopes that the behavior might make some sense to you.

We had originally noticed these issues a while ago with our first iteration of this particular Ceph deployment. The issues we had seen were characterized by two different behaviors:

• Some objects would appear truncated, returning different sizes for each request. Repeated attempts would eventually result in a successful retrieval if the second behavior didn't apply.
• Some objects would always appear truncated, missing an integer multiple of 512KB.

This is where the report that we are encountering 'truncation' came from, which is slightly misleading. We recently verified that we are indeed encountering the first behavior, which is true truncation; I believe Sean has supplied, or will be supplying, Ceph logs showing the server-side errors. The second behavior, however, is not really truncation but missing 512KB chunks, as Sean has brought up.

We've had some luck identifying patterns that seem related to this issue. Without going into too much detail, we've found the following to hold true for all objects affected by the second behavior:

• The amount of data missing is always an integer multiple of 512KB.
• The expected file size is always found via the bucket index.
• Ceph objects do not appear to be missing chunks or have holes in them.
• The missing 512KB chunks are always at the beginning of multipart segments (1GB in our case).
• For large files missing multiple chunks, the affected segments appear to be clustered and contiguous.

The first pattern was identified when we noticed that the bucket index and the object manifest differed in reported size. This is useful as a quick method of identifying affected objects, and we've used it to avoid having to pull down and check each object individually (a rough sketch of the check is at the bottom of this mail, below the quoted thread). In total, we have 108 affected objects, which translates to approximately 0.25% of our S3 objects.

We noticed that the bucket index always reports the object size that would be expected had the upload gone correctly. Since we only ever report the segment sizes to the gateway, this suggests that the segment sizes were reported accurately and aggregated correctly server side.

Sean identified the Ceph objects that compose one of our affected S3 objects. We thought we might see the first Ceph object missing some data, but found it to be a full 4MB. Retrieving the first Ceph object and comparing it to the bytes of the corresponding file, the Ceph object appears to match the 4MB of the file starting after the first 512KB. We took this as evidence that the data was never getting to Ceph in the first place. However, in our testing, we were unable to get the gateway to accept segments with less data than reported.

Dissecting some of the affected objects, we were able to identify the positions of the missing 512KB chunks. The positions align with the starting positions of 1GB segments, matching our multipart segment length. Additionally, for files with a large number of missing chunks, the affected segments appeared to cluster (e.g. segments 161-163 and 170-189 in a ~220-segment file). However, there did not appear to be a pattern as to which segments were affected.

Some of our testing:

One of our first thoughts was that it might be only specific files, perhaps particular file sizes, that caused the behavior to manifest.
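For reference, the shape of that test was roughly the following. This is illustrative rather than a copy of what we ran: the bucket and paths are made up, s3cmd (with 1GB parts) stands in for our actual mmap-based client, and the "size" field we grep out of the radosgw-admin output may be named slightly differently depending on version.

# Made-up test bucket/paths; s3cmd is a stand-in for our real uploader,
# using the same 1GB multipart segment size we use in production.
bucket=rgw-upload-test
file=/data/retest/sample.bam
key=retest/sample.bam

for i in 1 2 3; do
    s3cmd put "$file" "s3://$bucket/$key" --multipart-chunk-size-mb=1024

    # compare the size the gateway reports against the local file
    local_size=$(stat -c%s "$file")
    rgw_size=$(radosgw-admin object stat --bucket="$bucket" --object="$key" \
        | grep -m1 '"size"' | grep -oE '[0-9]+')
    echo "attempt $i: local=$local_size rgw=$rgw_size"
done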
We generated fresh copies of the files in question and re-uploaded them without error. We repeatedly uploaded the same object and always received positive results.

We next asked whether this could be connection related. We simulated a connection drop by killing the upload processes early. This resulted in uncompleted multipart uploads, as expected. Completing these resulted in truncated objects with a mismatch between the bucket index and the object manifest. This is the only test that resulted in the size-mismatch behavior; however, the objects were missing data from the end of the file, specifically the last segment - actual truncation, versus the missing 512KB at the beginning of segments.

We then attempted to forge invalid multipart requests to replicate the behavior under the assumption of a misbehaving client. Providing data that differed from what was used to produce the MD5 digest always resulted in a BadDigest error. Forging the Content-Length header while providing the correct MD5 digest, to simulate a portion of a file not being sent, caused a short hang and then a PermissionDenied error. In no case were we able to get a positive response from the gateway.

What is confusing about all of this is that the bucket index is correct. That is, the index has the size expected had the upload completed correctly. This would suggest that the gateway correctly inferred the full size from the segment sizes, which means our upload correctly reported the full file size via the segment sizes. But every time we've attempted to send less data than is reported to the gateway, as would be the case if the first 512KB were never sent, we encounter an error.

We're still working to track down where the issue is. We're hoping it's just something in the client that hiccuped, but any insight into these behaviors from a gateway perspective would be greatly appreciated.

Thanks,
Mark

> On May 7, 2015, at 12:09 AM, Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote:
>
> ----- Original Message -----
>> From: "Sean" <seapasulli@xxxxxxxxxxxx>
>> To: "Yehuda Sadeh-Weinraub" <yehuda@xxxxxxxxxx>
>> Cc: ceph-users@xxxxxxxxxxxxxx
>> Sent: Tuesday, May 5, 2015 12:14:19 PM
>> Subject: Re: Civet RadosGW S3 not storing complete obects; civetweb logs stop after rotation
>>
>> Hello Yehuda and the rest of the mailing list.
>>
>> My main question currently is why are the bucket index and the object
>> manifest ever different? Based on how we are uploading data I do not think
>> that the rados gateway should ever know the full file size without having
>> all of the objects within ceph at one point in time. So after the multipart
>> is marked as completed the Rados gateway should cat through all of the objects
>> and make a complete part, correct?
>
> That's what *should* happen, but obviously there's some bug there.
>
>> Secondly,
>>
>> I think I am not understanding the process to grab all of the parts
>> correctly. To continue to use my example file
>> "86b6fad8-3c53-465f-8758-2009d6df01e9/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam"
>> in bucket tcga_cghub_protected, I would be using the following to grab the
>> prefix:
>>
>> prefix=$(radosgw-admin object stat --bucket=tcga_cghub_protected
>> --object=86b6fad8-3c53-465f-8758-2009d6df01e9/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam
>> | grep -iE '"prefix"' | awk -F"\"" '{print $4}')
>>
>> Which should take everything between quotes for the prefix key and give me
>> the value.
>>
>> In this case:
>>
>> "prefix":
>> "86b6fad8-3c53-465f-8758-2009d6df01e9\/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam.2\/YAROhWaAm9LPwCHeP55cD4CKlLC0B4S",
>>
>> So
>>
>> lacadmin@kh10-9:~$ echo ${prefix}
>> 86b6fad8-3c53-465f-8758-2009d6df01e9\/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam.2\/YAROhWaAm9LPwCHeP55cD4CKlLC0B4S
>>
>> From here I list all of the objects in the .rgw.buckets pool and grep for
>> said prefix, which yields 1335 objects. From here, if I cat all of these
>> objects together I only end up with a 5468160 byte file, which is 2G short of
>> what the object manifest says it should be. If I grab the file and tail the
>> Rados gateway log I end up with 1849 objects, and when I sum them all up I
>
> How are these objects named?
>
>> end up with 7744771642, which is the same size that the manifest reports. I
>> understand that this does nothing other than verify the manifest's accuracy,
>> but I still find it interesting. The missing chunks may still exist in ceph
>> outside of the object manifest and tagged with the same prefix, correct? Or
>> am I misunderstanding something?
>
> Either it's missing a chunk, or one of the objects is truncated. Can you stat all the parts? I expect most of the objects to have two different sizes (e.g., 4MB, 1MB), but it is likely that the last part is smaller, and maybe another object is missing 512k.
>
>> We have over 40384 files in the tcga_cghub_protected bucket and only 66 of
>> these files are suffering from this truncation issue. What I need to know
>> is: is this happening on the gateway side or on the client side? Next, I need
>> to know what possible actions can occur where the bucket index and the
>> object manifest would be mismatched like this, as 40318 out of 40384 are
>> working without issue.
>>
>> The truncated files are of all different sizes (5 megabytes - 980 gigabytes)
>> and the truncation seems to be all over. By "all over" I mean some files are
>> missing the first few bytes that should read "bam" and some are missing
>> parts in the middle.
>
> Can you give an example of an object manifest for a broken object, and all the rados objects that build it (e.g., the output of 'rados stat' on these objects)? A smaller object might be easier.
>
>> So our upload code is using mmap to stream chunks of the file to the Rados
>> gateway via a multipart upload, but nowhere on the client side do we have a
>> direct reference to the files we are using, nor do we specify the size in
>> any way. So where is the gateway getting the correct, complete file size from,
>> and how is the bucket index showing the intended file size?
>>
>> This implies that, at some point in time, ceph was able to see all of the
>> parts of the file and calculate the correct total size. This to me seems
>> like a rados gateway bug regardless of how the file is being uploaded. I
>> think that the RGW should be able to be fuzzed and still store the data
>> correctly.
>>
>> Why is the bucket list not matching the bucket index, and how can I verify
>> that the data is not being corrupted by the RGW or, worse, after it is
>> committed to ceph?
>
> That's what we're trying to find out.
>
> Thanks,
> Yehuda
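P.S. Two quick sketches, since they're referenced above and might save someone some typing. First, the bucket-index vs. object-manifest size comparison we use to flag affected objects. The grep-based version below is only to show the idea, not exactly what we ran, and the JSON field names (and whether --max-entries is needed to cover a large bucket) may differ between radosgw versions:

bucket=tcga_cghub_protected
object="86b6fad8-3c53-465f-8758-2009d6df01e9/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam"

# size recorded in the bucket index entry for this object
radosgw-admin bucket list --bucket="$bucket" --max-entries=100000 \
    | grep -A 10 "\"name\": \"$object\"" | grep -m1 '"size"'

# size according to the head object / manifest
radosgw-admin object stat --bucket="$bucket" --object="$object" \
    | grep -E '"(size|obj_size)"'

Second, to follow up on the suggestion above about stat'ing all the parts: something along these lines should cover it. Treat it as a sketch rather than something we have verified end to end, with $prefix extracted exactly as in Sean's command. With 4MB stripes we expect the parts to be mostly 4194304 bytes plus some smaller tail pieces, so an unexpected size (or a part that is 512KB short) should stand out:

pool=.rgw.buckets

# enumerate the rados objects behind the S3 object, as Sean did
rados -p "$pool" ls | grep "$prefix" | sort > /tmp/parts.txt

# stat each part; the size is the last field of 'rados stat' output
while read -r obj; do
    rados -p "$pool" stat "$obj"
done < /tmp/parts.txt > /tmp/part_stats.txt

# tally the distinct part sizes we see
awk '{print $NF}' /tmp/part_stats.txt | sort -n | uniq -c

# total of all part sizes, to compare against the manifest / bucket index
awk '{sum += $NF} END {print sum}' /tmp/part_stats.txt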