Re: RGW -- 404 on keys in bucket.list() thousands of multipart ids listed as well.

Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> · Wed, 20 Jan 2016 11:30:14 -0800

On Wed, Jan 20, 2016 at 10:43 AM, seapasulli@xxxxxxxxxxxx
<seapasulli@xxxxxxxxxxxx> wrote:
>
>
> On 1/19/16 4:00 PM, Yehuda Sadeh-Weinraub wrote:
>>
>> On Fri, Jan 15, 2016 at 5:04 PM, seapasulli@xxxxxxxxxxxx
>> <seapasulli@xxxxxxxxxxxx> wrote:
>>>
>>> I have looked all over and I do not see any explicit mention of
>>> "NWS_NEXRAD_NXL2DP_PAKC_20150101110000_20150101115959" in the logs nor do
>>> I
>>> see a timestamp from November 4th although I do see log rotations dating
>>> back to october 15th. I don't think it's possible it wasn't logged so I
>>> am
>>> going through the bucket logs from the 'radosgw-admin log show --object'
>>> side and I found the following::
>>>
>>> 4604932         {
>>> 4604933             "bucket": "noaa-nexrad-l2",
>>> 4604934             "time": "2015-11-04 21:29:27.346509Z",
>>> 4604935             "time_local": "2015-11-04 15:29:27.346509",
>>> 4604936             "remote_addr": "",
>>> 4604937             "object_owner": "b05f707271774dbd89674a0736c9406e",
>>> 4604938             "user": "b05f707271774dbd89674a0736c9406e",
>>> 4604939             "operation": "PUT",
>>
>> I'd expect a multipart upload completion to be done with a POST, not a
>> PUT.
>
> Indeed it seems really weird.
>>
>>
>>> 4604940             "uri":
>>>
>>> "\/noaa-nexrad-l2\/2015\/01\/01\/PAKC\/NWS_NEXRAD_NXL2DP_PAKC_20150101110000_20150101115959.tar",
>>> 4604941             "http_status": "200",
>>> 4604942             "error_code": "",
>>> 4604943             "bytes_sent": 19,
>>> 4604944             "bytes_received": 0,
>>> 4604945             "object_size": 0,
>>
>> Do you see a zero object_size for other multipart uploads?
>
> I think so. I still don't know how to tell for certain if a radosgw object
> is a multipart object or not. I think all of the objects in noaa-nexrad-l2
> bucket are multipart::
>
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out-        {
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "bucket":
> "noaa-nexrad-l2",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "time": "2015-10-16
> 19:49:30.579738Z",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "time_local":
> "2015-10-16 14:49:30.579738",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "remote_addr": "",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "user":
> "b05f707271774dbd89674a0736c9406e",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out: "operation": "POST",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "uri":
> "\/noaa-nexrad-l2\/2015\/01\/13\/KGRK\/NWS_NEXRAD_NXL2DP_KGRK_20150113040000_20150113045959.tar",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "http_status": "200",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "error_code": "",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "bytes_sent": 331,
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "bytes_received": 152,
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "object_size": 0,
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "total_time": 0,
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "user_agent":
> "Boto\/2.38.0 Python\/2.7.7 Linux\/2.6.32-573.7.1.el6.x86_64",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "referrer": ""
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out-        }
>
> The objects above (NWS_NEXRAD_NXL2DP_KGRK_20150113040000_20150113045959.tar)
> pulls down without an issue though. Below is a paste for object
> "NWS_NEXRAD_NXL2DP_KVBX_20150225160000_20150225165959.tar" which 404's::
> http://pastebin.com/Jtw8z7G4

Sadly the log doesn't provide all the input, but I can guess what the
operations were:

 - POST (init multipart upload)
 - PUT (upload part)
 - GET (list parts)
 - POST (complete multipart) <-- took > 57 seconds to process
 - POST (complete multipart)
 - HEAD (stat object)

For some reason the complete multipart operation took too long, which
I think triggered a client retry (either that, or an abort). Then
there were two completions racing (or a complete and abort), which
might have caused the issue we're seeing for some reason. E.g., two
completions might have ended up with the second completion noticing
that it's overwriting an existing object (that we just created),
sending the 'old' object to be garbage collected, when that object's
tail is actually its own tail.

>
> I see two posts one recorded a minute before for this object both with 0
> size though. Does this help at all?

Yes, very much

Thanks,
Yehuda
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com