Re: RadosGW problems with copy in s3

2012/3/5 Sławomir Skowron <szibis@xxxxxxxxx>:
> On 5 Mar 2012, at 19:59, Yehuda Sadeh Weinraub
> <yehuda.sadeh@xxxxxxxxxxxxx> wrote:
>
>> On Mon, Mar 5, 2012 at 2:23 AM, Sławomir Skowron
>> <slawomir.skowron@xxxxxxxxx> wrote:
>>> 2012/3/1 Sławomir Skowron <slawomir.skowron@xxxxxxxxx>:
>>>> 2012/2/29 Yehuda Sadeh Weinraub <yehuda.sadeh@xxxxxxxxxxxxx>:
>>>>> On Wed, Feb 29, 2012 at 5:06 AM, Sławomir Skowron
>>>>> <slawomir.skowron@xxxxxxxxx> wrote:
>>>>>>
>>>>>> Ok, it's intentional.
>>>>>>
>>>>>> We check each file's metadata, then the MD5 of its content. In
>>>>>> parallel we update the objects that have changed, then archive
>>>>>> those objects under another key, and finally delete the objects
>>>>>> that have expired.
>>>>>>
>>>>>> This happens over and over, because the site changes frequently.
>>>>>>
>>>>>> For now I have no idea how to work around this problem without
>>>>>> shutting the app down :(
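
One mechanical piece of the cycle described above, the change check, can be sketched in shell. This is an illustration, not the application's actual code: it relies on the fact that for objects uploaded in a single PUT, S3/radosgw exposes the content MD5 as the object's ETag, so a local `md5sum` that differs from the stored ETag marks the object as changed. The filename and content here are made up.

```shell
# Hypothetical local copy of an object; in the real app the stored ETag
# would come from a HEAD request on the corresponding S3 key.
printf 'hello' > page.html

# MD5 of the local content; compare it against the object's ETag to
# decide whether the object needs to be re-uploaded.
md5sum page.html | awk '{print $1}'
# -> 5d41402abc4b2a76b9719d911017c592
```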
>>>>>
>>>>> I looked at your osd log again, and there are other things that don't
>>>>> look right. I'll also need you to turn on 'debug osd = 20' and 'debug
>>>>> filestore = 20'.
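
For reference, a minimal ceph.conf fragment setting the debug levels Yehuda asks for (placed in the `[osd]` section, followed by an OSD restart):

```ini
[osd]
    debug osd = 20
    debug filestore = 20
```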
>>>>
>>>> Attached: almost 10 minutes of osd.24 log at the debug levels above.
>>>>
>>>>> Other than that, I just pushed a workaround that might improve things.
>>>>> It's on the wip-rgw-atomic-no-retry branch on github (based on
>>>>> 0.42.2), so you might want to give it a spin and let us know whether
>>>>> it actually improved things.
>>>>
>>>> OK, I will try it and let you know soon.
>>>
>>> Unfortunately, there is no improvement after upgrading to this version.
>>>
>> It looks like an issue with updating the bucket index, but I'm having
>> trouble confirming it, as the log provided (of osd.24) doesn't contain
>> any relevant operations. If you could provide a log from the relevant
>> osd it may be very helpful.
>>
>> You can find the relevant osd by looking at an operation that took too
>> long, and look for a request like the following:
>>
>> 2012-02-28 20:20:10.944859 7fb1affb7700 -- 10.177.64.6:0/1020439 -->
>> 10.177.64.4:6839/7954 -- osd_op(client.65007.0:587 .dir.3 [call
>> rgw.bucket_prepare_op] 7.ccb26a35) v4 -- ?+0 0xf25270 con 0xbcd1c0
>>
>> It would be easiest looking for the reply to that request as it will
>> contain the osd id (search for a line that contains osd_op_reply and
>> the client.65007.0:587 request id).
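
A concrete way to run that search, assuming the request appears in the client-side (radosgw) log; the log filename is a placeholder, and here it is seeded with the example request line from above so the commands have something to match:

```shell
# Placeholder log; in practice this is your radosgw/client log. Seed it
# with the sample osd_op line quoted above so the greps below can run.
cat > radosgw.log <<'EOF'
2012-02-28 20:20:10.944859 7fb1affb7700 -- 10.177.64.6:0/1020439 --> 10.177.64.4:6839/7954 -- osd_op(client.65007.0:587 .dir.3 [call rgw.bucket_prepare_op] 7.ccb26a35) v4 -- ?+0 0xf25270 con 0xbcd1c0
EOF

# 1. Find the slow bucket-index operation and note its request id
#    (client.65007.0:587 in this example).
grep 'rgw.bucket_prepare_op' radosgw.log

# 2. Find the matching reply; per Yehuda's note, the osd_op_reply line
#    for that request id identifies the osd that handled it. (The seeded
#    sample has no reply line, so this prints nothing.)
grep 'osd_op_reply' radosgw.log | grep 'client.65007.0:587' || true
```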
>>
>> In the meantime, I created issue #2139 for a probable culprit. Having
>> the relevant logs will allow us to verify whether you're hitting that
>> or another issue.
>>
>> Thanks,
>> Yehuda
>
> OK. Because of the time difference, I will try to find this tomorrow
> morning at work. If the logs' verbosity turns out to be insufficient, I
> will start all the OSDs in debug mode, as you wrote earlier, and then
> reproduce the problem.
> I will send the logs as soon as possible.
>
> Regards
> Slawomir Skowron

All logs from osd.24, osd.36, and osd.62, with 'debug osd = 20' and
'debug filestore = 20', from 2012-03-06 10:25 onward:

http://217.144.195.170/ceph/osd.24.log.tar.gz   (348MB) - machine 1 - rack 1
http://217.144.195.170/ceph/osd.36.log.tar.gz   (26MB)  - machine 2 - rack 2
http://217.144.195.170/ceph/osd.62.log.tar.gz   (23MB)  - machine 3 - rack 3

-- 
-----
Regards

Sławek "sZiBis" Skowron
--

