2011/7/26 Sage Weil <sage@xxxxxxxxxxxx>:
> On Tue, 26 Jul 2011, Christian Brunner wrote:
>> OK I've solved this by myself.
>>
>> Since I knew that there is replication between
>>
>> osd001 and osd005,
>>
>> as well as
>>
>> osd001 and osd015,
>> osd001 and osd012,
>>
>> I decided to take osd005, osd012 and osd015 offline. After that ceph
>> started to rebuild the PGs on other nodes.
>
> At the same time you mean?  Or just restarted them?

At the same time.

> The usual way to debug these situations is:
>
>  - identify a stuck pg
>  - figure out what osds it maps to.  [15,1]
>  - turn on logs on those nodes:
>        ceph osd tell 15 injectargs '--debug-osd 20 --debug-ms 1'
>        ceph osd tell 1 injectargs '--debug-osd 20 --debug-ms 1'
>  - restart peering by toggling the primary (first osd, 15)
>        ceph osd down 15
>  - send us the resulting logs (for all nodes)
>
> Even better if you also include other (old) osds that include pg data
> (osd1 in your case) in this.
>
> We definitely want to fix the core issue, so any help gathering the logs
> would be appreciated!  It's also possible that the above will 'fix' it
> because the peering issue is hard to hit.  In that case, cranking up the
> debug level after the initial crash but before you restart everything
> might be a good idea.

I will turn on debugging next time. I think it is possible to hit the
issue when an osd that is the destination of a rebuild fails while the
rebuild is in progress. But I have not verified this.

Regards,
Christian
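
For reference, the log-gathering steps Sage outlines above can be wrapped in a
small shell script. This is only a minimal sketch of that procedure, not an
official tool: it assumes the stuck PG's acting set is [15,1] as in this
thread, and the osd ids and log location are placeholders you would adjust for
your own cluster.

    #!/bin/sh
    # Sketch of the log-gathering steps from Sage's mail, assuming the
    # stuck pg maps to osds [15,1] with 15 as the primary.

    PRIMARY=15          # first osd in the acting set
    OSDS="15 1"         # all osds the stuck pg maps to (plus old osds
                        # that still hold pg data, e.g. osd1 here)

    # Crank up osd and messenger debugging on the involved osds.
    for id in $OSDS; do
        ceph osd tell "$id" injectargs '--debug-osd 20 --debug-ms 1'
    done

    # Restart peering by marking the primary down; the running daemon
    # will be marked up again and re-peer, this time with verbose logs.
    ceph osd down "$PRIMARY"

    # Afterwards, collect the osd logs (typically under /var/log/ceph/)
    # from each of the involved nodes and send them in.

Note that "ceph osd down" only marks the osd down in the osdmap; the daemon
keeps running, reports itself, and is marked up again, which is what
re-triggers peering with the higher debug level in effect.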