Hi Sam,

We have run into this problem many times recently. We are using Ceph v0.30 on
Linux 2.6.37, and we built a cluster with 20 OSDs, but not all of the OSDs are
started at once. At first we start only 10 OSDs, then we use 10 kernel clients
to write 20 GB of data in total. Next we start the other 10 OSDs with
"/etc/init.d/ceph start" (almost at the same time), and unusual things happen:
a few OSDs go down, and the pg dump shows many PGs in the crashed state. The
OSD debug log is attached as "osd.17.log".

We have some questions:

1) Did your test team measure recovery performance while adding 10 or more
   OSDs concurrently? If so, did everything work well?

2) Was the transition into the "Crashed" state made from the "Initial" state
   or from the "Started" state? We think it was the Initial state; we can see
   the machine exit the Started state.

3) Which events drive the state machine into Crashed? And what does
   "inner_constructor<boost::mpl::l_item<mpl_::long_<1l>, PG::RecoveryState::Crashed, boost::mpl::l_end>"
   mean here? We cannot find any documentation about it. (A small sketch of
   what we suspect is happening is appended after the quoted thread below.)

Thanks in advance!

2011/7/12 Samuel Just <samuelj@xxxxxxxxxxxxxxx>:
> The messenger errors probably indicate that that OSD's peers are down.
> The boost errors are a result of the OSD receiving a log message in the
> GetInfo state. This indicates a bug in the peering state machine. Is
> there a way that you could get us more complete logs? I need an idea
> of what happened to cause the erroneous log message.
>
> Thanks!
> -Sam
>
> On 07/11/2011 07:26 AM, huang jun wrote:
>>
>> hi, all
>> I am using Ceph v0.30 with 31 OSDs on Linux 2.6.37.
>> After I set up the whole cluster, many (10) OSDs went down because their
>> cosd processes were killed; we can provide the OSD log in the attached
>> file "osd-failed".
>>
>> The same thing happened once a week ago. That time we fixed it by simply
>> rebuilding the cluster, but this time we do not want to take that route;
>> we want to find out what causes this failure.
>> Why does the simplemessenger keep sending RESETSESSION?
>> What makes the boost recovery state machine fail? Can you give us some
>> constructive advice?
>>
>> thanks in advance
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
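P.S. Regarding question 3: below is a minimal, self-contained boost::statechart
sketch, not Ceph's actual PG.h code, of the pattern we suspect from the
backtrace. All names in it (Machine, Started, Done, Crashed, EvOk, EvBogus) are
made up for illustration. The idea is a catch-all reaction
transition< event_base, Crashed > in a normal state: any event that no other
reaction handles constructs the Crashed state, whose constructor asserts. If
PG::RecoveryState works the same way, a log message arriving in a state that
does not expect it (GetInfo, as Sam described) would construct Crashed, and
statechart's internal inner_constructor<...> template would appear in the
backtrace while building that state.

#include <cassert>
#include <boost/mpl/list.hpp>
#include <boost/statechart/event.hpp>
#include <boost/statechart/event_base.hpp>
#include <boost/statechart/simple_state.hpp>
#include <boost/statechart/state_machine.hpp>
#include <boost/statechart/transition.hpp>

namespace sc = boost::statechart;

// Two events: one the machine expects, one it does not.
struct EvOk    : sc::event<EvOk> {};
struct EvBogus : sc::event<EvBogus> {};

struct Started;   // initial state
struct Done;      // normal target state
struct Crashed;   // "something went wrong" state

struct Machine : sc::state_machine<Machine, Started> {};

struct Started : sc::simple_state<Started, Machine> {
  typedef boost::mpl::list<
    sc::transition<EvOk, Done>,               // the event we expect
    sc::transition<sc::event_base, Crashed>   // catch-all: any other event
  > reactions;
};

struct Done : sc::simple_state<Done, Machine> {};

struct Crashed : sc::simple_state<Crashed, Machine> {
  Crashed() {
    // Constructing this state is the failure: the backtrace shows
    // statechart building Crashed via its inner_constructor<...> helper.
    assert(0 == "state machine received an event it did not expect");
  }
};

int main() {
  Machine m;
  m.initiate();                // enter Started
  // m.process_event(EvOk());  // would take the normal Started -> Done path
  m.process_event(EvBogus());  // no specific reaction: the event_base
                               // catch-all fires, Crashed is constructed,
                               // and its assert aborts the process
  return 0;
}

This is only our reading of the backtrace; please correct us if
PG::RecoveryState reaches Crashed through a different mechanism.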
Attachment:
osd.17.log
Description: Binary data