Re: rejoin mds daemon, 17 osds are suddenly down and out

2011/7/15 Gregory Farnum <gregory.farnum@xxxxxxxxxxxxx>:
> I do notice that (unless mds.a is configured to be a standby in the
> config file) you're starting up another MDS and claiming it's the same
> as the already-running one. This shouldn't cause the OSDc to crash but
> might be revealing a bug.

Thank you for your reply.

How can I configure mds.a to be a standby in ceph.conf?
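
Is it something like the following in ceph.conf? (The standby option names
below are only my guess, and I have not verified them against v0.30, so
please correct me if they are wrong.)

    [mds.a]
            host = mds1
            ; guessed options: make mds.a follow the active mds.b as a standby
            mds standby for name = b
            mds standby replay = true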
>
> On Thu, Jul 14, 2011 at 10:11 AM, Samuel Just <samuelj@xxxxxxxxxxxxxxx> wrote:
>> Not sure about the lack of logs; is something rotating the logs?  There
>> could have been a bug that caused the osds to crash, but I'll need the logs
>> to hazard a guess as to what caused it.  Starting the mds that way should
>> not have killed the osds.  Do the running osds produce logs?  The logging
>> should default to /var/log/ceph/.
>> -Sam
>>
>> On 07/13/2011 08:54 PM, AnnyRen wrote:
>>>
>>> Hi, developers:
>>>
>>> My environment is 3 mons, 2 mds, 25 osds, and ceph version is v0.30.
>>> This morning, I found that one mds (the standby one) was lost when I ran "ceph -w".
>>>
>>> The original mds info should be {mds e42: 1/1/1 up {0=b=up:active}, 1
>>> up:standby},
>>> but I found the standby one missing,
>>>
>>> so I sshed to mds1, ran "cmds -i a -c /etc/ceph/ceph.conf", and checked
>>> that the mds daemon was running correctly.
>>>
>>> After that, I found 17 osds suddenly down and out:
>>>
>>> ---------------------------------------------------------------------------------------------------------
>>> osd0 up   in  weight 1 up_from 139 up_thru 229 down_at 138
>>> last_clean_interval 120-137 192.168.10.10:6800/11191
>>> 192.168.10.10:6801/11191 192.168.10.10:6802/11191
>>> osd1 down out up_from 141 up_thru 160 down_at 166 last_clean_interval
>>> 117-139
>>> osd2 up   in  weight 1 up_from 147 up_thru 222 down_at 146
>>> last_clean_interval 125-145 192.168.10.12:6800/10173
>>> 192.168.10.12:6801/10173 192.168.10.12:6802/10173
>>> osd3 down out up_from 154 up_thru 160 down_at 166 last_clean_interval
>>> 130-152
>>> osd4 down out up_from 155 up_thru 160 down_at 166 last_clean_interval
>>> 130-153
>>> osd5 down out up_from 153 up_thru 160 down_at 166 last_clean_interval
>>> 134-151
>>> osd6 down out up_from 153 up_thru 160 down_at 170 last_clean_interval
>>> 135-151
>>> osd7 down out up_from 157 up_thru 160 down_at 162 last_clean_interval
>>> 133-155
>>> osd8 down out up_from 154 up_thru 160 down_at 166 last_clean_interval
>>> 134-152
>>> osd9 down out up_from 155 up_thru 160 down_at 168 last_clean_interval
>>> 133-153
>>> osd10 down out up_from 141 up_thru 160 down_at 168 last_clean_interval
>>> 118-139
>>> osd11 down out up_from 145 up_thru 160 down_at 168 last_clean_interval
>>> 119-143
>>> osd12 down out up_from 142 up_thru 160 down_at 166 last_clean_interval
>>> 123-140
>>> osd13 down out up_from 147 up_thru 160 down_at 166 last_clean_interval
>>> 123-145
>>> osd14 up   in  weight 1 up_from 143 up_thru 223 down_at 142
>>> last_clean_interval 124-141 192.168.10.24:6800/10122
>>> 192.168.10.24:6801/10122 192.168.10.24:6802/10122
>>> osd15 down out up_from 147 up_thru 160 down_at 164 last_clean_interval
>>> 121-145
>>> osd16 up   in  weight 1 up_from 148 up_thru 222 down_at 147
>>> last_clean_interval 124-146 192.168.10.26:6800/9881
>>> 192.168.10.26:6801/9881 192.168.10.26:6802/9881
>>> osd17 up   in  weight 1 up_from 148 up_thru 223 down_at 147
>>> last_clean_interval 122-146 192.168.10.27:6800/9986
>>> 192.168.10.27:6801/9986 192.168.10.27:6802/9986
>>> osd18 down out up_from 146 up_thru 160 down_at 168 last_clean_interval
>>> 124-144
>>> osd19 down out up_from 147 up_thru 160 down_at 166 last_clean_interval
>>> 125-145
>>> osd20 up   in  weight 1 up_from 148 up_thru 222 down_at 147
>>> last_clean_interval 126-146 192.168.10.30:6800/9816
>>> 192.168.10.30:6801/9816 192.168.10.30:6802/9816
>>> osd21 down out up_from 148 up_thru 160 down_at 168 last_clean_interval
>>> 126-146
>>> osd22 up   in  weight 1 up_from 149 up_thru 220 down_at 148
>>> last_clean_interval 127-147 192.168.10.32:6800/9640
>>> 192.168.10.32:6801/9640 192.168.10.32:6802/9640
>>> osd23 down out up_from 153 up_thru 160 down_at 166 last_clean_interval
>>> 128-151
>>> osd24 up   in  weight 1 up_from 150 up_thru 225 down_at 149
>>> last_clean_interval 132-148 192.168.10.34:6800/10581
>>> 192.168.10.34:6801/10581 192.168.10.34:6802/10581
>>>
>>> ---------------------------------------------------------------------------------------------------------
>>>
>>> Many pgs are degraded. I sshed to every down-and-out osd host to look at the
>>> log (/var/log/ceph/osd.x.log), but there is nothing recorded in the logs...
>>> Why did the logs stop logging anything?
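
For the missing logs, a few things I still plan to check on one of the down
osd hosts (just my own checklist, nothing confirmed yet):

    ls -l /var/log/ceph/                     # are the osd.x.log files there, and when were they last written?
    grep -i "log file" /etc/ceph/ceph.conf   # is logging redirected somewhere other than the default?
    ls /etc/logrotate.d/                     # is anything rotating or truncating the ceph logs?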
>>>
>>> So I ran "cosd -i x -c /etc/ceph/ceph.conf" on every down osd
>>> individually to bring them back up and in, and then I could read/write
>>> files from ceph again....
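
For reference, the same restart could be scripted roughly as below; the
osdN -> 192.168.10.(10+N) host mapping is only inferred from the addresses
in the dump above, so adjust it for your own cluster:

    # restart the 17 osds reported "down out" in the dump above
    for id in 1 3 4 5 6 7 8 9 10 11 12 13 15 18 19 21 23; do
            ssh 192.168.10.$((10 + id)) "cosd -i $id -c /etc/ceph/ceph.conf"
    done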
>>
>> At this point, did the osds go down again?
>>>
>>> Could anyone tell me what's going on in my environment?
>>> 1. An OSD stability problem?
>>> 2. Ceph unexpectedly not writing any logs?
>>>
>>>  Thanks a lot!  :)

