Re: rejoin mds daemon, 17 osd are suddenly down and out

I do notice that (unless mds.a is configured to be a standby in the
config file) you're starting up another MDS and claiming it's the same
as the already-running one. This shouldn't cause the OSDs to crash, but
it might be revealing a bug.
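
For reference, here is a minimal sketch of what I mean by configuring
mds.a as a standby in ceph.conf (I'm going from memory on the option
names for this version, so double-check them against your docs):

    [mds.a]
            mds standby for name = b
            mds standby replay = true

With that, "cmds -i a" should come up as a dedicated (replay) standby
for mds b instead of contending for the active rank.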

On Thu, Jul 14, 2011 at 10:11 AM, Samuel Just <samuelj@xxxxxxxxxxxxxxx> wrote:
> Not sure about the lack of logs; is something rotating them?  There
> could have been a bug that caused the osds to crash, but I'll need the
> logs to hazard a guess as to what caused it.  Starting the mds that way
> should not have killed the osds.  Do the running osds produce logs?
> Logging should default to /var/log/ceph/.
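>
> For reference, something like this in ceph.conf pins the log location
> and turns up osd debugging so that a future crash actually leaves a
> trace (a minimal sketch; verify the option names against your version):
>
>     [global]
>             ; one log per daemon, e.g. /var/log/ceph/osd.0.log
>             log file = /var/log/ceph/$name.log
>     [osd]
>             ; verbose osd and messenger logging for debugging crashes
>             debug osd = 20
>             debug ms = 1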
> -Sam
>
> On 07/13/2011 08:54 PM, AnnyRen wrote:
>>
>> Hi, developers:
>>
>> My environment is 3 mons, 2 mds, and 25 osds, running ceph v0.30.
>> This morning, when I ran "ceph -w", I found that one mds (the standby
>> one) was missing.
>>
>> The mds info should read {mds e42: 1/1/1 up {0=b=up:active}, 1
>> up:standby}, but the standby was gone.
>>
>> So I SSHed to mds1, ran "cmds -i a -c /etc/ceph/ceph.conf", and
>> checked that the mds daemon was running correctly.
>>
>> After that, I found that 17 osds had suddenly gone down and out:
>>
>> ---------------------------------------------------------------------------------------------------------
>> osd0 up   in  weight 1 up_from 139 up_thru 229 down_at 138 last_clean_interval 120-137 192.168.10.10:6800/11191 192.168.10.10:6801/11191 192.168.10.10:6802/11191
>> osd1 down out up_from 141 up_thru 160 down_at 166 last_clean_interval 117-139
>> osd2 up   in  weight 1 up_from 147 up_thru 222 down_at 146 last_clean_interval 125-145 192.168.10.12:6800/10173 192.168.10.12:6801/10173 192.168.10.12:6802/10173
>> osd3 down out up_from 154 up_thru 160 down_at 166 last_clean_interval 130-152
>> osd4 down out up_from 155 up_thru 160 down_at 166 last_clean_interval 130-153
>> osd5 down out up_from 153 up_thru 160 down_at 166 last_clean_interval 134-151
>> osd6 down out up_from 153 up_thru 160 down_at 170 last_clean_interval 135-151
>> osd7 down out up_from 157 up_thru 160 down_at 162 last_clean_interval 133-155
>> osd8 down out up_from 154 up_thru 160 down_at 166 last_clean_interval 134-152
>> osd9 down out up_from 155 up_thru 160 down_at 168 last_clean_interval 133-153
>> osd10 down out up_from 141 up_thru 160 down_at 168 last_clean_interval 118-139
>> osd11 down out up_from 145 up_thru 160 down_at 168 last_clean_interval 119-143
>> osd12 down out up_from 142 up_thru 160 down_at 166 last_clean_interval 123-140
>> osd13 down out up_from 147 up_thru 160 down_at 166 last_clean_interval 123-145
>> osd14 up   in  weight 1 up_from 143 up_thru 223 down_at 142 last_clean_interval 124-141 192.168.10.24:6800/10122 192.168.10.24:6801/10122 192.168.10.24:6802/10122
>> osd15 down out up_from 147 up_thru 160 down_at 164 last_clean_interval 121-145
>> osd16 up   in  weight 1 up_from 148 up_thru 222 down_at 147 last_clean_interval 124-146 192.168.10.26:6800/9881 192.168.10.26:6801/9881 192.168.10.26:6802/9881
>> osd17 up   in  weight 1 up_from 148 up_thru 223 down_at 147 last_clean_interval 122-146 192.168.10.27:6800/9986 192.168.10.27:6801/9986 192.168.10.27:6802/9986
>> osd18 down out up_from 146 up_thru 160 down_at 168 last_clean_interval 124-144
>> osd19 down out up_from 147 up_thru 160 down_at 166 last_clean_interval 125-145
>> osd20 up   in  weight 1 up_from 148 up_thru 222 down_at 147 last_clean_interval 126-146 192.168.10.30:6800/9816 192.168.10.30:6801/9816 192.168.10.30:6802/9816
>> osd21 down out up_from 148 up_thru 160 down_at 168 last_clean_interval 126-146
>> osd22 up   in  weight 1 up_from 149 up_thru 220 down_at 148 last_clean_interval 127-147 192.168.10.32:6800/9640 192.168.10.32:6801/9640 192.168.10.32:6802/9640
>> osd23 down out up_from 153 up_thru 160 down_at 166 last_clean_interval 128-151
>> osd24 up   in  weight 1 up_from 150 up_thru 225 down_at 149 last_clean_interval 132-148 192.168.10.34:6800/10581 192.168.10.34:6801/10581 192.168.10.34:6802/10581
>>
>> ---------------------------------------------------------------------------------------------------------
>>
>> Many pgs are degraded. I SSHed to every down-and-out osd's host to
>> check its log (/var/log/ceph/osd.x.log), but nothing was recorded
>> there... Why did the logs stop recording anything?
>>
>> So I ran "cosd -i x -c /etc/ceph/ceph.conf" on every down osd
>> individually to bring them back up and in, and after that I could
>> read/write files from ceph again....
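>>
>> Roughly, I did something like the loop below (the osd hosts in my
>> cluster follow the 192.168.10.(10+id) pattern you can see in the
>> dump above):
>>
>>     for id in 1 3 4 5 6 7 8 9 10 11 12 13 15 18 19 21 23; do
>>         ssh 192.168.10.$((10 + id)) "cosd -i $id -c /etc/ceph/ceph.conf"
>>     done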
>
> At this point, did the osds go down again?
>>
>> Could anyone tell me what's going on in my environment?
>> 1. Is this an OSD stability problem?
>> 2. Why did ceph unexpectedly stop writing logs?
>>
>>  Thanks a lot!  :)
>

