Re: rejoin mds daemon, 17 osd are suddenly down and out

Not sure about the lack of logs; is something rotating them? There could have been a bug that caused the osds to crash, but I'll need the logs to hazard a guess as to what caused it. Starting the mds that way should not have killed the osds. Do the running osds still produce logs? Logging should default to /var/log/ceph/.
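
For example, a quick way to check whether something is rotating or
truncating the osd logs (a rough sketch, assuming a standard package
install and the default /var/log/ceph/ path):

  # Any logrotate rule touching the ceph logs, and when they were last written
  ls -l /etc/logrotate.d/ /var/log/ceph/
  # Confirm each running osd still has its log file open
  for pid in $(pidof cosd); do lsof -p $pid | grep /var/log/ceph; done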
-Sam

On 07/13/2011 08:54 PM, AnnyRen wrote:
Hi, developers:

My environment has 3 mons, 2 mds, and 25 osds, running ceph v0.30.
This morning I noticed that one mds (the standby one) was missing when I ran "ceph -w".

The mds line should read  {mds e42: 1/1/1 up {0=b=up:active}, 1
up:standby},
but the standby one was gone.

So I sshed to mds1, ran "cmds -i a -c /etc/ceph/ceph.conf", and checked
that the mds daemon was running correctly.
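
For reference, a rough way to confirm the standby came back without
watching "ceph -w" (a sketch, assuming the monitors are reachable from
that host):

  # Ask the monitors for the current mds map
  ceph mds stat
  # Alternatively, starting the daemon through the init script picks up
  # its options from ceph.conf (assuming a stock mkcephfs-style install)
  /etc/init.d/ceph start mds.a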

After that, I found 17 osds suddenly marked down and out:
---------------------------------------------------------------------------------------------------------
osd0 up   in  weight 1 up_from 139 up_thru 229 down_at 138
last_clean_interval 120-137 192.168.10.10:6800/11191
192.168.10.10:6801/11191 192.168.10.10:6802/11191
osd1 down out up_from 141 up_thru 160 down_at 166 last_clean_interval 117-139
osd2 up   in  weight 1 up_from 147 up_thru 222 down_at 146
last_clean_interval 125-145 192.168.10.12:6800/10173
192.168.10.12:6801/10173 192.168.10.12:6802/10173
osd3 down out up_from 154 up_thru 160 down_at 166 last_clean_interval 130-152
osd4 down out up_from 155 up_thru 160 down_at 166 last_clean_interval 130-153
osd5 down out up_from 153 up_thru 160 down_at 166 last_clean_interval 134-151
osd6 down out up_from 153 up_thru 160 down_at 170 last_clean_interval 135-151
osd7 down out up_from 157 up_thru 160 down_at 162 last_clean_interval 133-155
osd8 down out up_from 154 up_thru 160 down_at 166 last_clean_interval 134-152
osd9 down out up_from 155 up_thru 160 down_at 168 last_clean_interval 133-153
osd10 down out up_from 141 up_thru 160 down_at 168 last_clean_interval 118-139
osd11 down out up_from 145 up_thru 160 down_at 168 last_clean_interval 119-143
osd12 down out up_from 142 up_thru 160 down_at 166 last_clean_interval 123-140
osd13 down out up_from 147 up_thru 160 down_at 166 last_clean_interval 123-145
osd14 up   in  weight 1 up_from 143 up_thru 223 down_at 142
last_clean_interval 124-141 192.168.10.24:6800/10122
192.168.10.24:6801/10122 192.168.10.24:6802/10122
osd15 down out up_from 147 up_thru 160 down_at 164 last_clean_interval 121-145
osd16 up   in  weight 1 up_from 148 up_thru 222 down_at 147
last_clean_interval 124-146 192.168.10.26:6800/9881
192.168.10.26:6801/9881 192.168.10.26:6802/9881
osd17 up   in  weight 1 up_from 148 up_thru 223 down_at 147
last_clean_interval 122-146 192.168.10.27:6800/9986
192.168.10.27:6801/9986 192.168.10.27:6802/9986
osd18 down out up_from 146 up_thru 160 down_at 168 last_clean_interval 124-144
osd19 down out up_from 147 up_thru 160 down_at 166 last_clean_interval 125-145
osd20 up   in  weight 1 up_from 148 up_thru 222 down_at 147
last_clean_interval 126-146 192.168.10.30:6800/9816
192.168.10.30:6801/9816 192.168.10.30:6802/9816
osd21 down out up_from 148 up_thru 160 down_at 168 last_clean_interval 126-146
osd22 up   in  weight 1 up_from 149 up_thru 220 down_at 148
last_clean_interval 127-147 192.168.10.32:6800/9640
192.168.10.32:6801/9640 192.168.10.32:6802/9640
osd23 down out up_from 153 up_thru 160 down_at 166 last_clean_interval 128-151
osd24 up   in  weight 1 up_from 150 up_thru 225 down_at 149
last_clean_interval 132-148 192.168.10.34:6800/10581
192.168.10.34:6801/10581 192.168.10.34:6802/10581
---------------------------------------------------------------------------------------------------------
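
To pull just the down osds out of that dump rather than reading the whole
map (a sketch; on v0.30 the dump subcommands may still want an explicit
output argument, hence the "-o -"):

  # List only the osds currently marked down, then the overall cluster state
  ceph osd dump -o - | grep down
  ceph -s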

Many pgs are degraded. I sshed to every down-and-out osd host to check its
log (/var/log/ceph/osd.x.log), but nothing was recorded in the logs...
Why did the logs stop recording anything?
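
Some things worth checking on a host whose osd log went quiet (a rough
checklist, assuming Debian-style log locations; none of this is specific
to ceph):

  # Is the osd process even still running?
  ps aux | grep [c]osd
  # Did the kernel kill it (OOM) or did it segfault?
  dmesg | tail -n 50
  grep -i 'oom\|segfault\|cosd' /var/log/kern.log /var/log/syslog 2>/dev/null | tail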

So I run "cosd -i x -c /etc/ceph/ceph.conf" on every down osd
individually to make them up and in, and then I can read/write files
form ceph....
At this point, did the osds go down again?
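
Either way, the init script can restart every osd defined for the local
host in one go, and "ceph -w" will show whether they stay up afterwards
(a rough sketch, assuming a stock /etc/init.d/ceph install on each host):

  # On each affected host: start all local osds listed in ceph.conf
  /etc/init.d/ceph start osd
  # Then watch for them flapping back down and for pg recovery
  ceph -w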

Could anyone tell me what is going on in my environment?
1. Is this an OSD stability problem?
2. Why did Ceph unexpectedly stop writing logs?
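
If it happens again, it may help to have the osds logging at a higher
debug level ahead of time, so there is something in the log when they
die (a sketch for the [osd] section of /etc/ceph/ceph.conf; these
settings are verbose):

  [osd]
      ; remember to turn these back down afterwards
      debug osd = 20
      debug ms = 1
      debug filestore = 20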

  Thanks a lot!  :)
