Hi, developers:

My environment is 3 mons, 2 mds, and 25 osds, running ceph v0.30.

This morning I noticed that one mds (the standby one) was missing when I ran "ceph -w". The mds line normally reads

    mds e42: 1/1/1 up {0=b=up:active}, 1 up:standby

but the standby was gone, so I ssh'd into MDS1, ran "cmds -i a -c /etc/ceph/ceph.conf", and checked that the mds daemon was running correctly.

Shortly after that, I found 17 osds suddenly down and out:

---------------------------------------------------------------------------------------------------------
osd0 up in weight 1 up_from 139 up_thru 229 down_at 138 last_clean_interval 120-137 192.168.10.10:6800/11191 192.168.10.10:6801/11191 192.168.10.10:6802/11191
osd1 down out up_from 141 up_thru 160 down_at 166 last_clean_interval 117-139
osd2 up in weight 1 up_from 147 up_thru 222 down_at 146 last_clean_interval 125-145 192.168.10.12:6800/10173 192.168.10.12:6801/10173 192.168.10.12:6802/10173
osd3 down out up_from 154 up_thru 160 down_at 166 last_clean_interval 130-152
osd4 down out up_from 155 up_thru 160 down_at 166 last_clean_interval 130-153
osd5 down out up_from 153 up_thru 160 down_at 166 last_clean_interval 134-151
osd6 down out up_from 153 up_thru 160 down_at 170 last_clean_interval 135-151
osd7 down out up_from 157 up_thru 160 down_at 162 last_clean_interval 133-155
osd8 down out up_from 154 up_thru 160 down_at 166 last_clean_interval 134-152
osd9 down out up_from 155 up_thru 160 down_at 168 last_clean_interval 133-153
osd10 down out up_from 141 up_thru 160 down_at 168 last_clean_interval 118-139
osd11 down out up_from 145 up_thru 160 down_at 168 last_clean_interval 119-143
osd12 down out up_from 142 up_thru 160 down_at 166 last_clean_interval 123-140
osd13 down out up_from 147 up_thru 160 down_at 166 last_clean_interval 123-145
osd14 up in weight 1 up_from 143 up_thru 223 down_at 142 last_clean_interval 124-141 192.168.10.24:6800/10122 192.168.10.24:6801/10122 192.168.10.24:6802/10122
osd15 down out up_from 147 up_thru 160 down_at 164 last_clean_interval 121-145
osd16 up in weight 1 up_from 148 up_thru 222 down_at 147 last_clean_interval 124-146 192.168.10.26:6800/9881 192.168.10.26:6801/9881 192.168.10.26:6802/9881
osd17 up in weight 1 up_from 148 up_thru 223 down_at 147 last_clean_interval 122-146 192.168.10.27:6800/9986 192.168.10.27:6801/9986 192.168.10.27:6802/9986
osd18 down out up_from 146 up_thru 160 down_at 168 last_clean_interval 124-144
osd19 down out up_from 147 up_thru 160 down_at 166 last_clean_interval 125-145
osd20 up in weight 1 up_from 148 up_thru 222 down_at 147 last_clean_interval 126-146 192.168.10.30:6800/9816 192.168.10.30:6801/9816 192.168.10.30:6802/9816
osd21 down out up_from 148 up_thru 160 down_at 168 last_clean_interval 126-146
osd22 up in weight 1 up_from 149 up_thru 220 down_at 148 last_clean_interval 127-147 192.168.10.32:6800/9640 192.168.10.32:6801/9640 192.168.10.32:6802/9640
osd23 down out up_from 153 up_thru 160 down_at 166 last_clean_interval 128-151
osd24 up in weight 1 up_from 150 up_thru 225 down_at 149 last_clean_interval 132-148 192.168.10.34:6800/10581 192.168.10.34:6801/10581 192.168.10.34:6802/10581
---------------------------------------------------------------------------------------------------------

Many pgs are now degraded. I ssh'd to every down-and-out osd host to check its log (/var/log/ceph/osd.x.log), but nothing was recorded there... Why did the logs stop logging anything?
So I run "cosd -i x -c /etc/ceph/ceph.conf" on every down osd individually to make them up and in, and then I can read/write files form ceph.... Could anyone tell me what's going on in my environment? 1. OSD Stability problem? 2. Ceph didn't write logs unexpectedly. Thanks a lot! :)
My ceph.conf:

[global]
        pid file = /var/run/ceph/$name.pid
        ; debug ms = 1
        ; enable secure authentication
        ; auth supported = cephx

[mon]
        mon data = /mon/mon$id
        debug mon = 0

[mon.a]
        host = MON1
        mon addr = 192.168.10.1:6789
[mon.b]
        host = MON2
        mon addr = 192.168.10.2:6789
[mon.c]
        host = MON3
        mon addr = 192.168.10.3:6789

[mds]
        debug mds = 0

[mds.a]
        host = MDS1
[mds.b]
        host = MDS2

[osd]
        osd data = /mnt/ext4/osd$id
        osd journal = /data/osd$id/journal
        osd journal size = 512    ; journal size, in megabytes
        filestore btrfs snap = false
        filestore fsync flushes journal data = true
        debug osd = 0

[osd.0]
        host = OSD1
[osd.1]
        host = OSD2
[osd.2]
        host = OSD3
[osd.3]
        host = OSD4
[osd.4]
        host = OSD5
[osd.5]
        host = OSD6
[osd.6]
        host = OSD7
[osd.7]
        host = OSD8
[osd.8]
        host = OSD9
[osd.9]
        host = OSD10
[osd.10]
        host = OSD11
[osd.11]
        host = OSD12
[osd.12]
        host = OSD13
[osd.13]
        host = OSD14
[osd.14]
        host = OSD15
[osd.15]
        host = OSD16
[osd.16]
        host = OSD17
[osd.17]
        host = OSD18
[osd.18]
        host = OSD19
[osd.19]
        host = OSD20
[osd.20]
        host = OSD21
[osd.21]
        host = OSD22
[osd.22]
        host = OSD23
[osd.23]
        host = OSD24
[osd.24]
        host = OSD25
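P.S. If more verbose logging would help with debugging this, I'm thinking of changing the [osd] section to something like the following before it happens again. This is just a sketch on my part; the debug levels are guesses, and I'm not completely sure these are the right option names for v0.30:

    [osd]
            ; sketch only: crank up debug output so the next incident is captured
            debug osd = 20
            debug ms = 1
            debug filestore = 20
            log file = /var/log/ceph/$name.log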