Hi, developers:

My environment is 3 mons, 2 mds, and 25 osds, running ceph v0.30.

This morning I noticed that one mds (the standby one) was missing when I ran "ceph -w". The mds line normally reads

    mds e42: 1/1/1 up {0=b=up:active}, 1 up:standby

but the standby was gone, so I ssh'd into MDS1, ran "cmds -i a -c /etc/ceph/ceph.conf", and checked that the mds daemon was running correctly.

Shortly after that, I found 17 osds suddenly down and out:

---------------------------------------------------------------------------------------------------------
osd0 up in weight 1 up_from 139 up_thru 229 down_at 138 last_clean_interval 120-137 192.168.10.10:6800/11191 192.168.10.10:6801/11191 192.168.10.10:6802/11191
osd1 down out up_from 141 up_thru 160 down_at 166 last_clean_interval 117-139
osd2 up in weight 1 up_from 147 up_thru 222 down_at 146 last_clean_interval 125-145 192.168.10.12:6800/10173 192.168.10.12:6801/10173 192.168.10.12:6802/10173
osd3 down out up_from 154 up_thru 160 down_at 166 last_clean_interval 130-152
osd4 down out up_from 155 up_thru 160 down_at 166 last_clean_interval 130-153
osd5 down out up_from 153 up_thru 160 down_at 166 last_clean_interval 134-151
osd6 down out up_from 153 up_thru 160 down_at 170 last_clean_interval 135-151
osd7 down out up_from 157 up_thru 160 down_at 162 last_clean_interval 133-155
osd8 down out up_from 154 up_thru 160 down_at 166 last_clean_interval 134-152
osd9 down out up_from 155 up_thru 160 down_at 168 last_clean_interval 133-153
osd10 down out up_from 141 up_thru 160 down_at 168 last_clean_interval 118-139
osd11 down out up_from 145 up_thru 160 down_at 168 last_clean_interval 119-143
osd12 down out up_from 142 up_thru 160 down_at 166 last_clean_interval 123-140
osd13 down out up_from 147 up_thru 160 down_at 166 last_clean_interval 123-145
osd14 up in weight 1 up_from 143 up_thru 223 down_at 142 last_clean_interval 124-141 192.168.10.24:6800/10122 192.168.10.24:6801/10122 192.168.10.24:6802/10122
osd15 down out up_from 147 up_thru 160 down_at 164 last_clean_interval 121-145
osd16 up in weight 1 up_from 148 up_thru 222 down_at 147 last_clean_interval 124-146 192.168.10.26:6800/9881 192.168.10.26:6801/9881 192.168.10.26:6802/9881
osd17 up in weight 1 up_from 148 up_thru 223 down_at 147 last_clean_interval 122-146 192.168.10.27:6800/9986 192.168.10.27:6801/9986 192.168.10.27:6802/9986
osd18 down out up_from 146 up_thru 160 down_at 168 last_clean_interval 124-144
osd19 down out up_from 147 up_thru 160 down_at 166 last_clean_interval 125-145
osd20 up in weight 1 up_from 148 up_thru 222 down_at 147 last_clean_interval 126-146 192.168.10.30:6800/9816 192.168.10.30:6801/9816 192.168.10.30:6802/9816
osd21 down out up_from 148 up_thru 160 down_at 168 last_clean_interval 126-146
osd22 up in weight 1 up_from 149 up_thru 220 down_at 148 last_clean_interval 127-147 192.168.10.32:6800/9640 192.168.10.32:6801/9640 192.168.10.32:6802/9640
osd23 down out up_from 153 up_thru 160 down_at 166 last_clean_interval 128-151
osd24 up in weight 1 up_from 150 up_thru 225 down_at 149 last_clean_interval 132-148 192.168.10.34:6800/10581 192.168.10.34:6801/10581 192.168.10.34:6802/10581
---------------------------------------------------------------------------------------------------------

Many pgs are now degraded. I ssh'd to every down-and-out osd host to check its log (/var/log/ceph/osd.x.log), but nothing was recorded there... Why did the logs stop logging anything?
So I run "cosd -i x -c /etc/ceph/ceph.conf" on every down osd individually to make them up and in, and then I can read/write files form ceph.... Could anyone tell me what's going on in my environment? 1. OSD Stability problem? 2. Ceph didn't write logs unexpectedly. Thanks a lot! :)
My ceph.conf:

[global]
        pid file = /var/run/ceph/$name.pid
        ; debug ms = 1
        ; enable secure authentication
        ; auth supported = cephx

[mon]
        mon data = /mon/mon$id
        debug mon = 0

[mon.a]
        host = MON1
        mon addr = 192.168.10.1:6789
[mon.b]
        host = MON2
        mon addr = 192.168.10.2:6789
[mon.c]
        host = MON3
        mon addr = 192.168.10.3:6789

[mds]
        debug mds = 0

[mds.a]
        host = MDS1
[mds.b]
        host = MDS2

[osd]
        osd data = /mnt/ext4/osd$id
        osd journal = /data/osd$id/journal
        osd journal size = 512    ; journal size, in megabytes
        filestore btrfs snap = false
        filestore fsync flushes journal data = true
        debug osd = 0

[osd.0]
        host = OSD1
[osd.1]
        host = OSD2
[osd.2]
        host = OSD3
[osd.3]
        host = OSD4
[osd.4]
        host = OSD5
[osd.5]
        host = OSD6
[osd.6]
        host = OSD7
[osd.7]
        host = OSD8
[osd.8]
        host = OSD9
[osd.9]
        host = OSD10
[osd.10]
        host = OSD11
[osd.11]
        host = OSD12
[osd.12]
        host = OSD13
[osd.13]
        host = OSD14
[osd.14]
        host = OSD15
[osd.15]
        host = OSD16
[osd.16]
        host = OSD17
[osd.17]
        host = OSD18
[osd.18]
        host = OSD19
[osd.19]
        host = OSD20
[osd.20]
        host = OSD21
[osd.21]
        host = OSD22
[osd.22]
        host = OSD23
[osd.23]
        host = OSD24
[osd.24]
        host = OSD25
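P.S. If more verbose logging would help with debugging this, I'm thinking of changing the [osd] section to something like the following before it happens again. This is just a sketch on my part; the debug levels are guesses, and I'm not completely sure these are the right option names for v0.30:

    [osd]
            ; sketch only: crank up debug output so the next incident is captured
            debug osd = 20
            debug ms = 1
            debug filestore = 20
            log file = /var/log/ceph/$name.log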