2011/4/29 Wido den Hollander <wido@xxxxxxxxx>:
> Hi,
>
> On Fri, 2011-04-29 at 15:20 +0800, AnnyRen wrote:
>> Hi,
>>
>> I set up a Ceph environment with 3 mon, 2 mds and 30 osd.
>> Recently, we have been testing the stability of the monitors.
>>
>> These are my test steps:
>>
>> 1. on MON1 (where mon.0 is running):
>> --> ceph mon remove 0
>>
>> 2. on MON2 (where mon.1 is running):
>> --> ceph mon add 0 192.168.200.181:6789   (mon.0 ip:port)
>>
>> 3. on MON2: rsync the mon data to MON1
>> --> rsync -av /data1/mon1/ MON1:/data1/mon0
>>
>> 4. on MON1: start the mon.0 daemon
>> --> cmon -i 0 -c /etc/ceph/ceph.conf
>
> I'm missing what you did here. You say you have 3 monitors, but here you
> seem to have only two monitors?

Originally I had 3 monitors. I wanted to remove the leader monitor to see
whether removing a monitor would affect clients writing files. So I used
'ceph mon remove' to modify the monmap, and that worked well. But after I
used 'ceph mon add' to modify the monmap again, rsynced the mon data to the
removed monitor, and then started the mon service on that rejoined monitor,
the problem below appeared.

> And why did you rsync your data from MON2 to MON1? If a monitor has been
> down for a while you don't have to sync the data.
>
> What were you trying to achieve here? Expand the number of monitors? If
> so, please refer to
> http://ceph.newdream.net/wiki/Monitor_cluster_expansion
>
>>
>> 5. Use 'ceph -w' to watch the messages.
>>
>> I saw 30 osds: 30 up, 30 in...
>>
>> but later, all the osds were marked down and out...
>
> Are you sure the OSD processes are still running?

Yes, I executed 'ps aux | grep cosd' on every OSD host and the osd services
are all running.

> If I read your command sequence correctly it could be that all your
> OSDs got cut off from the cluster if not all three monitors were in the
> monmap from the time the OSDs started. By removing the first monitor
> and adding the second one afterwards it could be that you cut off the
> monitors.
>
> Could you send your ceph.conf?

ceph.conf is attached.

Dear Wido, thanks for your quick response!

> Wido
>
>>
>> 2011-04-29 06:51:57.309468 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 0. Marking down!
>> 2011-04-29 06:51:57.309499 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 1. Marking down!
>> 2011-04-29 06:51:57.309513 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 2. Marking down!
>> 2011-04-29 06:51:57.309523 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 3. Marking down!
>> 2011-04-29 06:51:57.309532 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 4. Marking down!
>> 2011-04-29 06:51:57.309541 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 5. Marking down!
>> 2011-04-29 06:51:57.309550 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 6. Marking down!
>> 2011-04-29 06:51:57.309558 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 7. Marking down!
>> 2011-04-29 06:51:57.309567 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 8. Marking down!
>> 2011-04-29 06:51:57.309576 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 9. Marking down!
>> 2011-04-29 06:51:57.309585 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 10. Marking down!
>> 2011-04-29 06:51:57.309594 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 11. Marking down!
>> 2011-04-29 06:51:57.309603 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 12. Marking down!
>> 2011-04-29 06:51:57.309613 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 13. Marking down!
>> 2011-04-29 06:51:57.309622 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 14. Marking down!
>> 2011-04-29 06:51:57.309631 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 15. Marking down!
>> 2011-04-29 06:51:57.309648 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 16. Marking down!
>> 2011-04-29 06:51:57.309657 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 17. Marking down!
>> 2011-04-29 06:51:57.309666 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 18. Marking down!
>> 2011-04-29 06:51:57.309674 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 19. Marking down!
>> 2011-04-29 06:51:57.309682 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 20. Marking down!
>> 2011-04-29 06:51:57.309691 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 21. Marking down!
>> 2011-04-29 06:51:57.309699 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 22. Marking down!
>> 2011-04-29 06:51:57.309707 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 23. Marking down!
>> 2011-04-29 06:51:57.309716 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 24. Marking down!
>> 2011-04-29 06:51:57.309724 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 25. Marking down!
>> 2011-04-29 06:51:57.309732 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 26. Marking down!
>> 2011-04-29 06:51:57.309741 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 27. Marking down!
>> 2011-04-29 06:51:57.309749 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 28. Marking down!
>> 2011-04-29 06:51:57.309757 7fb17da56710 mon.0@0(leader).osd e148 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 29. Marking down!
>>
>> 2011-04-29 06:57:02.359021 7fb17da56710 log [INF] : osd0 out (down for 304.884685)
>> 2011-04-29 06:57:02.359070 7fb17da56710 log [INF] : osd1 out (down for 304.884684)
>> 2011-04-29 06:57:02.359090 7fb17da56710 log [INF] : osd2 out (down for 304.884684)
>> 2011-04-29 06:57:02.359109 7fb17da56710 log [INF] : osd3 out (down for 304.884683)
>> 2011-04-29 06:57:02.359126 7fb17da56710 log [INF] : osd4 out (down for 304.884683)
>> 2011-04-29 06:57:02.359145 7fb17da56710 log [INF] : osd5 out (down for 304.884682)
>> 2011-04-29 06:57:02.359162 7fb17da56710 log [INF] : osd6 out (down for 304.884682)
>> 2011-04-29 06:57:02.359179 7fb17da56710 log [INF] : osd7 out (down for 304.884682)
>> 2011-04-29 06:57:02.359197 7fb17da56710 log [INF] : osd8 out (down for 304.884682)
>> 2011-04-29 06:57:02.359215 7fb17da56710 log [INF] : osd9 out (down for 304.884682)
>> 2011-04-29 06:57:02.359234 7fb17da56710 log [INF] : osd10 out (down for 304.884681)
>> 2011-04-29 06:57:02.359252 7fb17da56710 log [INF] : osd11 out (down for 304.884681)
>> 2011-04-29 06:57:02.359270 7fb17da56710 log [INF] : osd12 out (down for 304.884681)
>> 2011-04-29 06:57:02.359289 7fb17da56710 log [INF] : osd13 out (down for 304.884680)
>> 2011-04-29 06:57:02.359308 7fb17da56710 log [INF] : osd14 out (down for 304.884680)
>> 2011-04-29 06:57:02.359328 7fb17da56710 log [INF] : osd15 out (down for 304.884680)
>> 2011-04-29 06:57:02.359347 7fb17da56710 log [INF] : osd16 out (down for 304.884679)
>> 2011-04-29 06:57:02.359367 7fb17da56710 log [INF] : osd17 out (down for 304.884679)
>> 2011-04-29 06:57:02.359386 7fb17da56710 log [INF] : osd18 out (down for 304.884679)
>> 2011-04-29 06:57:02.359406 7fb17da56710 log [INF] : osd19 out (down for 304.884679)
>> 2011-04-29 06:57:02.359425 7fb17da56710 log [INF] : osd20 out (down for 304.884678)
>> 2011-04-29 06:57:02.359458 7fb17da56710 log [INF] : osd21 out (down for 304.884678)
>> 2011-04-29 06:57:02.359480 7fb17da56710 log [INF] : osd22 out (down for 304.884678)
>> 2011-04-29 06:57:02.359499 7fb17da56710 log [INF] : osd23 out (down for 304.884677)
>> 2011-04-29 06:57:02.359518 7fb17da56710 log [INF] : osd24 out (down for 304.884677)
>> 2011-04-29 06:57:02.359538 7fb17da56710 log [INF] : osd25 out (down for 304.884677)
>> 2011-04-29 06:57:02.359557 7fb17da56710 log [INF] : osd26 out (down for 304.884676)
>> 2011-04-29 06:57:02.359577 7fb17da56710 log [INF] : osd27 out (down for 304.884676)
>> 2011-04-29 06:57:02.359596 7fb17da56710 log [INF] : osd28 out (down for 304.884676)
>> 2011-04-29 06:57:02.359616 7fb17da56710 log [INF] : osd29 out (down for 304.884675)
>>
>> root@MON1:~# ceph -s
>> 2011-04-29 07:17:00.226534    pg v3025: 7920 pgs: 7920 active+clean; 3650 MB data, 0 KB used, 0 KB / 0 KB avail
>> 2011-04-29 07:17:00.239044   mds e42: 1/1/1 up {0=up:active}, 1 up:standby
>> 2011-04-29 07:17:00.239080   osd e151: 30 osds: 0 up, 0 in
>> 2011-04-29 07:17:00.239175   log 2011-04-29 06:57:02.359625 mon0 192.168.200.181:6789/0 71 : [INF] osd29 out (down for 304.884675)
>> 2011-04-29 07:17:00.239233   mon e3: 3 mons at {0=192.168.200.181:6789/0,1=192.168.200.182:6789/0,2=192.168.200.183:6789/0}
>>
>> Does anyone know how to fix my environment and make Ceph healthy again?
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
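For reference, the Monitor_cluster_expansion wiki page Wido links to describes
roughly the sequence sketched below for bringing a monitor (back) into the map.
This is only a sketch of that procedure, not a verified transcript: it assumes a
cmon binary of this era that accepts --mkfs and --monmap (the same mechanism
mkcephfs uses internally), and the exact flags may differ between versions. The
key difference from steps 1-4 above is that the rejoining monitor's data
directory is initialized from a freshly fetched monmap rather than rsynced from
a running monitor's live data directory:

    # on a surviving monitor host (e.g. MON2): fetch the current monmap
    ceph mon getmap -o /tmp/monmap

    # on MON1: rebuild mon.0's data directory from that monmap
    # (assumes cmon supports --mkfs/--monmap in this release)
    cmon --mkfs -i 0 --monmap /tmp/monmap -c /etc/ceph/ceph.conf

    # register mon.0 in the monmap, then start the daemon on MON1
    ceph mon add 0 192.168.200.181:6789
    cmon -i 0 -c /etc/ceph/ceph.conf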
[global]
    pid file = /var/run/ceph/$name.pid
    ; debug ms = 1
    ; enable secure authentication
    ; auth supported = cephx

[mon]
    mon data = /data1/mon$id
    debug mon = 0

[mon0]
    host = MON1
    mon addr = 192.168.200.181:6789

[mon1]
    host = MON2
    mon addr = 192.168.200.182:6789

[mon2]
    host = MON3
    mon addr = 192.168.200.183:6789

[mds]
    debug mds = 0

[mds0]
    host = MDS1

[mds1]
    host = MDS2

[osd]
    osd data = /mnt/ext4/osd$id
    osd journal = /mnt/ext4/osd$id/journal
    osd journal size = 512    ; journal size, in megabytes
    filestore btrfs snap = false
    filestore fsync flushes journal data = true
    debug osd = 0

[osd0]
    host = OSD1
[osd1]
    host = OSD1
[osd2]
    host = OSD1
[osd3]
    host = OSD2
[osd4]
    host = OSD2
[osd5]
    host = OSD2
[osd6]
    host = OSD3
[osd7]
    host = OSD3
[osd8]
    host = OSD3
[osd9]
    host = OSD4
[osd10]
    host = OSD4
[osd11]
    host = OSD4
[osd12]
    host = OSD5
[osd13]
    host = OSD5
[osd14]
    host = OSD5
[osd15]
    host = OSD6
[osd16]
    host = OSD6
[osd17]
    host = OSD6
[osd18]
    host = OSD7
[osd19]
    host = OSD7
[osd20]
    host = OSD7
[osd21]
    host = OSD8
[osd22]
    host = OSD8
[osd23]
    host = OSD8
[osd24]
    host = OSD9
[osd25]
    host = OSD9
[osd26]
    host = OSD9
[osd27]
    host = OSD10
[osd28]
    host = OSD10
[osd29]
    host = OSD10
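On the "30 osds: 0 up, 0 in" state itself: Wido's theory is that the running
cosd processes are still talking to a monmap that no longer matches the monitor
cluster, so their MOSDPGStat reports never reach the current leader. A rough
way to check and recover, assuming the cosd daemon takes the same -i/-c flags
as cmon (the daemons in this release line do), and using osd.0 on host OSD1
purely as an example id:

    # from any host with client access: confirm the monmap lists all 3 mons
    ceph -s

    # on OSD1: restart one osd so it refetches the current monmap from the
    # mon addresses in ceph.conf (osd.0 is only an example)
    pkill -f 'cosd -i 0' || true
    cosd -i 0 -c /etc/ceph/ceph.conf

    # watch whether the restarted osd is marked up and in again
    ceph -w

If the restarted osd comes back up, restarting the remaining cosd daemons one
host at a time should bring the cluster back toward 30 up, 30 in.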