In the monitor log you sent along, the monitor was crashing on a setcrushmap command. Where in this sequence of events did that happen?

On Wed, Jul 17, 2013 at 5:07 PM, Vladislav Gorbunov <vadikgo@xxxxxxxxx> wrote:
> That's what I did:
>
> cluster state HEALTH_OK
>
> 1. load the crush map from the cluster:
> https://dl.dropboxusercontent.com/u/2296931/ceph/crushmap1.txt
>
> 2. modify the crush map to add a pool and ruleset iscsi with 2
> datacenters, upload the crush map to the cluster:
> https://dl.dropboxusercontent.com/u/2296931/ceph/crushmap2.txt
>
> 3. add host gstore1:
>
> ceph-deploy osd create gstore1:/dev/sdh:/dev/sdb1
> ceph-deploy osd create gstore1:/dev/sdj:/dev/sdc1
> ceph-deploy osd create gstore1:/dev/sdk:/dev/sdc2
>
> 4. move these osds to datacenter datacenter-cod:
>
> ceph osd crush set 82 0 root=iscsi datacenter=datacenter-cod host=gstore1
> ceph osd crush set 83 0 root=iscsi datacenter=datacenter-cod host=gstore1
> ceph osd crush set 84 0 root=iscsi datacenter=datacenter-cod host=gstore1
>
> 5. cluster state HEALTH_OK, reweight the new osds:
>
> ceph osd crush reweight osd.82 2.73
> ceph osd crush reweight osd.83 2.73
> ceph osd crush reweight osd.84 2.73
>
> 6. exclude osd.57 (in the default pool) from the cluster:
>
> ceph osd crush reweight osd.57 0
>
> cluster state HEALTH_WARN
>
> 7. add new node gstore2, same as gstore1:
>
> ceph-deploy -v osd create gstore2:/dev/sdh:/dev/sdb1
> ceph osd crush set 94 2.73 root=iscsi datacenter=datacenter-rcod host=gstore2

Where are you getting these numbers 82-84 and 92-94 from? They don't appear in any of the maps you've sent along...

> 8. exclude osd.56 (in the default pool) from the cluster:
>
> ceph osd crush reweight osd.57 0
>
> 9. add new osds to gstore2:
>
> ceph-deploy osd create gstore2:/dev/sdl:/dev/sdd1
> ceph-deploy osd create gstore2:/dev/sdm:/dev/sdd2
> …
> ceph-deploy osd create gstore2:/dev/sds:/dev/sdg2
>
> 10. rename the pool iscsi in the default crush pool:
>
> ceph osd pool rename iscsi iscsi-old
>
> 11. create a new pool iscsi:
>
> ceph osd pool create iscsi 2048 2048
>
> 12. set ruleset iscsi on the new pool iscsi:
>
> ceph osd pool set iscsi crush_ruleset 3
>
> All OSDs went down with a Segmentation fault

Okay, so you switched to actually start using the new rule and the OSDs broke. It's possible there's a hole in our crush map testing that would let this through.

> 13. fail back to ruleset 0 for pool iscsi:
>
> ceph osd pool set iscsi crush_ruleset 0
>
> delete ruleset iscsi, upload the crushmap to the cluster:
> https://dl.dropboxusercontent.com/u/2296931/ceph/crushmap14-new.txt
>
> OSDs still crash with a Segmentation fault

Yeah, once you've put a bad map into the system you can't fix it by putting in a good one; all the OSDs need to evaluate the past maps on startup, which includes the bad one, which makes them crash again. :(

Can you provide us a tarball of one of your monitor directories? We'd like to do some forensics on it to identify the scenario precisely and prevent it from happening again.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
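
PS: A few things that may help while you're digging. These are generic sketches rather than exact recipes for your cluster, so adjust names and numbers to your setup.

Before pointing a pool at a ruleset, it's worth confirming what the cluster actually thinks that ruleset contains (on reasonably recent releases):

  ceph osd crush rule dump

and checking that the ruleset number shown there matches the one you pass to crush_ruleset.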
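
Similarly, an edited map can be sanity-checked with crushtool before it ever touches the cluster. A rough sketch, where the ruleset number 3 and replica count 2 are just placeholders matching your iscsi pool, and the exact test flags available depend on your crushtool version:

  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt
  # edit crush.txt, then recompile and test before injecting it
  crushtool -c crush.txt -o crush.new
  crushtool -i crush.new --test --rule 3 --num-rep 2 --show-bad-mappings
  ceph osd setcrushmap -i crush.new

If the test step reports errors or bad mappings, don't inject the map.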
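
If you want to look at the map that actually broke things, you can pull it back out of history and decompile it. EPOCH here is a placeholder; I'm assuming you can work out the osdmap epoch from around when crush_ruleset 3 was set (ceph osd dump shows the current epoch, and the bad one will be somewhat earlier):

  ceph osd getmap EPOCH -o osdmap.bad
  osdmaptool osdmap.bad --export-crush crush.bad
  crushtool -d crush.bad -o crush.bad.txt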
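
And for the tarball: assuming a default deployment, the mon data lives under /var/lib/ceph/mon/ceph-<id> (your path may differ), and it's best to stop that monitor briefly so the store isn't changing underneath you. Something like:

  service ceph stop mon.<id>    # or however you normally stop that daemon
  tar czf mon-<id>.tar.gz -C /var/lib/ceph/mon ceph-<id>
  service ceph start mon.<id>

Here <id> is whatever that monitor is called in your ceph.conf; if you're running three or more monitors, stopping one briefly won't break quorum.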