In the monitor log you sent along, the monitor was crashing on a setcrushmap command. Where in this sequence of events did that happen?

On Wed, Jul 17, 2013 at 5:07 PM, Vladislav Gorbunov <vadikgo@xxxxxxxxx> wrote:
> That's what I did:
>
> cluster state HEALTH_OK
>
> 1. load the crush map from the cluster:
> https://dl.dropboxusercontent.com/u/2296931/ceph/crushmap1.txt
>
> 2. modify the crush map to add a pool and ruleset iscsi with 2
> datacenters, upload the crush map to the cluster:
> https://dl.dropboxusercontent.com/u/2296931/ceph/crushmap2.txt
>
> 3. add host gstore1:
>
> ceph-deploy osd create gstore1:/dev/sdh:/dev/sdb1
> ceph-deploy osd create gstore1:/dev/sdj:/dev/sdc1
> ceph-deploy osd create gstore1:/dev/sdk:/dev/sdc2
>
> 4. move these osds to datacenter datacenter-cod:
>
> ceph osd crush set 82 0 root=iscsi datacenter=datacenter-cod host=gstore1
> ceph osd crush set 83 0 root=iscsi datacenter=datacenter-cod host=gstore1
> ceph osd crush set 84 0 root=iscsi datacenter=datacenter-cod host=gstore1
>
> 5. cluster state HEALTH_OK, reweight the new osds:
>
> ceph osd crush reweight osd.82 2.73
> ceph osd crush reweight osd.83 2.73
> ceph osd crush reweight osd.84 2.73
>
> 6. exclude osd.57 (in the default pool) from the cluster:
>
> ceph osd crush reweight osd.57 0
>
> cluster state HEALTH_WARN
>
> 7. add new node gstore2, same as gstore1:
>
> ceph-deploy -v osd create gstore2:/dev/sdh:/dev/sdb1
> ceph osd crush set 94 2.73 root=iscsi datacenter=datacenter-rcod host=gstore2

Where are you getting these numbers 82-84 and 92-94 from? They don't appear in any of the maps you've sent along...

> 8. exclude osd.56 (in the default pool) from the cluster:
>
> ceph osd crush reweight osd.57 0
>
> 9. add new osds to gstore2:
>
> ceph-deploy osd create gstore2:/dev/sdl:/dev/sdd1
> ceph-deploy osd create gstore2:/dev/sdm:/dev/sdd2
> …
> ceph-deploy osd create gstore2:/dev/sds:/dev/sdg2
>
> 10. rename the pool iscsi in the default crush pool:
>
> ceph osd pool rename iscsi iscsi-old
>
> 11. create a new pool iscsi:
>
> ceph osd pool create iscsi 2048 2048
>
> 12. set ruleset iscsi on the new pool iscsi:
>
> ceph osd pool set iscsi crush_ruleset 3
>
> All OSDs went down with a Segmentation fault

Okay, so you switched to actually start using the new rule and the OSDs broke. It's possible there's a hole in our crush map testing that would let this through.

> 13. fail back to ruleset 0 for pool iscsi:
>
> ceph osd pool set iscsi crush_ruleset 0
>
> delete ruleset iscsi, upload the crushmap to the cluster:
> https://dl.dropboxusercontent.com/u/2296931/ceph/crushmap14-new.txt
>
> OSDs still crash with a Segmentation fault

Yeah, once you've put a bad map into the system you can't fix it by putting in a good one; all the OSDs need to evaluate the past maps on startup, which includes the bad one, which makes them crash again. :(

Can you provide us a tarball of one of your monitor directories? We'd like to do some forensics on it to identify the scenario precisely and prevent it from happening again.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
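
PS: A few things that may help while you're digging. These are generic sketches rather than exact recipes for your cluster, so adjust names and numbers to your setup.

Before pointing a pool at a ruleset, it's worth confirming what the cluster actually thinks that ruleset contains (on reasonably recent releases):

  ceph osd crush rule dump

and checking that the ruleset number shown there matches the one you pass to crush_ruleset.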
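
Similarly, an edited map can be sanity-checked with crushtool before it ever touches the cluster. A rough sketch, where the ruleset number 3 and replica count 2 are just placeholders matching your iscsi pool, and the exact test flags available depend on your crushtool version:

  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt
  # edit crush.txt, then recompile and test before injecting it
  crushtool -c crush.txt -o crush.new
  crushtool -i crush.new --test --rule 3 --num-rep 2 --show-bad-mappings
  ceph osd setcrushmap -i crush.new

If the test step reports errors or bad mappings, don't inject the map.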
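
If you want to look at the map that actually broke things, you can pull it back out of history and decompile it. EPOCH here is a placeholder; I'm assuming you can work out the osdmap epoch from around when crush_ruleset 3 was set (ceph osd dump shows the current epoch, and the bad one will be somewhat earlier):

  ceph osd getmap EPOCH -o osdmap.bad
  osdmaptool osdmap.bad --export-crush crush.bad
  crushtool -d crush.bad -o crush.bad.txt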
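
And for the tarball: assuming a default deployment, the mon data lives under /var/lib/ceph/mon/ceph-<id> (your path may differ), and it's best to stop that monitor briefly so the store isn't changing underneath you. Something like:

  service ceph stop mon.<id>    # or however you normally stop that daemon
  tar czf mon-<id>.tar.gz -C /var/lib/ceph/mon ceph-<id>
  service ceph start mon.<id>

Here <id> is whatever that monitor is called in your ceph.conf; if you're running three or more monitors, stopping one briefly won't break quorum.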