> In the monitor log you sent along, the monitor was crashing on a
> setcrushmap command. Where in this sequence of events did that happen?

It happened after I tried to upload a different crushmap, much later, at step 13.

> Where are you getting these numbers 82-84 and 92-94 from? They don't
> appear in any of the maps you've sent along...

Sorry, this is the crushmap from after the OSDs broke:
https://dl.dropboxusercontent.com/u/2296931/ceph/crushmap14-2.txt

> Can you provide us a tarball of one of your monitor directories?

https://dl.dropboxusercontent.com/u/2296931/ceph/ceph-mon.1.tar.bz2

2013/7/19 Gregory Farnum <greg@xxxxxxxxxxx>:
> In the monitor log you sent along, the monitor was crashing on a
> setcrushmap command. Where in this sequence of events did that happen?
>
> On Wed, Jul 17, 2013 at 5:07 PM, Vladislav Gorbunov <vadikgo@xxxxxxxxx> wrote:
>> That's what I did:
>>
>> Cluster state: HEALTH_OK.
>>
>> 1. Load the crush map from the cluster:
>> https://dl.dropboxusercontent.com/u/2296931/ceph/crushmap1.txt
>>
>> 2. Modify the crush map to add a pool and ruleset iscsi with 2
>> datacenters, then upload the crush map to the cluster:
>> https://dl.dropboxusercontent.com/u/2296931/ceph/crushmap2.txt
>>
>> 3. Add host gstore1:
>>
>> ceph-deploy osd create gstore1:/dev/sdh:/dev/sdb1
>> ceph-deploy osd create gstore1:/dev/sdj:/dev/sdc1
>> ceph-deploy osd create gstore1:/dev/sdk:/dev/sdc2
>>
>> 4. Move these OSDs to datacenter datacenter-cod:
>> ceph osd crush set 82 0 root=iscsi datacenter=datacenter-cod host=gstore1
>> ceph osd crush set 83 0 root=iscsi datacenter=datacenter-cod host=gstore1
>> ceph osd crush set 84 0 root=iscsi datacenter=datacenter-cod host=gstore1
>>
>> 5. Cluster state HEALTH_OK; reweight the new OSDs:
>> ceph osd crush reweight osd.82 2.73
>> ceph osd crush reweight osd.83 2.73
>> ceph osd crush reweight osd.84 2.73
>>
>> 6. Exclude osd.57 (in the default pool) from the cluster:
>> ceph osd crush reweight osd.57 0
>> Cluster state: HEALTH_WARN.
>>
>> 7. Add a new node gstore2, same as gstore1:
>> ceph-deploy -v osd create gstore2:/dev/sdh:/dev/sdb1
>> ceph osd crush set 94 2.73 root=iscsi datacenter=datacenter-rcod host=gstore2
>
> Where are you getting these numbers 82-84 and 92-94 from? They don't
> appear in any of the maps you've sent along...
>
>
>> 8. Exclude osd.56 (in the default pool) from the cluster:
>> ceph osd crush reweight osd.57 0
>>
>> 9. Add new OSDs to gstore2:
>> ceph-deploy osd create gstore2:/dev/sdl:/dev/sdd1
>> ceph-deploy osd create gstore2:/dev/sdm:/dev/sdd2
>> …
>> ceph-deploy osd create gstore2:/dev/sds:/dev/sdg2
>>
>> 10. Rename the pool iscsi (in the default crush pool):
>> ceph osd pool rename iscsi iscsi-old
>>
>> 11. Create a new pool iscsi:
>> ceph osd pool create iscsi 2048 2048
>>
>> 12. Set ruleset iscsi on the new pool iscsi:
>> ceph osd pool set iscsi crush_ruleset 3
>>
>> All OSDs went down with a segmentation fault.
>
> Okay, so you switched to actually start using the new rule and the
> OSDs broke. It's possible there's a hole in our crush map testing that
> would let this through.
>
>> 13. Fall back to ruleset 0 for pool iscsi:
>> ceph osd pool set iscsi crush_ruleset 0
>>
>> Delete ruleset iscsi and upload the crushmap to the cluster:
>> https://dl.dropboxusercontent.com/u/2296931/ceph/crushmap14-new.txt
>>
>> The OSDs still crash with the segmentation fault.
>
> Yeah, once you've put a bad map into the system then you can't fix it
> by putting in a good one — all the OSDs need to evaluate the past maps
> on startup, which includes the bad one, which makes them crash again.
> :(
>
> Can you provide us a tarball of one of your monitor directories? We'd
> like to do some forensics on it to identify the scenario precisely and
> prevent it from happening again.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
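As an aside, here is a minimal sketch of how a candidate rule can be dry-run offline before injecting the map. This assumes crushtool is available on an admin node; the rule number 3 and replica count 2 are taken from the steps above, the file names are illustrative, and the exact test options available vary between crushtool releases:

# Grab and decompile the CRUSH map currently in use.
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt

# After editing, recompile and dry-run the candidate rule (ruleset 3)
# with the intended replica count, listing placements that fail.
crushtool -c crush.txt -o crush.new
crushtool -i crush.new --test --rule 3 --num-rep 2 --show-bad-mappings

# Inject the map only if the test output looks sane.
ceph osd setcrushmap -i crush.new

If the tester itself has the hole Greg suspects, this will not necessarily catch the bad rule, but it does flag rules that produce missing or bad mappings before they reach the cluster.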