Hey! I caught it again. It's a kernel bug: the kernel crashes if I try
to map an RBD device with a map like the one quoted below! Hooray!

2015-05-11 12:11 GMT+03:00 Timofey Titovets <nefelim4ag@xxxxxxxxx>:
> FYI and for the record, the rule was:
> # rules
> rule replicated_ruleset {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step choose firstn 0 type room
>         step choose firstn 0 type rack
>         step choose firstn 0 type host
>         step chooseleaf firstn 0 type osd
>         step emit
> }
>
> And after resetting the node I can't find any usable info. The cluster
> works fine and the data was simply rebalanced across the OSD disks.
> syslog:
> May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Reloading.
> May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Starting Network Time Synchronization...
> May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Started Network Time Synchronization.
> May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Reloading.
> May  9 19:30:02 srv-lab-ceph-node-01 CRON[1731]: (CRON) info (No MTA installed, discarding output)
> May 11 11:54:57 srv-lab-ceph-node-01 rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="689" x-info="http://www.rsyslog.com"] start
> May 11 11:54:56 srv-lab-ceph-node-01 rsyslogd: rsyslogd's groupid changed to 103
> May 11 11:54:57 srv-lab-ceph-node-01 rsyslogd: rsyslogd's userid changed to 100
>
> Sorry for the noise, guys. Georgios, thanks anyway for the help.
>
> 2015-05-10 12:44 GMT+03:00 Georgios Dimitrakakis <giorgis@xxxxxxxxxxxx>:
>> Timofey,
>>
>> maybe your best chance is to connect directly to the server and see
>> what is going on; then you can try to debug why the problem occurred.
>> If you don't want to wait until tomorrow, you can try the server's
>> direct remote console access. Most servers provide this, just under a
>> different name (DELL calls it iDRAC, Fujitsu iRMC, etc.), so if you
>> have it up and running you can use that.
>>
>> I think this should be your starting point, and you can take it from
>> there.
>>
>> I'm sorry I can't help you further with the CRUSH rules and the
>> reason why it crashed, since I am far from being an expert in the
>> field :-(
>>
>> Regards,
>>
>> George
>>
>>
>>> Georgios, oh, sorry for my poor English, maybe I expressed poorly
>>> what I want =]
>>>
>>> I know how to write a simple CRUSH rule and how to use it; I want
>>> several things:
>>> 1. To understand why, after injecting the bad map, my test node went
>>> offline. This was unexpected.
>>> 2. Maybe somebody can explain what happens with this map, and why.
>>> 3. It is not a problem to write several crushmaps and/or switch
>>> between them while the cluster is running. But in production we have
>>> several NFS servers that I am thinking about moving to Ceph, and I
>>> can't take more than one server down for maintenance at a time. I
>>> want to avoid a data disaster while setting up and moving the data
>>> to Ceph, so a rule like "use local data replication if only one node
>>> exists" looks usable as a temporary solution until I add the second
>>> node.
>>> 4. Maybe someone else with a test cluster can check what happens to
>>> clients when a crushmap like the one I injected is active.
>>>
>>> 2015-05-10 8:23 GMT+03:00 Georgios Dimitrakakis <giorgis@xxxxxxxxxxxx>:
>>>>
>>>> Hi Timofey,
>>>>
>>>> assuming that you have more than one OSD host and that the
>>>> replication factor is equal to (or less than) the number of hosts,
>>>> why don't you just change the crushmap to host replication?
>>>>
>>>> You just need to change the default CRUSHmap rule from
>>>>
>>>>         step chooseleaf firstn 0 type osd
>>>>
>>>> to
>>>>
>>>>         step chooseleaf firstn 0 type host
>>>>
>>>> I believe that this is the easiest way to have replication across
>>>> OSD nodes, unless you have a much more "sophisticated" setup.
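The effect of that one-line change can be sketched with a toy model.
This is not the real CRUSH algorithm (no hashing, no weights, no
retries), and the host/OSD names are invented for illustration; it only
shows why `type host` is the usual choice: it places replicas on
distinct hosts, while `type osd` treats every OSD as an independent
leaf and may co-locate replicas on one host.

```python
import random

# Hypothetical two-node cluster, two OSDs per node.
CLUSTER = {
    "node-01": ["osd.0", "osd.1"],
    "node-02": ["osd.2", "osd.3"],
}

def pick_osd_leaf(num_rep, rng):
    """Sketch of 'chooseleaf ... type osd': sample leaves cluster-wide,
    ignoring which host each OSD belongs to."""
    all_osds = [o for osds in CLUSTER.values() for o in osds]
    return rng.sample(all_osds, num_rep)

def pick_host_leaf(num_rep, rng):
    """Sketch of 'chooseleaf ... type host': pick distinct hosts first,
    then one OSD under each chosen host."""
    hosts = rng.sample(list(CLUSTER), num_rep)
    return [rng.choice(CLUSTER[h]) for h in hosts]

host_of = {o: h for h, osds in CLUSTER.items() for o in osds}
rng = random.Random(42)

# Host-level placement: the two replicas always land on different hosts.
for _ in range(1000):
    a, b = pick_host_leaf(2, rng)
    assert host_of[a] != host_of[b]

# OSD-level placement: sooner or later both replicas share a host, so a
# single node failure can take out every copy of some objects.
collided = any(host_of[a] == host_of[b]
               for a, b in (pick_osd_leaf(2, rng) for _ in range(1000)))
print("osd-level placement can co-locate replicas:", collided)
```

That is the whole argument for host replication: survival of a node
failure, not just a disk failure.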
>>>>
>>>> Regards,
>>>>
>>>> George
>>>>
>>>>
>>>>
>>>>> Hi list,
>>>>> I have been experimenting with CRUSH maps, trying to get RAID1-like
>>>>> behaviour: if the cluster has only one working OSD node, duplicate
>>>>> the data across its local disks, to avoid data loss on a local disk
>>>>> failure and to let clients keep working, because this should not be
>>>>> a degraded state.
>>>>> (In the best case I want a dynamic rule like:
>>>>>   if there is only one host -> spread data over local disks;
>>>>>   else if host count > 1 -> spread over hosts (racks or something else).)
>>>>>
>>>>> I wrote a rule like the one below:
>>>>>
>>>>> rule test {
>>>>>         ruleset 0
>>>>>         type replicated
>>>>>         min_size 0
>>>>>         max_size 10
>>>>>         step take default
>>>>>         step choose firstn 0 type host
>>>>>         step chooseleaf firstn 0 type osd
>>>>>         step emit
>>>>> }
>>>>>
>>>>> I injected it into the cluster, and the client node now looks like
>>>>> it got a kernel panic: I have lost my connection to it. No SSH, no
>>>>> ping; it is a remote node, so I can't see what happened until
>>>>> Monday. Yes, it looks like I shot myself in the foot.
>>>>> This is just a test setup, and destroying the cluster is not a
>>>>> problem, but I think a broken rule must not crash anything else;
>>>>> in the worst case it should simply be rejected by the cluster or
>>>>> the crushtool compiler.
>>>>>
>>>>> Can someone explain how this rule can crash the system? Or is
>>>>> there a crazy mistake somewhere?
>>>>
>>>>
>>>>
>>>> --
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Have a nice day,
> Timofey.

--
Have a nice day,
Timofey.
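Setting the crash aside, what the `test` rule asks for can be sketched
with a toy model. This is not the real CRUSH algorithm (no hashing,
weights, or retries), the names are invented, and the ordering of the
candidate list is an assumption; the sketch only illustrates how the two
steps compose: `choose firstn 0 type host` picks up to num_rep hosts,
`chooseleaf firstn 0 type osd` then picks up to num_rep OSDs under each,
and only the first num_rep results are used.

```python
# Toy composition of "choose firstn 0 type host" followed by
# "chooseleaf firstn 0 type osd" (NOT real CRUSH -- an illustration of
# the nested expansion only, with deterministic first-N selection).
def toy_place(hosts, num_rep):
    """hosts: dict host_name -> list of OSD ids; returns the OSD ids
    the toy rule would store the num_rep replicas on."""
    # step choose firstn 0 type host -> up to num_rep hosts
    chosen_hosts = list(hosts)[:num_rep]
    # step chooseleaf firstn 0 type osd -> up to num_rep OSDs per host
    candidates = []
    for h in chosen_hosts:
        candidates.extend(hosts[h][:num_rep])
    # only num_rep replicas are actually stored
    return candidates[:num_rep]

# One host: both replicas land on local disks -- the "RAID1 on a single
# node" behaviour the rule was trying to get.
one_host = {"node-01": ["osd.0", "osd.1"]}
print(toy_place(one_host, 2))   # ['osd.0', 'osd.1']

# Two hosts: the expansion yields 2 hosts x 2 OSDs = 4 candidates, but
# only the first 2 are kept, so in this toy both replicas still come
# from the first host -- the rule would not spread across hosts even
# when more than one is available.
two_hosts = {"node-01": ["osd.0", "osd.1"],
             "node-02": ["osd.2", "osd.3"]}
print(toy_place(two_hosts, 2))  # ['osd.0', 'osd.1']
```

If the toy's ordering matches the real mapping, the rule is at best a
single-node stopgap; none of this should crash a kernel, of course,
which is why the rbd map panic reads like a client-side bug.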