Oops... too fast to answer...
G.
On Mon, 11 May 2015 12:13:48 +0300, Timofey Titovets wrote:
Hey! I caught it again. It's a kernel bug: the kernel crashes if I try
to map an rbd device with a CRUSH map like this one!
Hooray!
2015-05-11 12:11 GMT+03:00 Timofey Titovets <nefelim4ag@xxxxxxxxx>:
FYI, and for the record.
The rule:
# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type room
        step choose firstn 0 type rack
        step choose firstn 0 type host
        step chooseleaf firstn 0 type osd
        step emit
}
After resetting the node, I can't find any usable info. The cluster
works fine, and the data was simply rebalanced across the OSD disks.
syslog:
May 9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Reloading.
May 9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Starting Network Time Synchronization...
May 9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Started Network Time Synchronization.
May 9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Reloading.
May 9 19:30:02 srv-lab-ceph-node-01 CRON[1731]: (CRON) info (No MTA installed, discarding output)
May 11 11:54:57 srv-lab-ceph-node-01 rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="689" x-info="http://www.rsyslog.com"] start
May 11 11:54:56 srv-lab-ceph-node-01 rsyslogd: rsyslogd's groupid changed to 103
May 11 11:54:57 srv-lab-ceph-node-01 rsyslogd: rsyslogd's userid changed to 100
Sorry for the noise, guys. Georgios, in any case, thanks for helping.
2015-05-10 12:44 GMT+03:00 Georgios Dimitrakakis <giorgis@xxxxxxxxxxxx>:
Timofey,
maybe your best chance is to connect directly to the server and see
what is going on. Then you can try to debug why the problem occurred.
If you don't want to wait until tomorrow, you may try to see what is
going on using the server's direct remote console access.
The majority of servers provide that, each under a different name
(DELL calls it iDRAC, Fujitsu iRMC, etc.), so if you have it up and
running you can use that.
I think this should be your starting point, and you can take it on
from there.
I am sorry I cannot help you further with the CRUSH rules and the
reason why it crashed, since I am far from being an expert in the
field :-(
Regards,
George
Georgios, oh, sorry for my poor English _-_, maybe I expressed what I
want poorly =]
I know how to write a simple CRUSH rule and how to use it. I want
several things:
1. To understand why my test node went offline after I injected the
bad map. This is unexpected.
2. Maybe somebody can explain what happens with this map, and why.
3. It is not a problem to write several CRUSH maps and/or switch
between them while the cluster is running.
But in production we have several NFS servers that I am thinking about
moving to Ceph, and I can't take more than one server down for
maintenance at a time. I want to avoid a data disaster while setting up
and moving data to Ceph, so a case like "use local data replication if
only one node exists" looks usable as a temporary solution until I add
a second node _-_.
4. Maybe someone else has a test cluster and can test what happens
with clients if a CRUSH map like this is injected.
2015-05-10 8:23 GMT+03:00 Georgios Dimitrakakis <giorgis@xxxxxxxxxxxx>:
Hi Timofey,
assuming that you have more than one OSD host and that the replication
factor is equal to (or less than) the number of hosts, why don't you
just change the CRUSH map to host replication?
You just need to change the default CRUSH map rule from
        step chooseleaf firstn 0 type osd
to
        step chooseleaf firstn 0 type host
I believe that this is the easiest way to have replication across OSD
nodes, unless you have a much more "sophisticated" setup.
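To illustrate the difference between the two failure domains in the
suggestion above, here is a minimal Python sketch. It is a toy model, not
CRUSH's actual algorithm: the cluster layout, the hash-based scoring, and
the names CLUSTER, place and host_of are all assumptions made up for the
example. It only shows that separating by "osd" may put two copies on
one host, while separating by "host" puts each copy on a different host.

```python
import hashlib

# Hypothetical two-host test cluster: host name -> OSD ids on that host.
CLUSTER = {
    "node-01": [0, 1, 2],
    "node-02": [3, 4, 5],
}


def _score(pg_id, item):
    """Deterministic pseudo-random score for a (placement group, item) pair."""
    digest = hashlib.sha256(f"{pg_id}:{item}".encode()).hexdigest()
    return int(digest, 16)


def place(cluster, pg_id, replicas, failure_domain):
    """Pick up to `replicas` OSDs, keeping copies apart by `failure_domain`."""
    if failure_domain == "osd":
        # Any distinct OSDs will do, even if they share a host.
        osds = [osd for host_osds in cluster.values() for osd in host_osds]
        ranked = sorted(osds, key=lambda o: _score(pg_id, f"osd.{o}"))
        return ranked[:replicas]
    if failure_domain == "host":
        # First pick distinct hosts, then one OSD inside each of them.
        hosts = sorted(cluster, key=lambda h: _score(pg_id, h))[:replicas]
        return [
            min(cluster[h], key=lambda o: _score(pg_id, f"osd.{o}"))
            for h in hosts
        ]
    raise ValueError(f"unknown failure domain: {failure_domain}")


def host_of(cluster, osd):
    """Return the host that owns the given OSD id."""
    return next(h for h, osds in cluster.items() if osd in osds)


if __name__ == "__main__":
    for pg in range(4):
        by_osd = place(CLUSTER, pg, 2, "osd")
        by_host = place(CLUSTER, pg, 2, "host")
        print(pg, by_osd, by_host)
```

With "host" as the failure domain, losing one node in this toy cluster
never removes both copies of a placement group; with "osd" it can.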
Regards,
George
Hi list,
I have been experimenting with CRUSH maps, and I tried to get
RAID1-like behaviour: if the cluster has only one working OSD node,
duplicate data across its local disks, to avoid data loss if a local
disk fails and to let clients keep working, because this is not a
degraded state.
(
In the best case, I want a dynamic rule, like:
if there is only one host -> spread data over local disks;
else if host count > 1 -> spread over hosts (racks or something else);
)
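The wished-for policy in the parentheses above can be written out as a
short Python sketch. This is only a restatement of the intent in plain
Python, not CRUSH rule syntax; the function name spread_replicas and the
host-to-OSD dictionary are hypothetical.

```python
def spread_replicas(hosts, replicas):
    """Return (host, osd) pairs for the copies of one object.

    hosts    -- dict mapping host name to a list of working OSD ids
    replicas -- number of copies wanted
    """
    # Ignore hosts that currently have no working OSDs.
    live = {h: osds for h, osds in hosts.items() if osds}
    if not live:
        return []
    if len(live) == 1:
        # Only one host: spread the copies over its local disks.
        host, osds = next(iter(live.items()))
        return [(host, osd) for osd in osds[:replicas]]
    # More than one host: put one copy on each host.
    return [(host, osds[0]) for host, osds in sorted(live.items())[:replicas]]


if __name__ == "__main__":
    print(spread_replicas({"node-01": [0, 1, 2]}, 2))
    print(spread_replicas({"node-01": [0, 1], "node-02": [2, 3]}, 2))
```

In the single-host case both copies land on different disks of the same
node; as soon as a second host has working OSDs, the copies move to
separate hosts.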
I wrote a rule like the one below:
rule test {
        ruleset 0
        type replicated
        min_size 0
        max_size 10
        step take default
        step choose firstn 0 type host
        step chooseleaf firstn 0 type osd
        step emit
}
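For readers following the rule above, here is a toy count of how many
candidates it can request. The sketch assumes the usual reading of rule
steps: each choose/chooseleaf step runs once per item produced by the
previous step, and "firstn 0" asks for as many items as the pool has
replicas. The function fan_out, its parameters, and the cluster shape
are hypothetical. This is plain arithmetic in Python, not CRUSH's real
implementation, and nothing in this thread establishes whether this
fan-out is what triggered the crash.

```python
def fan_out(step_types, replicas, children_per_parent):
    """Upper bound on candidates produced by a chain of `firstn 0` steps.

    step_types          -- bucket type named by each step, in rule order
    replicas            -- pool replica count (what `firstn 0` expands to)
    children_per_parent -- dict: bucket type -> items of that type
                           available under each parent bucket
    """
    count = 1  # `step take default` starts from a single root bucket
    for bucket_type in step_types:
        # Each current item yields at most `replicas` children, limited
        # by how many children of that type actually exist.
        count *= min(replicas, children_per_parent[bucket_type])
    return count


if __name__ == "__main__":
    # Hypothetical layouts: 3 OSDs per host, pool with 2 replicas.
    one_host = fan_out(["host", "osd"], 2, {"host": 1, "osd": 3})
    two_hosts = fan_out(["host", "osd"], 2, {"host": 2, "osd": 3})
    print(one_host, two_hosts)
```

In this model a single host yields 2 candidates for a 2-replica pool,
but two hosts yield 4, which is more than the pool asks for.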
I injected it into the cluster, and the client node now looks like it
hit a kernel panic: I've lost my connection to it. No ssh, no ping;
this is a remote node and I can't see what happened until Monday.
Yes, it looks like I've shot myself in the foot.
This is just a test setup, and destroying the cluster is not a problem,
but I think that broken rules must not crash anything else and, in the
worst case, should just be ignored by the cluster or the crushtool
compiler.
Maybe someone can explain how this rule can crash the system? Maybe
there is a crazy mistake somewhere?
--
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Have a nice day,
Timofey.