Monitor crash after changing replicated crush rulesets in jewel

Hi,

I've stumbled across a problem in jewel with respect to crush rulesets. Our setup currently defines two replicated rulesets:
# ceph osd crush rule list
[
    "replicated_ruleset",
    "replicated_ssd_only",
    "six_two_ec"
]
(the third ruleset is an EC ruleset)
Both rulesets are quite simple:
# ceph osd crush rule dump replicated_ssd_only
{
    "rule_id": 1,
    "rule_name": "replicated_ssd_only",
    "ruleset": 2,
    "type": 1,
    "min_size": 2,
    "max_size": 4,
    "steps": [
        {
            "op": "take",
            "item": -9,
            "item_name": "ssd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

# ceph osd crush rule dump replicated_ruleset
{
    "rule_id": 0,
    "rule_name": "replicated_ruleset",
    "ruleset": 0,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -3,
            "item_name": "default"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
The corresponding crush tree has two roots:
ID  WEIGHT    TYPE NAME                      UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -9   5.97263 root ssd
-18   0.53998     host ceph-storage-06-ssd
 86   0.26999         osd.86                    up  1.00000 1.00000
 88   0.26999         osd.88                    up  1.00000 1.00000
-19   0.53998     host ceph-storage-05-ssd
100   0.26999         osd.100                   up  1.00000 1.00000
 99   0.26999         osd.99                    up  1.00000 1.00000
...
 -3 531.43933 root default
-10  61.87991     host ceph-storage-02
 35   5.45999         osd.35                    up  1.00000 1.00000
 74   5.45999         osd.74                    up  1.00000 1.00000
111   5.45999         osd.111                   up  1.00000 1.00000
112   5.45999         osd.112                   up  1.00000 1.00000
113   5.45999         osd.113                   up  1.00000 1.00000
114   5.45999         osd.114                   up  1.00000 1.00000
115   5.45999         osd.115                   up  1.00000 1.00000
116   5.45999         osd.116                   up  1.00000 1.00000
117   5.45999         osd.117                   up  1.00000 1.00000
118   3.64000         osd.118                   up  1.00000 1.00000
119   5.45999         osd.119                   up  1.00000 1.00000
120   3.64000         osd.120                   up  1.00000 1.00000
....
So the first (default) ruleset should use the spinning rust and the second one the SSDs, a pretty standard setup for SSDs colocated with HDDs.
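In the decompiled crushmap, the two replicated rules should look roughly like this (a sketch reconstructed from the rule dumps above, not copied verbatim from our map):

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

rule replicated_ssd_only {
        ruleset 2
        type replicated
        min_size 2
        max_size 4
        step take ssd
        step chooseleaf firstn 0 type host
        step emit
}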

After changing the crush ruleset for an existing pool ('.log' from radosgw) to replicated_ssd_only, two of the three mons crashed, leaving the cluster inaccessible. The relevant log file content follows the command sketch below:
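The change itself was the usual pool set command; the invocation below is reconstructed from the mon_command entries in the log, with '2' being the ruleset id of replicated_ssd_only from the rule dump above:

# ceph osd pool set .log crush_ruleset 2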

....
-13> 2016-08-18 12:22:10.800961 7fb7b5ae2700 10 log_client _send_to_monlog to self
-12> 2016-08-18 12:22:10.800961 7fb7b5ae2700 10 log_client log_queue is 8 last_log 8 sent 7 num 8 unsent 1 sending 1
-11> 2016-08-18 12:22:10.800963 7fb7b5ae2700 10 log_client will send 2016-08-18 12:22:10.800960 mon.1 192.168.6.133:6789/0 8 : audit [INF] from='client.3839479 :/0' entity='unknown.' cmd=[{"var": "crush_ruleset", "prefix": "osd pool set", "pool": ".log", "val": "2"}]: dispatch
-10> 2016-08-18 12:22:10.800969 7fb7b5ae2700 1 -- 192.168.6.133:6789/0 --> 192.168.6.133:6789/0 -- log(1 entries from seq 8 at 2016-08-18 12:22:10.800960) v1 -- ?+0 0x7fb7cc4318c0 con 0x7fb7cb5f6e80
-9> 2016-08-18 12:22:10.800977 7fb7b5ae2700 5 -- op tracker -- seq: 92, time: 2016-08-18 12:22:10.800976, event: psvc:dispatch, op: mon_command({"var": "crush_ruleset", "prefix": "osd pool set", "pool": ".log", "val": "2"} v 0)
-8> 2016-08-18 12:22:10.800980 7fb7b5ae2700 5 mon.ceph-storage-05@1(leader).paxos(paxos active c 79420671..79421306) is_readable = 1 - now=2016-08-18 12:22:10.800980 lease_expire=2016-08-18 12:22:15.796784 has v0 lc 79421306
-7> 2016-08-18 12:22:10.800986 7fb7b5ae2700 5 -- op tracker -- seq: 92, time: 2016-08-18 12:22:10.800986, event: osdmap:preprocess_query, op: mon_command({"var": "crush_ruleset", "prefix": "osd pool set", "pool": ".log", "val": "2"} v 0)
-6> 2016-08-18 12:22:10.800992 7fb7b5ae2700 5 -- op tracker -- seq: 92, time: 2016-08-18 12:22:10.800992, event: osdmap:preprocess_command, op: mon_command({"var": "crush_ruleset", "prefix": "osd pool set", "pool": ".log", "val": "2"} v 0)
-5> 2016-08-18 12:22:10.801022 7fb7b5ae2700 5 -- op tracker -- seq: 92, time: 2016-08-18 12:22:10.801022, event: osdmap:prepare_update, op: mon_command({"var": "crush_ruleset", "prefix": "osd pool set", "pool": ".log", "val": "2"} v 0)
-4> 2016-08-18 12:22:10.801029 7fb7b5ae2700 5 -- op tracker -- seq: 92, time: 2016-08-18 12:22:10.801029, event: osdmap:prepare_command, op: mon_command({"var": "crush_ruleset", "prefix": "osd pool set", "pool": ".log", "val": "2"} v 0)
-3> 2016-08-18 12:22:10.801041 7fb7b5ae2700 5 -- op tracker -- seq: 92, time: 2016-08-18 12:22:10.801041, event: osdmap:prepare_command_impl, op: mon_command({"var": "crush_ruleset", "prefix": "osd pool set", "pool": ".log", "val": "2"} v 0)
-2> 2016-08-18 12:22:10.802750 7fb7af185700 1 -- 192.168.6.133:6789/0 >> :/0 pipe(0x7fb7cc373400 sd=56 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fb7cc34aa80).accept sd=56 192.168.6.132:53238/0
-1> 2016-08-18 12:22:10.802877 7fb7af185700 2 -- 192.168.6.133:6789/0 >> 192.168.6.132:6800/21078 pipe(0x7fb7cc373400 sd=56 :6789 s=2 pgs=89 cs=1 l=1 c=0x7fb7cc34aa80).reader got KEEPALIVE2 2016-08-18 12:22:10.802927
0> 2016-08-18 12:22:10.802989 7fb7b5ae2700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fb7b5ae2700 thread_name:ms_dispatch

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (()+0x5055ea) [0x7fb7bfc9d5ea]
 2: (()+0xf100) [0x7fb7be520100]
 3: (OSDMonitor::prepare_command_pool_set(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >&, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >&)+0x122f) [0x7fb7bfaa997f]
 4: (OSDMonitor::prepare_command_impl(std::shared_ptr<MonOpRequest>, std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >&)+0xf02c) [0x7fb7bfab968c]
 5: (OSDMonitor::prepare_command(std::shared_ptr<MonOpRequest>)+0x64f) [0x7fb7bfabe46f]
 6: (OSDMonitor::prepare_update(std::shared_ptr<MonOpRequest>)+0x307) [0x7fb7bfabffc7]
 7: (PaxosService::dispatch(std::shared_ptr<MonOpRequest>)+0xe0b) [0x7fb7bfa6e60b]
 8: (Monitor::handle_command(std::shared_ptr<MonOpRequest>)+0x1d22) [0x7fb7bfa2a4f2]
 9: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0x33b) [0x7fb7bfa3617b]
 10: (Monitor::_ms_dispatch(Message*)+0x6c9) [0x7fb7bfa37519]
 11: (Monitor::handle_forward(std::shared_ptr<MonOpRequest>)+0x89c) [0x7fb7bfa359ac]
 12: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0xc70) [0x7fb7bfa36ab0]
 13: (Monitor::_ms_dispatch(Message*)+0x6c9) [0x7fb7bfa37519]
 14: (Monitor::ms_dispatch(Message*)+0x23) [0x7fb7bfa58063]
 15: (DispatchQueue::entry()+0x78a) [0x7fb7bfeb0d1a]
 16: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fb7bfda620d]
 17: (()+0x7dc5) [0x7fb7be518dc5]
 18: (clone()+0x6d) [0x7fb7bcde0ced]

The complete log is available on request. I was able to recover the cluster by fencing the third, still active mon (shutting down its network interface) and restarting the other two mons. They kept crashing after a short time with the same stack trace until I was able to issue the command to change the crush ruleset back to 'replicated_ruleset'. After re-enabling the network interface and restarting the services, the third mon (and the OSD on that host) rejoined the cluster.
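For reference, the recovery boiled down to roughly the following sequence (the interface name is a placeholder and the exact restart order may differ slightly from what I actually typed):

On the host of the third, still-active mon, to take it out of quorum:
# ip link set <cluster interface> down

After restarting the two crashing mons so they could form a quorum again, revert the pool ('0' being the ruleset id of replicated_ruleset from the dump above):
# ceph osd pool set .log crush_ruleset 0

Then bring the interface back up and restart the mon and OSD services on the fenced host so they rejoin the cluster.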

Regards,
Burkhard

