Sounds like https://tracker.ceph.com/issues/24982
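The backtrace fits that ticket: frames 9 and 10 show DaemonPerfCounters::update()
throwing std::out_of_range, nothing catches it, so the dispatch thread hits
std::terminate() and aborts. Roughly speaking (a minimal sketch with made-up
names, not Ceph's actual code), it's the classic map::at() pattern: perf
counter values arrive keyed by indices into a schema declared earlier on the
session, and a report referencing an index the mgr never saw declared blows
up — plausible when two processes report under the same daemon name, as both
of your RGWs do as rgw.radosgw.gateway:

#include <cstdint>
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified model of a perf-counter registry -- not
// Ceph's actual DaemonPerfCounters code.
struct PerfCounters {
  std::map<uint32_t, std::string> declared;   // index -> counter path
  std::map<std::string, uint64_t> values;

  // Crash-prone form: map::at() throws std::out_of_range when a report
  // carries a value for an index that was never declared on this
  // session. Uncaught, the exception reaches std::terminate() and the
  // process aborts -- matching frames 5-10 of the backtrace below.
  void update(const std::vector<std::pair<uint32_t, uint64_t>>& report) {
    for (const auto& [idx, val] : report)
      values[declared.at(idx)] = val;
  }

  // Defensive form: skip unknown indices instead of aborting the daemon.
  void update_defensive(const std::vector<std::pair<uint32_t, uint64_t>>& report) {
    for (const auto& [idx, val] : report) {
      auto it = declared.find(idx);
      if (it == declared.end()) {
        std::cerr << "ignoring undeclared counter index " << idx << "\n";
        continue;
      }
      values[it->second] = val;
    }
  }
};

int main() {
  PerfCounters pc;
  pc.declared = {{0, "rgw.qlen"}};          // only index 0 was declared
  pc.update_defensive({{0, 42}, {7, 1}});   // index 7 is silently skipped
  try {
    pc.update({{7, 1}});                    // throws std::out_of_range
  } catch (const std::out_of_range& e) {
    std::cerr << "this is what aborts the mgr when uncaught: " << e.what() << "\n";
  }
}

If the trigger really is the shared daemon name, giving each RGW process a
distinct name might work around it until a fixed release is out — but that's
an assumption on my part; the tracker is the authoritative reference.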
On Wed, Aug 1, 2018 at 10:18 AM Burkhard Linke
<Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> I'm currently upgrading our ceph cluster to 12.2.7. Most steps are fine,
> but all mgr instances abort after restarting:
>
> ....
>
>    -10> 2018-08-01 09:57:46.357696 7fc481221700  5 -- 192.168.6.134:6856/5968 >> 192.168.6.131:6814/2743 conn(0x564cf2bf9000 :6856 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=94 cs=1 l=1). rx osd.70 seq 24 0x564cf4708c00 mgrreport(osd.70 +0-0 packed 742 osd_metrics=1) v5
>     -9> 2018-08-01 09:57:46.357715 7fc46bf8e700  1 -- 192.168.6.134:6856/5968 <== osd.70 192.168.6.131:6814/2743 24 ==== mgrreport(osd.70 +0-0 packed 742 osd_metrics=1) v5 ==== 784+0+0 (3768598180 0 0) 0x564cf4708c00 con 0x564cf2bf9000
>     -8> 2018-08-01 09:57:46.357721 7fc46bf8e700  4 mgr.server handle_report from 0x564cf2bf9000 osd,70
>     -7> 2018-08-01 09:57:46.358255 7fc481221700  5 -- 192.168.6.134:6856/5968 >> 192.168.6.137:6800/2921 conn(0x564cf2c90000 :6856 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=28 cs=1 l=1). rx osd.20 seq 25 0x564cf4da63c0 pg_stats(72 pgs tid 0 v 0) v1
>     -6> 2018-08-01 09:57:46.358303 7fc46bf8e700  1 -- 192.168.6.134:6856/5968 <== osd.20 192.168.6.137:6800/2921 25 ==== pg_stats(72 pgs tid 0 v 0) v1 ==== 42756+0+0 (3715458660 0 0) 0x564cf4da63c0 con 0x564cf2c90000
>     -5> 2018-08-01 09:57:46.358432 7fc481221700  5 -- 192.168.6.134:6856/5968 >> 192.168.6.131:6814/2743 conn(0x564cf2bf9000 :6856 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=94 cs=1 l=1). rx osd.70 seq 25 0x564cf4db2ec0 pg_stats(54 pgs tid 0 v 0) v1
>     -4> 2018-08-01 09:57:46.358447 7fc46bf8e700  1 -- 192.168.6.134:6856/5968 <== osd.70 192.168.6.131:6814/2743 25 ==== pg_stats(54 pgs tid 0 v 0) v1 ==== 32928+0+0 (3225946058 0 0) 0x564cf4db2ec0 con 0x564cf2bf9000
>     -3> 2018-08-01 09:57:46.368820 7fc480a20700  5 -- 192.168.6.134:6856/5968 >> 192.168.6.135:0/1209706031 conn(0x564cf2f8d000 :6856 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=28 cs=1 l=1). rx client.19838915 seq 13 0x564cf44cd500 mgrreport(rgw.radosgw.gateway +0-0 packed 3382) v5
>     -2> 2018-08-01 09:57:46.368880 7fc46bf8e700  1 -- 192.168.6.134:6856/5968 <== client.19838915 192.168.6.135:0/1209706031 13 ==== mgrreport(rgw.radosgw.gateway +0-0 packed 3382) v5 ==== 3425+0+0 (3985820496 0 0) 0x564cf44cd500 con 0x564cf2f8d000
>     -1> 2018-08-01 09:57:46.368895 7fc46bf8e700  4 mgr.server handle_report from 0x564cf2f8d000 rgw,radosgw.gateway
>      0> 2018-08-01 09:57:46.371034 7fc46bf8e700 -1 *** Caught signal (Aborted) **
>  in thread 7fc46bf8e700 thread_name:ms_dispatch
>
>  ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
>  1: (()+0x40e744) [0x564ce68e1744]
>  2: (()+0x11390) [0x7fc484ede390]
>  3: (gsignal()+0x38) [0x7fc483e6e428]
>  4: (abort()+0x16a) [0x7fc483e7002a]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fc4847b184d]
>  6: (()+0x8d6b6) [0x7fc4847af6b6]
>  7: (()+0x8d701) [0x7fc4847af701]
>  8: (()+0x8d919) [0x7fc4847af919]
>  9: (std::__throw_out_of_range(char const*)+0x3f) [0x7fc4847d82cf]
>  10: (DaemonPerfCounters::update(MMgrReport*)+0x197c) [0x564ce6775dec]
>  11: (DaemonServer::handle_report(MMgrReport*)+0x269) [0x564ce677e3d9]
>  12: (DaemonServer::ms_dispatch(Message*)+0x47) [0x564ce678c5a7]
>  13: (DispatchQueue::entry()+0xf4a) [0x564ce6c3baba]
>  14: (DispatchQueue::DispatchThread::entry()+0xd) [0x564ce69dcaed]
>  15: (()+0x76ba) [0x7fc484ed46ba]
>  16: (clone()+0x6d) [0x7fc483f4041d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> The cause seems to be the RGW instances in our cluster. We use an HA
> setup with pacemaker and haproxy on three hosts; two different RGW
> setups serve internal and external users (three hosts with two RGW
> processes each, six processes overall). As soon as two instances on two
> different hosts are active, the mgrs crash with this stack trace.
>
> I've reduced the number of active RGW instances to one to stop the mgrs
> from crashing. Is this a regression in 12.2.7 for HA setups?
>
> Regards,
> Burkhard

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com