Re: mgr abort during upgrade 12.2.5 -> 12.2.7 due to multiple active RGW clones

Sounds like https://tracker.ceph.com/issues/24982
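
For what it's worth, frames 9 and 10 of the backtrace below
(std::__throw_out_of_range escaping DaemonPerfCounters::update()) point
at an unhandled out-of-range lookup on the mgr side. Here is a minimal
sketch of that failure mode, assuming the trigger is two gateways
reporting under the same daemon name, so one instance's packed counter
updates reference types that only the other instance's session declared.
All names in the sketch are hypothetical stand-ins, not the actual Ceph
code:

#include <cstdint>
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>

// Stand-in for the mgr's per-daemon perf counter state.
struct DaemonCounters {
  // Counter types declared by the daemon's report schema.
  std::map<std::string, uint64_t> declared;

  // Stand-in for DaemonPerfCounters::update(): apply the batch of
  // (path, value) updates that the daemon packed into its report.
  void update(const std::map<std::string, uint64_t>& packed) {
    for (const auto& [path, value] : packed) {
      // std::map::at() throws std::out_of_range when the path was
      // never declared -- matching the __throw_out_of_range frame
      // in the posted backtrace.
      declared.at(path) = value;
    }
  }
};

int main() {
  DaemonCounters d;
  d.declared = {{"rgw.req", 0}};  // schema from instance A's session

  try {
    // Instance B reports under the same daemon name but with a
    // counter that instance A never declared.
    d.update({{"rgw.qlen", 42}});
  } catch (const std::out_of_range& e) {
    // In the mgr there is no handler on this path, so the exception
    // reaches __verbose_terminate_handler() and abort().
    std::cerr << "out_of_range: " << e.what() << "\n";
  }
}

That would also fit your observation that a single active instance is
stable: with only one reporter per daemon name, the declared schema and
the packed updates always agree.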
On Wed, Aug 1, 2018 at 10:18 AM Burkhard Linke
<Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
>
> I'm currently upgrading our ceph cluster to 12.2.7. Most steps are fine,
> but all mgr instances abort after restarting:
>
>
> ....
>
>     -10> 2018-08-01 09:57:46.357696 7fc481221700  5 -- 192.168.6.134:6856/5968 >> 192.168.6.131:6814/2743 conn(0x564cf2bf9000 :6856 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=94 cs=1 l=1). rx osd.70 seq 24 0x564cf4708c00 mgrreport(osd.70 +0-0 packed 742 osd_metrics=1) v5
>      -9> 2018-08-01 09:57:46.357715 7fc46bf8e700  1 -- 192.168.6.134:6856/5968 <== osd.70 192.168.6.131:6814/2743 24 ==== mgrreport(osd.70 +0-0 packed 742 osd_metrics=1) v5 ==== 784+0+0 (3768598180 0 0) 0x564cf4708c00 con 0x564cf2bf9000
>      -8> 2018-08-01 09:57:46.357721 7fc46bf8e700  4 mgr.server handle_report from 0x564cf2bf9000 osd,70
>      -7> 2018-08-01 09:57:46.358255 7fc481221700  5 -- 192.168.6.134:6856/5968 >> 192.168.6.137:6800/2921 conn(0x564cf2c90000 :6856 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=28 cs=1 l=1). rx osd.20 seq 25 0x564cf4da63c0 pg_stats(72 pgs tid 0 v 0) v1
>      -6> 2018-08-01 09:57:46.358303 7fc46bf8e700  1 -- 192.168.6.134:6856/5968 <== osd.20 192.168.6.137:6800/2921 25 ==== pg_stats(72 pgs tid 0 v 0) v1 ==== 42756+0+0 (3715458660 0 0) 0x564cf4da63c0 con 0x564cf2c90000
>      -5> 2018-08-01 09:57:46.358432 7fc481221700  5 -- 192.168.6.134:6856/5968 >> 192.168.6.131:6814/2743 conn(0x564cf2bf9000 :6856 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=94 cs=1 l=1). rx osd.70 seq 25 0x564cf4db2ec0 pg_stats(54 pgs tid 0 v 0) v1
>      -4> 2018-08-01 09:57:46.358447 7fc46bf8e700  1 -- 192.168.6.134:6856/5968 <== osd.70 192.168.6.131:6814/2743 25 ==== pg_stats(54 pgs tid 0 v 0) v1 ==== 32928+0+0 (3225946058 0 0) 0x564cf4db2ec0 con 0x564cf2bf9000
>      -3> 2018-08-01 09:57:46.368820 7fc480a20700  5 -- 192.168.6.134:6856/5968 >> 192.168.6.135:0/1209706031 conn(0x564cf2f8d000 :6856 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=28 cs=1 l=1). rx client.19838915 seq 13 0x564cf44cd500 mgrreport(rgw.radosgw.gateway +0-0 packed 3382) v5
>      -2> 2018-08-01 09:57:46.368880 7fc46bf8e700  1 -- 192.168.6.134:6856/5968 <== client.19838915 192.168.6.135:0/1209706031 13 ==== mgrreport(rgw.radosgw.gateway +0-0 packed 3382) v5 ==== 3425+0+0 (3985820496 0 0) 0x564cf44cd500 con 0x564cf2f8d000
>      -1> 2018-08-01 09:57:46.368895 7fc46bf8e700  4 mgr.server handle_report from 0x564cf2f8d000 rgw,radosgw.gateway
>       0> 2018-08-01 09:57:46.371034 7fc46bf8e700 -1 *** Caught signal (Aborted) **
>   in thread 7fc46bf8e700 thread_name:ms_dispatch
>
>   ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
>   1: (()+0x40e744) [0x564ce68e1744]
>   2: (()+0x11390) [0x7fc484ede390]
>   3: (gsignal()+0x38) [0x7fc483e6e428]
>   4: (abort()+0x16a) [0x7fc483e7002a]
>   5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fc4847b184d]
>   6: (()+0x8d6b6) [0x7fc4847af6b6]
>   7: (()+0x8d701) [0x7fc4847af701]
>   8: (()+0x8d919) [0x7fc4847af919]
>   9: (std::__throw_out_of_range(char const*)+0x3f) [0x7fc4847d82cf]
>   10: (DaemonPerfCounters::update(MMgrReport*)+0x197c) [0x564ce6775dec]
>   11: (DaemonServer::handle_report(MMgrReport*)+0x269) [0x564ce677e3d9]
>   12: (DaemonServer::ms_dispatch(Message*)+0x47) [0x564ce678c5a7]
>   13: (DispatchQueue::entry()+0xf4a) [0x564ce6c3baba]
>   14: (DispatchQueue::DispatchThread::entry()+0xd) [0x564ce69dcaed]
>   15: (()+0x76ba) [0x7fc484ed46ba]
>   16: (clone()+0x6d) [0x7fc483f4041d]
>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
>
>
> The cause seems to be the RGW instances in our cluster. We use an HA
> setup with pacemaker and haproxy on three hosts; two separate RGW
> setups serve internal and external users (six RGW processes overall,
> two per host). As soon as two instances on two different hosts are
> active, the mgrs crash with this stack trace.
>
> I've reduced the number of active RGW instances to one to stop the mgr
> from crashing. Is this a regression in 12.2.7 for HA setups?
>
>
> Regards,
>
> Burkhard
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
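
Until the fix lands on your cluster, one possible workaround (beyond
dropping to a single instance) may be to register each gateway under a
distinct name, since both reports above arrive as rgw.radosgw.gateway.
A sketch of what that could look like in ceph.conf -- section names and
frontend lines are placeholders for your actual setup:

[client.rgw.gateway-internal]
    rgw frontends = civetweb port=7480

[client.rgw.gateway-external]
    rgw frontends = civetweb port=7481

Each radosgw process would then be started with the matching name (e.g.
radosgw -n client.rgw.gateway-internal), so the mgr sees two distinct
daemons instead of two reporters colliding on one name.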
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


