mgr abort during upgrade 12.2.5 -> 12.2.7 due to multiple active RGW clones

Hi,


I'm currently upgrading our Ceph cluster from 12.2.5 to 12.2.7. Most steps are fine, but all mgr instances abort after being restarted:


....

   -10> 2018-08-01 09:57:46.357696 7fc481221700  5 -- 192.168.6.134:6856/5968 >> 192.168.6.131:6814/2743 conn(0x564cf2bf9000 :6856 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=94 cs=1 l=1). rx osd.70 seq 24 0x564cf4708c00 mgrreport(osd.70 +0-0 packed 742 osd_metrics=1) v5
    -9> 2018-08-01 09:57:46.357715 7fc46bf8e700  1 -- 192.168.6.134:6856/5968 <== osd.70 192.168.6.131:6814/2743 24 ==== mgrreport(osd.70 +0-0 packed 742 osd_metrics=1) v5 ==== 784+0+0 (3768598180 0 0) 0x564cf4708c00 con 0x564cf2bf9000
    -8> 2018-08-01 09:57:46.357721 7fc46bf8e700  4 mgr.server handle_report from 0x564cf2bf9000 osd,70
    -7> 2018-08-01 09:57:46.358255 7fc481221700  5 -- 192.168.6.134:6856/5968 >> 192.168.6.137:6800/2921 conn(0x564cf2c90000 :6856 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=28 cs=1 l=1). rx osd.20 seq 25 0x564cf4da63c0 pg_stats(72 pgs tid 0 v 0) v1
    -6> 2018-08-01 09:57:46.358303 7fc46bf8e700  1 -- 192.168.6.134:6856/5968 <== osd.20 192.168.6.137:6800/2921 25 ==== pg_stats(72 pgs tid 0 v 0) v1 ==== 42756+0+0 (3715458660 0 0) 0x564cf4da63c0 con 0x564cf2c90000
    -5> 2018-08-01 09:57:46.358432 7fc481221700  5 -- 192.168.6.134:6856/5968 >> 192.168.6.131:6814/2743 conn(0x564cf2bf9000 :6856 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=94 cs=1 l=1). rx osd.70 seq 25 0x564cf4db2ec0 pg_stats(54 pgs tid 0 v 0) v1
    -4> 2018-08-01 09:57:46.358447 7fc46bf8e700  1 -- 192.168.6.134:6856/5968 <== osd.70 192.168.6.131:6814/2743 25 ==== pg_stats(54 pgs tid 0 v 0) v1 ==== 32928+0+0 (3225946058 0 0) 0x564cf4db2ec0 con 0x564cf2bf9000
    -3> 2018-08-01 09:57:46.368820 7fc480a20700  5 -- 192.168.6.134:6856/5968 >> 192.168.6.135:0/1209706031 conn(0x564cf2f8d000 :6856 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=28 cs=1 l=1). rx client.19838915 seq 13 0x564cf44cd500 mgrreport(rgw.radosgw.gateway +0-0 packed 3382) v5
    -2> 2018-08-01 09:57:46.368880 7fc46bf8e700  1 -- 192.168.6.134:6856/5968 <== client.19838915 192.168.6.135:0/1209706031 13 ==== mgrreport(rgw.radosgw.gateway +0-0 packed 3382) v5 ==== 3425+0+0 (3985820496 0 0) 0x564cf44cd500 con 0x564cf2f8d000
    -1> 2018-08-01 09:57:46.368895 7fc46bf8e700  4 mgr.server handle_report from 0x564cf2f8d000 rgw,radosgw.gateway
     0> 2018-08-01 09:57:46.371034 7fc46bf8e700 -1 *** Caught signal (Aborted) **
 in thread 7fc46bf8e700 thread_name:ms_dispatch

 ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
 1: (()+0x40e744) [0x564ce68e1744]
 2: (()+0x11390) [0x7fc484ede390]
 3: (gsignal()+0x38) [0x7fc483e6e428]
 4: (abort()+0x16a) [0x7fc483e7002a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fc4847b184d]
 6: (()+0x8d6b6) [0x7fc4847af6b6]
 7: (()+0x8d701) [0x7fc4847af701]
 8: (()+0x8d919) [0x7fc4847af919]
 9: (std::__throw_out_of_range(char const*)+0x3f) [0x7fc4847d82cf]
 10: (DaemonPerfCounters::update(MMgrReport*)+0x197c) [0x564ce6775dec]
 11: (DaemonServer::handle_report(MMgrReport*)+0x269) [0x564ce677e3d9]
 12: (DaemonServer::ms_dispatch(Message*)+0x47) [0x564ce678c5a7]
 13: (DispatchQueue::entry()+0xf4a) [0x564ce6c3baba]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x564ce69dcaed]
 15: (()+0x76ba) [0x7fc484ed46ba]
 16: (clone()+0x6d) [0x7fc483f4041d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
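For what it's worth, the abort is an uncaught std::out_of_range thrown from DaemonPerfCounters::update(), and the last report handled before the crash is from rgw.radosgw.gateway. My guess is that the mgr keys its per-daemon perf counter state by daemon name, so two RGW processes reporting under the same name share one counter-type table while each connection tracks its own declared counters. A minimal sketch of how I read the failure mode (this is not the actual mgr source; the structure, counter paths and names below are made up for illustration):

#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical, much-simplified model of the suspected failure mode: both RGW
// processes report as "rgw.radosgw.gateway", so they share one per-daemon
// counter-type table, while each session declares its own counter paths.
struct DaemonPerfCounters {
  std::map<std::string, int> types;  // counter path -> type info (simplified)

  void update(const std::set<std::string> &session_declared_types) {
    for (const auto &path : session_declared_types) {
      // std::map::at() throws std::out_of_range if the path is missing, e.g.
      // because the other instance sharing this daemon name has meanwhile
      // undeclared (erased) the type. Uncaught in the dispatch thread, this
      // ends in std::terminate() and abort(), as in the trace above.
      const auto &t = types.at(path);
      (void)t;
    }
  }
};

int main() {
  DaemonPerfCounters counters;  // shared state for the name "rgw.radosgw.gateway"
  counters.types = {{"rgw.req", 1}, {"rgw.qlen", 1}};

  std::set<std::string> declared_by_b = {"rgw.req", "rgw.qlen"};

  counters.types.erase("rgw.qlen");  // report from instance A removes a counter type
  counters.update(declared_by_b);    // report from instance B still references it -> abort
}

If that reading is right, the crash needs reports for the same daemon name arriving over different connections, which would match it only appearing once a second RGW instance on another host becomes active.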


The cause seems to be the RGW instances in our cluster. We use an HA setup with pacemaker and haproxy on three hosts; two different RGW setups serve internal and external users (three hosts with two RGW processes each, i.e. six overall). As soon as two instances on two different hosts are active, the mgrs crash with this stack trace.
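For reference, the RGW instances are defined along these lines in ceph.conf (simplified, with hypothetical section names, ports and DNS names; the essential point is that the pacemaker clones start radosgw with the same client name on every host, so the mgr sees several daemons all reporting as rgw.radosgw.gateway, as in the log above):

# Shared ceph.conf on all three RGW hosts (sketch, names/ports hypothetical).
# The pacemaker clone starts the same client instance on each host, so every
# such process registers with the mgr under the same daemon name.
[client.radosgw.gateway]
    rgw frontends = civetweb port=7480      ; internal users, behind haproxy
    rgw dns name = rgw.internal.example.com

[client.radosgw.gateway-ext]
    rgw frontends = civetweb port=7481      ; external users, behind haproxy
    rgw dns name = rgw.example.com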

I've reduced the number of active RGW instances to one to stop the mgr from crashing. Is this a regression in 12.2.7 for HA setups?


Regards,

Burkhard


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



