Re: ceph-mgr SIGABRTs on startup after cluster upgrade from Kraken to Luminous

On Tue, Sep 12, 2017 at 3:12 PM, Katie Holly <holly@xxxxxxxxx> wrote:
> Ben and Brad,
>
> big thanks to both of you for helping me track down this issue. It was - seemingly - caused by more than one radosgw instance sharing the exact same --name value, and it was solved by generating a unique key and --name value for each radosgw instance.
>
> Right now, all ceph-mgr daemons seem to run perfectly stable, but I'll definitely keep a close eye on the cluster and report back if I see any other issues.
>
> I updated the tracker with this information as well, so the developers can hopefully fix this nasty bug or at least add a warning somewhere that one shouldn't run a setup like this.
>
> http://tracker.ceph.com/issues/21197#note-4
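>
> For anyone who runs into the same thing, a rough sketch of the fix (entity names and caps here are illustrative, mirroring the usual rgw client caps, not necessarily exactly what I used):
>
>     # one cephx identity per radosgw instance instead of a shared one
>     for i in $(seq 1 15); do
>       ceph auth get-or-create client.rgw.docker-$i \
>         mon 'allow rw' osd 'allow rwx' \
>         -o /etc/ceph/ceph.client.rgw.docker-$i.keyring
>     done
>
>     # each container then starts radosgw under its own identity
>     /usr/bin/radosgw -d \
>       --name=client.rgw.docker-$i \
>       --keyring=/etc/ceph/ceph.client.rgw.docker-$i.keyring
>       # (remaining flags as in the exec args quoted below)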

Thanks for letting us know the result, Katie. I'm sure this issue will
receive some love in the not-too-distant future :)

>
> --
> Katie
> On 2017-09-12 06:20, Katie Holly wrote:
>> They all share the exact same exec arguments, so yes, they all have the same --name as well. I'll try to run them with different --name parameters to see if that solves the issue.
>>
>> --
>> Katie
>>
>> On 2017-09-12 06:13, Ben Hines wrote:
>>> Do the Docker containers all have the same rgw --name? Maybe that is confusing Ceph...
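>>>
>>> You could verify that from the host with something like this (assuming plain docker, no orchestrator in between):
>>>
>>>     docker ps -q | xargs docker inspect --format '{{.Name}} {{.Args}}' | grep rgw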
>>>
>>> On Mon, Sep 11, 2017 at 9:11 PM, Katie Holly <holly@xxxxxxxxx> wrote:
>>>
>>>     All radosgw instances are running
>>>     > ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
>>>     as Docker containers; there are 15 of them at any given time.
>>>
>>>
>>>     The "config"/exec-args for the radosgw instances are:
>>>
>>>     /usr/bin/radosgw \
>>>       -d \
>>>       --cluster=ceph \
>>>       --conf=/dev/null \
>>>       --debug-ms=0 \
>>>       --debug-rgw=0/0 \
>>>       --keyring=/etc/ceph/ceph.client.rgw.docker.keyring \
>>>       --logfile=/dev/null \
>>>       --mon-host=mon.ceph.fks.de.fvz.io \
>>>       --name=client.rgw.docker \
>>>       --rgw-content-length-compat=true \
>>>       --rgw-dns-name=de-fks-1.rgw.li \
>>>       --rgw-region=eu \
>>>       --rgw-zone=eu-de-fks-1 \
>>>       --setgroup=ceph \
>>>       --setuser=ceph
>>>
>>>
>>>     Scaling this Docker radosgw cluster down to just 1 instance seems to allow ceph-mgr to run without issues, but as soon as I increase the number of radosgw instances, the risk of ceph-mgr crashing at any random time also increases.
>>>
>>>     It seems that 2 radosgw instances are also fine; anything higher than that causes issues. Maybe a race condition?
>>>
>>>     --
>>>     Katie
>>>     On 2017-09-12 05:24, Brad Hubbard wrote:
>>>     > It seems like it's choking on the report from the rados gateway. What
>>>     > version is the rgw node running?
>>>     >
>>>     > If possible, could you shut down the rgw and see if you can then start ceph-mgr?
>>>     >
>>>     > Pure stab in the dark just to see if the problem is tied to the rgw instance.
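>>>     >
>>>     > Something along these lines, depending on how the rgw is deployed (the
>>>     > unit name below is the usual systemd pattern, adjust to match yours):
>>>     >
>>>     >     systemctl stop ceph-radosgw@rgw.$(hostname -s)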
>>>     >
>>>     > On Tue, Sep 12, 2017 at 1:07 PM, Katie Holly <holly@xxxxxxxxx> wrote:
>>>     >> Thanks, I totally forgot to check the tracker. I added the information I collected there, but I don't have enough experience with Ceph to dig through this myself, so let's see if someone is willing to sacrifice their free time to help debug this issue.
>>>     >>
>>>     >> --
>>>     >> Katie
>>>     >>
>>>     >> On 2017-09-12 03:15, Brad Hubbard wrote:
>>>     >>> Looks like there is a tracker opened for this.
>>>     >>>
>>>     >>> http://tracker.ceph.com/issues/21197
>>>     >>>
>>>     >>> Please add your details there.
>>>     >>>
>>>     >>> On Tue, Sep 12, 2017 at 11:04 AM, Katie Holly <holly@xxxxxxxxx> wrote:
>>>     >>>> Hi,
>>>     >>>>
>>>     >>>> I recently upgraded one of our clusters from Kraken to Luminous (the cluster was initialized with Jewel) on Ubuntu 16.04 and deployed ceph-mgr on all of our ceph-mon nodes with ceph-deploy.
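>>>     >>>>
>>>     >>>> The mgr deployment was the standard ceph-deploy invocation, along these lines (hostnames are placeholders):
>>>     >>>>
>>>     >>>>     ceph-deploy mgr create mon1 mon2 mon3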
>>>     >>>>
>>>     >>>> Related log entries after initial deployment of ceph-mgr:
>>>     >>>>
>>>     >>>> 2017-09-11 06:41:53.535025 7fb5aa7b8500  0 set uid:gid to 64045:64045 (ceph:ceph)
>>>     >>>> 2017-09-11 06:41:53.535048 7fb5aa7b8500  0 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process (unknown), pid 17031
>>>     >>>> 2017-09-11 06:41:53.536853 7fb5aa7b8500  0 pidfile_write: ignore empty --pid-file
>>>     >>>> 2017-09-11 06:41:53.541880 7fb5aa7b8500  1 mgr send_beacon standby
>>>     >>>> 2017-09-11 06:41:54.547383 7fb5a1aec700  1 mgr handle_mgr_map Activating!
>>>     >>>> 2017-09-11 06:41:54.547575 7fb5a1aec700  1 mgr handle_mgr_map I am now activating
>>>     >>>> 2017-09-11 06:41:54.650677 7fb59dae4700  1 mgr start Creating threads for 0 modules
>>>     >>>> 2017-09-11 06:41:54.650696 7fb59dae4700  1 mgr send_beacon active
>>>     >>>> 2017-09-11 06:41:55.542252 7fb59eae6700  1 mgr send_beacon active
>>>     >>>> 2017-09-11 06:41:55.542627 7fb59eae6700  1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
>>>     >>>> 2017-09-11 06:41:57.542697 7fb59eae6700  1 mgr send_beacon active
>>>     >>>> [... lots of "send_beacon active" messages ...]
>>>     >>>> 2017-09-11 07:29:29.640892 7fb59eae6700  1 mgr send_beacon active
>>>     >>>> 2017-09-11 07:29:30.866366 7fb59d2e3700 -1 *** Caught signal (Aborted) **
>>>     >>>>  in thread 7fb59d2e3700 thread_name:ms_dispatch
>>>     >>>>
>>>     >>>>  ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
>>>     >>>>  1: (()+0x3de6b4) [0x55f6640e16b4]
>>>     >>>>  2: (()+0x11390) [0x7fb5a8fef390]
>>>     >>>>  3: (gsignal()+0x38) [0x7fb5a7f7f428]
>>>     >>>>  4: (abort()+0x16a) [0x7fb5a7f8102a]
>>>     >>>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fb5a88c284d]
>>>     >>>>  6: (()+0x8d6b6) [0x7fb5a88c06b6]
>>>     >>>>  7: (()+0x8d701) [0x7fb5a88c0701]
>>>     >>>>  8: (()+0x8d919) [0x7fb5a88c0919]
>>>     >>>>  9: (()+0x2318ad) [0x55f663f348ad]
>>>     >>>>  10: (()+0x3e91bd) [0x55f6640ec1bd]
>>>     >>>>  11: (DaemonPerfCounters::update(MMgrReport*)+0x821) [0x55f663f96651]
>>>     >>>>  12: (DaemonServer::handle_report(MMgrReport*)+0x1ae) [0x55f663f9b79e]
>>>     >>>>  13: (DaemonServer::ms_dispatch(Message*)+0x64) [0x55f663fa8d64]
>>>     >>>>  14: (DispatchQueue::entry()+0xf4a) [0x55f664438f3a]
>>>     >>>>  15: (DispatchQueue::DispatchThread::entry()+0xd) [0x55f6641dc44d]
>>>     >>>>  16: (()+0x76ba) [0x7fb5a8fe56ba]
>>>     >>>>  17: (clone()+0x6d) [0x7fb5a80513dd]
>>>     >>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>     >>>>
>>>     >>>> --- begin dump of recent events ---
>>>     >>>> [...]
>>>     >>>>
>>>     >>>>
>>>     >>>> I tried to manually run ceph-mgr with
>>>     >>>>> /usr/bin/ceph-mgr -f --cluster ceph --id $HOSTNAME --setuser ceph --setgroup ceph
>>>     >>>> which never keeps running for longer than a few seconds.
>>>     >>>> stdout: http://xor.meo.ws/OyvoZF8v0aWq0D-rOOg2y6u03fp_yzYv.txt
>>>     >>>> logs: http://xor.meo.ws/jcMyjabCfFbTcfZ8GOangLdSfSSqJffr.txt
>>>     >>>> objdump: http://xor.meo.ws/oxo2q8h_oKAG6q7mARvNKkR_JdYjn89B.txt
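>>>     >>>>
>>>     >>>> I can also re-run it with higher debug levels if that would help, something like:
>>>     >>>>
>>>     >>>>     /usr/bin/ceph-mgr -f --cluster ceph --id $HOSTNAME \
>>>     >>>>       --setuser ceph --setgroup ceph \
>>>     >>>>       --debug-mgr 20 --debug-ms 1 2>&1 | tee ceph-mgr-debug.log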
>>>     >>>>
>>>     >>>> Has anyone seen an issue like this before, and does anyone know how to debug or even fix it?
>>>     >>>>
>>>     >>>>
>>>     >>>> --
>>>     >>>> Katie
>>>     >>>
>>>     >>>
>>>     >>>
>>>     >
>>>     >
>>>     >
>>>
>>>
>>



-- 
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


