Looks like the problem starts here:

  -7145> 2020-06-30 21:27:08.626 7fcb54b0d700  2 osd.30 385679 got incremental 385680 but failed to encode full with correct crc; requesting
  -7139> 2020-06-30 21:27:08.626 7fcb54b0d700  0 log_channel(cluster) log [WRN] : failed to encode map e385680 with expected crc

then eventually there's a crash in _committed_osd_maps.

Commit fa842716b6dc3b2077e296d388c646f1605568b0 touched the osdmap code
in 14.2.10, so I wonder if there's a bug in there.

Otherwise, the question I ask everyone with osdmap issues these days:
are you using bluestore compression and lz4?

Cheers, Dan

On Fri, Jul 10, 2020 at 9:45 AM Markus Binz <mbinz@xxxxxxxxx> wrote:
>
> Hi,
>
> i just uploaded one of 196 crash reports.
>
> https://tracker.ceph.com/issues/46443
>
> I tried to debug it myself... but it took too much time.
>
> Markus
>
> On 06.07.20 11:00, Dan van der Ster wrote:
> > Hi Markus,
> >
> > Did you make any progress with this?
> > (Selfishly pinging to better understand whether the 14.2.9 to 14.2.10
> > upgrade is safe or not.)
> >
> > Cheers, Dan
> >
> >
> > On Wed, Jul 1, 2020 at 9:30 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >>
> >> Hi Markus,
> >>
> >> Yes, I think you should open a bug tracker with more from a crashing
> >> osd log file (e.g. all the -1> -2> etc. lines before the crash) and
> >> also from the mon leader if possible.
> >>
> >> Something strange is that the mon_warn_on_pool_pg_num_not_power_of_two
> >> feature is also present in v14.2.9 (it was added in v14.2.8). Which
> >> version did you upgrade from? Perhaps setting it to false was the
> >> trigger, but the crash is somewhere else in the OSD changes in
> >> v14.2.10.
> >>
> >> Cheers, Dan
> >>
> >>
> >> On Wed, Jul 1, 2020 at 9:09 AM Markus Binz <mbinz@xxxxxxxxx> wrote:
> >>>
> >>> Hello,
> >>>
> >>> yesterday we upgraded a mimic cluster to v14.2.10; everything was
> >>> running and OK.
> >>>
> >>> There was this new warning, "2 pool(s) have non-power-of-two pg_num",
> >>> and to get a HEALTH_OK state until we can expand these pools, I found
> >>> this config option to suppress the warning:
> >>>
> >>>    ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false
> >>>
> >>> which resulted in a crash of 40 osd processes (about 60% of the cluster).
> >>>
> >>> No restart possible, always the same crash:
> >>>
> >>> 2020-06-30 21:13:56.179 7fd2b7708c00 -1 osd.30 385679 log_to_monitors {default=true}
> >>> *** Caught signal (Segmentation fault) **
> >>>  in thread 7fd2a5813700 thread_name:fn_odsk_fstore
> >>>  ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
> >>>  1: (()+0x11390) [0x7fd2b53a3390]
> >>>  2: /usr/bin/ceph-osd() [0x87fd12]
> >>>  3: (OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*)+0x5e1) [0x8f0f91]
> >>>  4: (C_OnMapCommit::finish(int)+0x17) [0x946897]
> >>>  5: (Context::complete(int)+0x9) [0x8fbfb9]
> >>>  6: (Finisher::finisher_thread_entry()+0x15e) [0xeb2b8e]
> >>>  7: (()+0x76ba) [0x7fd2b53996ba]
> >>>  8: (clone()+0x6d) [0x7fd2b49a041d]
> >>> 2020-06-30 21:13:56.199 7fd2a5813700 -1 *** Caught signal (Segmentation fault) **
> >>>  in thread 7fd2a5813700 thread_name:fn_odsk_fstore
> >>>
> >>>  ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
> >>>  1: (()+0x11390) [0x7fd2b53a3390]
> >>>  2: /usr/bin/ceph-osd() [0x87fd12]
> >>>  3: (OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*)+0x5e1) [0x8f0f91]
> >>>  4: (C_OnMapCommit::finish(int)+0x17) [0x946897]
> >>>  5: (Context::complete(int)+0x9) [0x8fbfb9]
> >>>  6: (Finisher::finisher_thread_entry()+0x15e) [0xeb2b8e]
> >>>  7: (()+0x76ba) [0x7fd2b53996ba]
> >>>  8: (clone()+0x6d) [0x7fd2b49a041d]
> >>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >>>
> >>>  -1547> 2020-06-30 21:13:51.171 7fd2b7708c00 -1 missing 'type' file, inferring filestore from current/ dir
> >>>   -738> 2020-06-30 21:13:56.179 7fd2b7708c00 -1 osd.30 385679 log_to_monitors {default=true}
> >>>      0> 2020-06-30 21:13:56.199 7fd2a5813700 -1 *** Caught signal (Segmentation fault) **
> >>>  in thread 7fd2a5813700 thread_name:fn_odsk_fstore
> >>>
> >>>  ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
> >>>  1: (()+0x11390) [0x7fd2b53a3390]
> >>>  2: /usr/bin/ceph-osd() [0x87fd12]
> >>>  3: (OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*)+0x5e1) [0x8f0f91]
> >>>  4: (C_OnMapCommit::finish(int)+0x17) [0x946897]
> >>>  5: (Context::complete(int)+0x9) [0x8fbfb9]
> >>>  6: (Finisher::finisher_thread_entry()+0x15e) [0xeb2b8e]
> >>>  7: (()+0x76ba) [0x7fd2b53996ba]
> >>>  8: (clone()+0x6d) [0x7fd2b49a041d]
> >>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >>>
> >>> This is a mixed cluster of Ubuntu xenial and bionic; it happens on both.
> >>>
> >>> It looks like it happens when the new monmap arrives at the osd.
> >>>
> >>> The only fix I was able to come up with: downgrade ceph-osd to v14.2.9.
> >>>
> >>> Should I open a bug report?
> >>>
> >>> Regards
> >>>
> >>> Markus
> >>> _______________________________________________
> >>> ceph-users mailing list -- ceph-users@xxxxxxx
> >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
> --
> Markus Binz, mbinz@xxxxxxxxx, MB44-RIPE, PGPKEY-ABC5F050
> SolNet, Internet Solution Provider
> Phone: +41 32 517 6223
> Fax: +41 32 685 9613
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
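For readers who hit the same "non-power-of-two pg_num" warning that started this thread: the condition behind it is plain bit arithmetic on each pool's pg_num. A minimal sketch of that check and of the usual expansion target (the helper names here are illustrative, not Ceph code):

```python
def is_power_of_two(pg_num: int) -> bool:
    """True if pg_num has exactly one bit set (1, 2, 4, 8, ...)."""
    return pg_num > 0 and pg_num & (pg_num - 1) == 0


def next_power_of_two(pg_num: int) -> int:
    """Smallest power of two >= pg_num, a natural pg_num to expand to."""
    return 1 << (pg_num - 1).bit_length() if pg_num > 1 else 1
```

For example, a pool created with pg_num=1000 trips the warning, and `next_power_of_two(1000)` gives 1024 as the nearest clean expansion target.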