Hi,

I just uploaded one of 196 crash reports:

https://tracker.ceph.com/issues/46443

I tried to debug it myself, but it took too much time.

Markus

On 06.07.20 11:00, Dan van der Ster wrote:
> Hi Markus,
>
> Did you make any progress with this?
> (Selfishly pinging to better understand whether the 14.2.9 to 14.2.10
> upgrade is safe or not)
>
> Cheers, Dan
>
>
> On Wed, Jul 1, 2020 at 9:30 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>
>> Hi Markus,
>>
>> Yes, I think you should open a bug tracker with more from a crashing
>> osd log file (e.g. all the -1> -2> etc. lines before the crash) and
>> also from the mon leader if possible.
>>
>> Something strange is that the mon_warn_on_pool_pg_num_not_power_of_two
>> feature is also present in v14.2.9 (it was added in v14.2.8). Which
>> version did you upgrade from? Perhaps setting it to false was the
>> trigger, but the crash is somewhere else in the OSD changes in
>> v14.2.10.
>>
>> Cheers, Dan
>>
>>
>> On Wed, Jul 1, 2020 at 9:09 AM Markus Binz <mbinz@xxxxxxxxx> wrote:
>>>
>>> Hello,
>>>
>>> Yesterday we upgraded a mimic cluster to v14.2.10; everything was running and OK.
>>>
>>> There was this new warning, "2 pool(s) have non-power-of-two pg_num", and to get a HEALTH_OK state until we can expand these pools,
>>> I found this config option to suppress the warning:
>>>
>>> ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false
>>>
>>> which resulted in a crash of 40 osd processes (about 60% of the cluster).
>>>
>>> No restart was possible; it was always the same crash.
>>>
>>> 2020-06-30 21:13:56.179 7fd2b7708c00 -1 osd.30 385679 log_to_monitors {default=true}
>>> *** Caught signal (Segmentation fault) **
>>> in thread 7fd2a5813700 thread_name:fn_odsk_fstore
>>> ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
>>> 1: (()+0x11390) [0x7fd2b53a3390]
>>> 2: /usr/bin/ceph-osd() [0x87fd12]
>>> 3: (OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*)+0x5e1) [0x8f0f91]
>>> 4: (C_OnMapCommit::finish(int)+0x17) [0x946897]
>>> 5: (Context::complete(int)+0x9) [0x8fbfb9]
>>> 6: (Finisher::finisher_thread_entry()+0x15e) [0xeb2b8e]
>>> 7: (()+0x76ba) [0x7fd2b53996ba]
>>> 8: (clone()+0x6d) [0x7fd2b49a041d]
>>> 2020-06-30 21:13:56.199 7fd2a5813700 -1 *** Caught signal (Segmentation fault) **
>>> in thread 7fd2a5813700 thread_name:fn_odsk_fstore
>>>
>>> ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
>>> 1: (()+0x11390) [0x7fd2b53a3390]
>>> 2: /usr/bin/ceph-osd() [0x87fd12]
>>> 3: (OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*)+0x5e1) [0x8f0f91]
>>> 4: (C_OnMapCommit::finish(int)+0x17) [0x946897]
>>> 5: (Context::complete(int)+0x9) [0x8fbfb9]
>>> 6: (Finisher::finisher_thread_entry()+0x15e) [0xeb2b8e]
>>> 7: (()+0x76ba) [0x7fd2b53996ba]
>>> 8: (clone()+0x6d) [0x7fd2b49a041d]
>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>
>>> -1547> 2020-06-30 21:13:51.171 7fd2b7708c00 -1 missing 'type' file, inferring filestore from current/ dir
>>> -738> 2020-06-30 21:13:56.179 7fd2b7708c00 -1 osd.30 385679 log_to_monitors {default=true}
>>> 0> 2020-06-30 21:13:56.199 7fd2a5813700 -1 *** Caught signal (Segmentation fault) **
>>> in thread 7fd2a5813700 thread_name:fn_odsk_fstore
>>>
>>> ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
>>> 1: (()+0x11390) [0x7fd2b53a3390]
>>> 2: /usr/bin/ceph-osd() [0x87fd12]
>>> 3: (OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*)+0x5e1) [0x8f0f91]
>>> 4: (C_OnMapCommit::finish(int)+0x17) [0x946897]
>>> 5: (Context::complete(int)+0x9) [0x8fbfb9]
>>> 6: (Finisher::finisher_thread_entry()+0x15e) [0xeb2b8e]
>>> 7: (()+0x76ba) [0x7fd2b53996ba]
>>> 8: (clone()+0x6d) [0x7fd2b49a041d]
>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>
>>> -1547> 2020-06-30 21:13:51.171 7fd2b7708c00 -1 missing 'type' file, inferring filestore from current/ dir
>>> -738> 2020-06-30 21:13:56.179 7fd2b7708c00 -1 osd.30 385679 log_to_monitors {default=true}
>>> 0> 2020-06-30 21:13:56.199 7fd2a5813700 -1 *** Caught signal (Segmentation fault) **
>>> in thread 7fd2a5813700 thread_name:fn_odsk_fstore
>>>
>>> ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
>>> 1: (()+0x11390) [0x7fd2b53a3390]
>>> 2: /usr/bin/ceph-osd() [0x87fd12]
>>> 3: (OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*)+0x5e1) [0x8f0f91]
>>> 4: (C_OnMapCommit::finish(int)+0x17) [0x946897]
>>> 5: (Context::complete(int)+0x9) [0x8fbfb9]
>>> 6: (Finisher::finisher_thread_entry()+0x15e) [0xeb2b8e]
>>> 7: (()+0x76ba) [0x7fd2b53996ba]
>>> 8: (clone()+0x6d) [0x7fd2b49a041d]
>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>
>>> This is a mixed cluster of Ubuntu xenial and bionic; it happens on both.
>>>
>>> It looks like it happens when the new monmap arrives at the osd.
>>>
>>> The only fix I was able to come up with was to downgrade ceph-osd to v14.2.9.
>>>
>>> Should I open a bug report?
>>>
>>> Regards
>>>
>>> Markus
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Markus Binz, mbinz@xxxxxxxxx, MB44-RIPE, PGPKEY-ABC5F050
SolNet, Internet Solution Provider
Phone: +41 32 517 6223
Fax: +41 32 685 9613
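
For reference, the health warning discussed in this thread fires when a pool's pg_num is not a power of two. The following is only a minimal sketch of that check, plus the nearest power-of-two values a pool could be resized to. The pool names and pg_num values are hypothetical placeholders, not taken from the cluster in this thread, and the helper functions are illustrative, not part of any Ceph API.

```python
from typing import Tuple


def is_power_of_two(n: int) -> bool:
    """True if n is a positive power of two (1, 2, 4, 8, ...)."""
    return n > 0 and (n & (n - 1)) == 0


def nearest_powers_of_two(n: int) -> Tuple[int, int]:
    """Return (largest power of two <= n, smallest power of two >= n)."""
    if n < 1:
        raise ValueError("pg_num must be positive")
    lower = 1 << (n.bit_length() - 1)
    upper = lower if lower == n else lower << 1
    return lower, upper


# Hypothetical pools; real values would come from the cluster itself.
pools = {"pool_a": 96, "pool_b": 128, "pool_c": 300}

for name, pg_num in pools.items():
    if is_power_of_two(pg_num):
        print(f"{name}: pg_num={pg_num} is a power of two")
    else:
        lower, upper = nearest_powers_of_two(pg_num)
        print(f"{name}: pg_num={pg_num} is not a power of two "
              f"(nearest: {lower} or {upper})")
```

Adjusting pg_num to a power of two removes the warning at its source, whereas the config option mentioned above only suppresses it.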