Fwd: OSD crash after change of osd_memory_target

Hi Igor,

Unfortunately, the result is the same:

# ceph config dump
WHO   MASK LEVEL OPTION            VALUE      RO
  osd      basic osd_memory_target 2147483648   

# /usr/bin/ceph-osd -d --cluster ceph --id 0 --setuser ceph --setgroup ceph
....
     0> 2020-01-23 10:48:04.436 7fc61b5b5c80 -1 *** Caught signal (Aborted) **
 in thread 7fc61b5b5c80 thread_name:ceph-osd

 ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
 1: (()+0x12730) [0x7fc61c05d730]
 2: (gsignal()+0x10b) [0x7fc61bb417bb]
 3: (abort()+0x121) [0x7fc61bb2c535]
 4: (()+0x8c983) [0x7fc61bef4983]
 5: (()+0x928c6) [0x7fc61befa8c6]
 6: (()+0x92901) [0x7fc61befa901]
 7: (()+0x92b34) [0x7fc61befab34]
 8: (()+0x5a3f53) [0x55ecdabb2f53]
 9: (Option::size_t const md_config_t::get_val<Option::size_t>(ConfigValues const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const+0x81) [0x55ecdabb8c91]
 10: (BlueStore::_set_cache_sizes()+0x15a) [0x55ecdb033d8a]
 11: (BlueStore::_open_bdev(bool)+0x173) [0x55ecdb036b23]
 12: (BlueStore::get_devices(std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*)+0xef) [0x55ecdb09d7ef]
 13: (BlueStore::get_numa_node(int*, std::set<int, std::less<int>, std::allocator<int> >*, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*)+0x7b) [0x55ecdb04571b]
 14: (main()+0x2870) [0x55ecdab80440]
 15: (__libc_start_main()+0xeb) [0x7fc61bb2e09b]
 16: (_start()+0x2a) [0x55ecdabb2c6a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
....
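
A note on reading the trace: frame 9 above is
md_config_t::get_val<Option::size_t>, and the debug-symbol backtrace quoted
further down shows the same call ending in
boost::throw_exception<boost::bad_get>. boost::get throws bad_get when the
variant currently holds a different alternative than the one requested. The
following is a minimal C++17 sketch of that failure mode, not Ceph code. It
uses std::variant in place of boost::variant, so the exception is
std::bad_variant_access rather than boost::bad_get. SizeValue, ConfigValue,
typed_get_succeeds and read_size are hypothetical names made up for the
illustration. It also assumes, without confirmation from this thread, that
the value ends up stored as a plain unsigned integer.

```cpp
#include <cstdint>
#include <string>
#include <variant>

// Hypothetical stand-in for a dedicated size type such as Option::size_t.
struct SizeValue {
    std::uint64_t value;
};

// Simplified, assumed model of a config value that can hold several types.
using ConfigValue = std::variant<std::monostate, std::string, std::uint64_t,
                                 std::int64_t, double, bool, SizeValue>;

// Returns true if a typed lookup succeeds, false if the variant holds a
// different alternative than the one requested.
template <typename T>
bool typed_get_succeeds(const ConfigValue& v) {
    try {
        (void)std::get<T>(v);  // throws std::bad_variant_access on mismatch
        return true;
    } catch (const std::bad_variant_access&) {
        return false;
    }
}

// Non-throwing alternative: accept the value whether it is stored as
// SizeValue or as a plain unsigned integer.
bool read_size(const ConfigValue& v, std::uint64_t& out) {
    if (const auto* s = std::get_if<SizeValue>(&v)) {
        out = s->value;
        return true;
    }
    if (const auto* u = std::get_if<std::uint64_t>(&v)) {
        out = *u;
        return true;
    }
    return false;
}
```

With a ConfigValue holding a std::uint64_t of 2147483648,
typed_get_succeeds<SizeValue> returns false, while read_size returns true and
yields 2147483648. An uncaught exception of this kind ends in
std::terminate and abort(), which matches the "Caught signal (Aborted)"
line in the log.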

Best Regards,

Martin

On 22. 01. 20 at 16:33, Igor Fedotov wrote:
>
> Hi Martin,
>
> looks like a bug to me.
>
> You might want to remove all custom settings from the config database and
> try setting only osd-memory-target.
>
> Would it help?
>
>
> Thanks,
>
> Igor
>
> On 1/22/2020 3:43 PM, Martin Mlynář wrote:
>>
>>
>> On 21. 01. 20 at 21:12, Stefan Kooman wrote:
>>> Quoting Martin Mlynář (nexus+ceph@xxxxxxxxxx):
>>>
>>>> Do you think this could help? The OSD does not even start, and I'm a little
>>>> lost as to how flushing caches could help.
>>> I might have misunderstood. I thought the OSDs crashed when you set the
>>> config setting.
>>>
>>>> According to the trace, I suspect something around processing config values.
>>> I've just set the same config setting on a test cluster and restarted an
>>> OSD without problem. So, not sure what is going on there.
>>>
>>> Gr. Stefan
>>
>> I've compiled ceph-osd with debug symbols and got a better backtrace:
>>
>>    -24> 2020-01-22 13:12:53.614 7f83ed064700  4 set_mon_vals no
>> callback set
>>    -23> 2020-01-22 13:12:53.614 7f83ee867700 10 monclient: discarding
>> stray monitor message auth_reply(proto 2 0 (0) Success) v1
>>    -22> 2020-01-22 13:12:53.614 7f83ed064700 10 set_mon_vals
>> osd_crush_update_on_start = true
>>    -21> 2020-01-22 13:12:53.614 7f83ed064700 10 set_mon_vals
>> osd_max_backfills = 64
>>    -20> 2020-01-22 13:12:53.614 7f83ed064700 10 set_mon_vals
>> osd_memory_target = 2147483648
>>    -19> 2020-01-22 13:12:53.614 7f83ed064700 10 set_mon_vals
>> osd_recovery_max_active = 40
>>    -18> 2020-01-22 13:12:53.614 7f83ed064700 10 set_mon_vals
>> osd_recovery_max_single_start = 1000
>>    -17> 2020-01-22 13:12:53.614 7f83ed064700 10 set_mon_vals
>> osd_recovery_sleep_hdd = 0.000000
>>    -16> 2020-01-22 13:12:53.614 7f83ed064700 10 set_mon_vals
>> osd_recovery_sleep_hybrid = 0.000000
>>    -15> 2020-01-22 13:12:53.627 7f83f0276c40  0 set uid:gid to
>> 64045:64045 (ceph:ceph)
>>    -14> 2020-01-22 13:12:53.627 7f83f0276c40  0 ceph version 14.2.6
>> (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable), process
>> ceph-osd, pid 1111622
>>    -13> 2020-01-22 13:12:53.649 7f83f0276c40  0 pidfile_write: ignore
>> empty --pid-file
>>    -12> 2020-01-22 13:12:53.657 7f83f0276c40  5 asok(0x5580518fa000)
>> init /var/run/ceph/ceph-osd.6.asok
>>    -11> 2020-01-22 13:12:53.657 7f83f0276c40  5 asok(0x5580518fa000)
>> bind_and_listen /var/run/ceph/ceph-osd.6.asok
>>    -10> 2020-01-22 13:12:53.657 7f83f0276c40  5 asok(0x5580518fa000)
>> register_command 0 hook 0x558051872fc0
>>     -9> 2020-01-22 13:12:53.657 7f83f0276c40  5 asok(0x5580518fa000)
>> register_command version hook 0x558051872fc0
>>     -8> 2020-01-22 13:12:53.657 7f83f0276c40  5 asok(0x5580518fa000)
>> register_command git_version hook 0x558051872fc0
>>     -7> 2020-01-22 13:12:53.657 7f83f0276c40  5 asok(0x5580518fa000)
>> register_command help hook 0x558051874220
>>     -6> 2020-01-22 13:12:53.657 7f83f0276c40  5 asok(0x5580518fa000)
>> register_command get_command_descriptions hook 0x558051874260
>>     -5> 2020-01-22 13:12:53.657 7f83ed865700  5 asok(0x5580518fa000)
>> entry start
>>     -4> 2020-01-22 13:12:53.670 7f83f0276c40  5 object store type is
>> bluestore
>>     -3> 2020-01-22 13:12:53.675 7f83f0276c40  1 bdev create path
>> /var/lib/ceph/osd/ceph-6/block type kernel
>>     -2> 2020-01-22 13:12:53.675 7f83f0276c40  1 bdev(0x5580518f3f80
>> /var/lib/ceph/osd/ceph-6/block) open path /var/lib/ceph/osd/ceph-6/block
>>     -1> 2020-01-22 13:12:53.675 7f83f0276c40  1 bdev(0x5580518f3f80
>> /var/lib/ceph/osd/ceph-6/block) open size 3000588304384
>> (0x2baa1000000, 2.7 TiB) block_size 4096 (4 KiB) rotational discard
>> not supported
>>      0> 2020-01-22 13:12:53.714 7f83f0276c40 -1 *** Caught signal
>> (Aborted) **
>>  in thread 7f83f0276c40 thread_name:ceph-osd
>>
>>  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9)
>> nautilus (stable)
>>  1: (()+0x2c19654) [0x558045ec6654]
>>  2: (()+0x12730) [0x7f83f0d1f730]
>>  3: (gsignal()+0x10b) [0x7f83f08027bb]
>>  4: (abort()+0x121) [0x7f83f07ed535]
>>  5: (()+0x8c983) [0x7f83f0bb5983]
>>  6: (()+0x928c6) [0x7f83f0bbb8c6]
>>  7: (()+0x92901) [0x7f83f0bbb901]
>>  8: (()+0x92b34) [0x7f83f0bbbb34]
>> * 9: (void boost::throw_exception<boost::bad_get>(boost::bad_get
>> const&)+0x7b) [0x5580454d5430]*
>> * 10: (Option::size_t&& boost::relaxed_get<Option::size_t,
>> boost::blank, std::__cxx11::basic_string<char,
>> std::char_traits<char>, std::allocator<char> >, unsigned long, long,
>> double, bool, entity_addr_t, entity_addrvec_t,
>> std::chrono::duration<long, std::ratio<1l, 1l> >, Option::size_t,
>> uuid_d>(boost::variant<boost::blank, std::__cxx11::basic_string<char,
>> std::char_traits<char>, std::allocator<char> >, unsigned long, long,
>> double, bool, entity_addr_t, entity_addrvec_t,
>> std::chrono::duration<long, std::ratio<1l, 1l> >, Option::size_t,
>> uuid_d>&&)+0x5b) [0x5580454d6223]*
>>  11: (Option::size_t&& boost::strict_get<Option::size_t,
>> boost::blank, std::__cxx11::basic_string<char,
>> std::char_traits<char>, std::allocator<char> >, unsigned long, long,
>> double, bool, entity_addr_t, entity_addrvec_t,
>> std::chrono::duration<long, std::ratio<1l, 1l> >, Option::size_t,
>> uuid_d>(boost::variant<boost::blank, std::__cxx11::basic_string<char,
>> std::char_traits<char>, std::allocator<char> >, unsigned long, long,
>> double, bool, entity_addr_t, entity_addrvec_t,
>> std::chrono::duration<long, std::ratio<1l, 1l> >, Option::size_t,
>> uuid_d>&&)+0x20) [0x5580454d4a39]
>>  12: (Option::size_t&& boost::get<Option::size_t, boost::blank,
>> std::__cxx11::basic_string<char, std::char_traits<char>,
>> std::allocator<char> >, unsigned long, long, double, bool,
>> entity_addr_t, entity_addrvec_t, std::chrono::duration<long,
>> std::ratio<1l, 1l> >, Option::size_t,
>> uuid_d>(boost::variant<boost::blank, std::__cxx11::basic_string<char,
>> std::char_traits<char>, std::allocator<char> >, unsigned long, long,
>> double, bool, entity_addr_t, entity_addrvec_t,
>> std::chrono::duration<long, std::ratio<1l, 1l> >, Option::size_t,
>> uuid_d>&&)+0x20) [0x5580454d1ed7]
>>  13: (Option::size_t const
>> md_config_t::get_val<Option::size_t>(ConfigValues const&,
>> std::__cxx11::basic_string<char, std::char_traits<char>,
>> std::allocator<char> > const&) const+0x48) [0x5580454ce882]
>>  14: (Option::size_t const
>> ConfigProxy::get_val<Option::size_t>(std::__cxx11::basic_string<char,
>> std::char_traits<char>, std::allocator<char> > const&) const+0x58)
>> [0x5580454cb9b8]
>>  15: (BlueStore::_set_cache_sizes()+0x159) [0x558045ce2213]
>>  16: (BlueStore::_open_bdev(bool)+0x301) [0x558045ce6be3]
>>  17:
>> (BlueStore::get_devices(std::set<std::__cxx11::basic_string<char,
>> std::char_traits<char>, std::allocator<char> >,
>> std::less<std::__cxx11::basic_string<char, std::char_traits<char>,
>> std::allocator<char> > >,
>> std::allocator<std::__cxx11::basic_string<char,
>> std::char_traits<char>, std::allocator<char> > > >*)+0xf9)
>> [0x558045d0f16d]
>>  18: (BlueStore::get_numa_node(int*, std::set<int, std::less<int>,
>> std::allocator<int> >*, std::set<std::__cxx11::basic_string<char,
>> std::char_traits<char>, std::allocator<char> >,
>> std::less<std::__cxx11::basic_string<char, std::char_traits<char>,
>> std::allocator<char> > >,
>> std::allocator<std::__cxx11::basic_string<char,
>> std::char_traits<char>, std::allocator<char> > > >*)+0x79)
>> [0x558045d0eb55]
>>  19: (main()+0x3aae) [0x5580454c2460]
>>  20: (__libc_start_main()+0xeb) [0x7f83f07ef09b]
>>  21: (_start()+0x2a) [0x5580454bda2a]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> needed to interpret this.
>>
>> I also managed some gdb debugging (int BlueStore::_set_cache_sizes()):
>>
>> (gdb) n
>> 4116      cache_autotune_interval =
>> (gdb) n
>> 4117         
>> cct->_conf.get_val<double>("bluestore_cache_autotune_interval");
>> (gdb) p cache_autotune_interval
>> $3 = 5
>> (gdb) n
>> 4118      osd_memory_target =
>> cct->_conf.get_val<Option::size_t>("osd_memory_target");
>> (gdb) s
>> std::__cxx11::basic_string<char, std::char_traits<char>,
>> std::allocator<char> >::basic_string<std::allocator<char> >
>> (this=0x7fffffffc140, __s=0x555558d26c2f "osd_memory_target", __a=...)
>>     at /usr/include/c++/8/bits/basic_string.h:515
>> 515          : _M_dataplus(_M_local_data(), __a)
>> (gdb) n
>> 516          { _M_construct(__s, __s ? __s + traits_type::length(__s)
>> : __s+npos); }
>> (gdb)
>> terminate called after throwing an instance of
>> 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::bad_get>
>> >'
>>   what():  boost::bad_get: failed value get using boost::get
>>
>> But there I'm stuck. Debugging C++ code in gdb is really dark sorcery to me.
>>
>> The other get_val calls look fine, so maybe get_val<Option::size_t> is the
>> problem? It looks like the trouble is outside of Ceph. What system are you
>> testing on? This is Debian with the official Debian build from
>> buster-backports. Maybe it is one of Debian's patches?
>>
>> -- 
>> Martin Mlynář
>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



