Only one of the 4 servers that have access to the cluster is failing.
Since we were in the middle of an update, the kernel and OS versions are a bit all over the place, but here we go:
bad:
Debian 10 (buster) - 5.10.0-0.bpo.8-amd64 #1 SMP Debian 5.10.46-2~bpo10+1 (2021-07-22) x86_64 GNU/Linux
good:
Debian 10 (buster) - 5.10.0-0.bpo.4-amd64 #1 SMP Debian 5.10.19-1~bpo10+1 (2021-03-13) x86_64 GNU/Linux
Debian 11 (bullseye) - 5.11.22-3-pve #1 SMP PVE 5.11.22-6 (Wed, 28 Jul 2021 10:51:12 +0200) x86_64 GNU/Linux
Debian 11 (bullseye) - 5.10.0-8-amd64 #1 SMP Debian 5.10.46-3 (2021-07-28) x86_64 GNU/Linux
the proxmox one is at:
ceph version 16.2.5 (9b9dd76e12f1907fe5dcc0c1fadadbb784022a42) pacific (stable)
the rest are at:
ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
All 3 other servers run all ceph commands fine,
and 2 of them have a mon running fine too.
The 3rd mon was on the bad one, and it is not starting, with the same error.
We stopped the bullseye update since we might kill a second server, which would be bad for the cluster ;)
As soon as possible we will try going back to 5.10.0-0.bpo.4-amd64 on the bad one,
since this is known to work on one of the other servers.
The hardware of the servers cannot really be compared since this is a testing / dev cluster composed of old hardware.
We tried the update there first so as not to kill the production environment.
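To compare the bad host with the good ones on the points raised in this thread (kernel release, entropy_avail, and whether the CPU advertises RDRAND), a small script like the one below could be run on each server. This is only a sketch: the function names (`cpu_flags`, `host_report`, `local_report`) are made up for illustration and are not part of ceph. As far as I understand, the message `random_device::__x86_rdrand(void)` comes from libstdc++ giving up after the RDRAND instruction fails repeatedly, so an advertised `rdrand` flag would not by itself prove that the instruction works.

```python
import platform
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    # No 'flags' line found (for example on a non-x86 host).
    return set()


def host_report(cpuinfo_text, entropy_text, kernel):
    """Summarise the values that differ (or not) between good and bad hosts."""
    flags = cpu_flags(cpuinfo_text)
    return {
        "kernel": kernel,
        "entropy_avail": int(entropy_text.strip()),
        "rdrand": "rdrand" in flags,
        "rdseed": "rdseed" in flags,
    }


def local_report():
    """Build the report for the machine this script runs on (Linux only)."""
    return host_report(
        Path("/proc/cpuinfo").read_text(),
        Path("/proc/sys/kernel/random/entropy_avail").read_text(),
        platform.release(),
    )
```

Running `print(local_report())` on every server and diffing the output would show whether the bad host differs in anything other than the kernel release.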
On Sun, Aug 15, 2021 at 00:13, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
On Sat, Aug 14, 2021 at 7:13 PM Links 2004 <links2004.code@xxxxxxxxx> wrote:
>
> No recent kernel update, and it's a server so no keyboard etc., but entropy_avail looks good and is in the same range (+-80) as the other server.
> `dmesg | grep random` has no results.
> # cat /proc/sys/kernel/random/entropy_avail
> 3547
>
> all servers run
> ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
Can you give us some more information?
Do all of your servers fail all of the time or do some fail some of the time?
What do the ones that fail have in common and how do they differ from
the ones that do not fail?
If you have the option to try alternative kernels on the servers that
exhibit this behavior I'd suggest that might be a good next step.
>
>
>
>
> On Sat, Aug 14, 2021 at 10:55, kefu chai <tchaikov@xxxxxxxxx> wrote:
>>
>>
>>
>> On Sat, Aug 14, 2021 at 06:11, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>>>
>>> On Sat, Aug 14, 2021 at 4:06 AM Links 2004 <links2004.code@xxxxxxxxx> wrote:
>>> >
>>> >
>>> > Hi,
>>> >
>>> > we are currently facing a strange problem on one of our ceph nodes.
>>> > it is not possible to call `ceph -s` or start a mgr without a 'std::runtime_error'.
>>> >
>>> > find below the error message and a gdb backtrace with debug symbols.
>>> > hope this helps to understand the problem and point us in the correct direction.
>>> >
>>> > Thanks
>>> >
>>> > Markus
>>> >
>>> >
>>> > OS: Debian buster
>>> > kernel: 5.10.0-0.bpo.8-amd64 #1 SMP Debian 5.10.46-2~bpo10+1 (2021-07-22) x86_64 GNU/Linux
>>> >
>>> > ```
>>> > # ceph -v
>>> > ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
>>> > ```
>>> >
>>> > ```
>>> > # ceph -s
>>> > terminate called after throwing an instance of 'std::runtime_error'
>>> > what(): random_device::__x86_rdrand(void)
>>> > Aborted
>>> > ```
>>>
>>> Did this issue coincide with a kernel upgrade?
>>>
>>> Can you try and generate a lot of entropy on the system and see if the
>>> issue goes away?
>>>
>>> Also check the output of 'dmesg | grep random' to see if that offers any clues.
>>
>>
>> I feel the same. It looks likely that the kernel did not have enough entropy by then. Is the system not connected to a keyboard? Or was it just booted? If that's the case, you could probably wait a while before trying to launch the mgr or use the ceph command line utility.
>>>
>>>
>>>
>>> >
>>> > ```
>>> > # gdb --args /usr/bin/python3.7 /usr/bin/ceph -s
>>> > GNU gdb (Debian 8.2.1-2+b3) 8.2.1
>>> > Copyright (C) 2018 Free Software Foundation, Inc.
>>> > License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>>> > This is free software: you are free to change and redistribute it.
>>> > There is NO WARRANTY, to the extent permitted by law.
>>> > Type "show copying" and "show warranty" for details.
>>> > This GDB was configured as "x86_64-linux-gnu".
>>> > Type "show configuration" for configuration details.
>>> > For bug reporting instructions, please see:
>>> > <http://www.gnu.org/software/gdb/bugs/>.
>>> > Find the GDB manual and other documentation resources online at:
>>> > <http://www.gnu.org/software/gdb/documentation/>.
>>> >
>>> > For help, type "help".
>>> > Type "apropos word" to search for commands related to "word"...
>>> > Reading symbols from /usr/bin/python3.7...Reading symbols from /usr/lib/debug/.build-id/99/21c75e6930d3e9d9fa8c942aca9dc4500bb65f.debug...done.
>>> > done.
>>> > (gdb) run
>>> > Starting program: /usr/bin/python3.7 /usr/bin/ceph -s
>>> > [Thread debugging using libthread_db enabled]
>>> > Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>> >
>>> > [New Thread 0x7fffed515700 (LWP 30323)]
>>> > [New Thread 0x7fffe7fff700 (LWP 30324)]
>>> > [New Thread 0x7fffe77fe700 (LWP 30325)]
>>> > [Thread 0x7fffed515700 (LWP 30323) exited]
>>> > [New Thread 0x7fffed515700 (LWP 30326)]
>>> > [Thread 0x7fffe7fff700 (LWP 30324) exited]
>>> > terminate called after throwing an instance of 'std::runtime_error'
>>> > what(): random_device::__x86_rdrand(void)
>>> >
>>> > Thread 4 "python3.7" received signal SIGABRT, Aborted.
>>> > [Switching to Thread 0x7fffe77fe700 (LWP 30325)]
>>> > __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>>> > 50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
>>> > (gdb)
>>> > (gdb) bt
>>> > #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>>> > #1 0x00007ffff79de535 in __GI_abort () at abort.c:79
>>> > #2 0x00007fffeddb8983 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
>>> > #3 0x00007fffeddbe8c6 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
>>> > #4 0x00007fffeddbe901 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
>>> > #5 0x00007fffeddbeb34 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
>>> > #6 0x00007fffeddba8b7 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
>>> > #7 0x00007fffedde6e86 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
>>> > #8 0x00007fffedde6fd2 in std::random_device::_M_getval() () from /lib/x86_64-linux-gnu/libstdc++.so.6
>>> > #9 0x00007fffee540ffc in std::random_device::operator() (this=0x7fffe77fba60) at /usr/include/c++/8/bits/random.h:1611
>>> > #10 ceph::util::version_1_0_3::detail::randomize_rng<std::linear_congruential_engine<unsigned long, 16807ul, 0ul, 2147483647ul> > () at ./src/include/random.h:120
>>> > #11 0x00007fffee54112f in ceph::util::version_1_0_3::detail::engine<std::linear_congruential_engine<unsigned long, 16807ul, 0ul, 2147483647ul> > ()
>>> > at /usr/include/c++/8/new:169
>>> > #12 0x00007fffee772911 in ceph::util::version_1_0_3::detail::generate_random_number<unsigned long, std::uniform_int_distribution<unsigned long>, std::linear_congruential_engine<unsigned long, 16807ul, 0ul, 2147483647ul> > (min=min@entry=0, max=max@entry=18446744073709551615) at ./src/include/random.h:170
>>> > #13 0x00007fffee7718fe in ceph::util::version_1_0_3::generate_random_number<unsigned long, std::uniform_int_distribution<unsigned long>, std::linear_congruential_engine<unsigned long, 16807ul, 0ul, 2147483647ul> > () at ./src/include/random.h:203
>>> > #14 Messenger::get_random_nonce () at ./src/msg/Messenger.cc:33
>>> > #15 0x00007fffee771dd6 in Messenger::create_client_messenger (cct=0x7fffe80011c0, lname="temp_mon_client") at ./src/msg/Messenger.cc:16
>>> > #16 0x00007fffee82c862 in MonClient::get_monmap_and_config (this=this@entry=0x7fffe77fd080) at /usr/include/c++/8/ext/new_allocator.h:79
>>> > #17 0x00007ffff6fd9e14 in librados::v14_2_0::RadosClient::connect (this=this@entry=0x7fffe805ae80) at ./src/librados/RadosClient.cc:227
>>> > #18 0x00007ffff6f66a0f in _rados_connect (cluster=0x7fffe805ae80) at ./src/librados/librados_c.cc:204
>>> > #19 0x00007ffff71bf0b5 in ?? () from /usr/lib/python3/dist-packages/rados.cpython-37m-x86_64-linux-gnu.so
>>> > #20 0x00007ffff71380ec in ?? () from /usr/lib/python3/dist-packages/rados.cpython-37m-x86_64-linux-gnu.so
>>> > #21 0x00000000004d9850 in _PyObject_FastCallDict (kwargs={}, nargs=1, args=0x7fffe77fd880, callable=<cython_function_or_method at remote 0x7fffed6ae100>)
>>> > at ../Objects/call.c:125
>>> > #22 _PyObject_Call_Prepend (kwargs={}, args=<optimized out>, obj=<optimized out>, callable=<cython_function_or_method at remote 0x7fffed6ae100>) at ../Objects/call.c:904
>>> > #23 method_call (method=<optimized out>, args=<optimized out>, kwargs=<optimized out>, method=<optimized out>, args=<optimized out>, kwargs=<optimized out>)
>>> > at ../Objects/classobject.c:309
>>> > #24 0x00000000005dc4f6 in PyObject_Call (callable=<method at remote 0x7ffff7634dc8>, args=<optimized out>, kwargs=<optimized out>) at ../Objects/call.c:245
>>> > #25 0x000000000054f987 in do_call_core (kwdict={}, callargs=(), func=<method at remote 0x7ffff7634dc8>) at ../Python/ceval.c:4645
>>> > #26 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at ../Python/ceval.c:3191
>>> > #27 0x00000000005d91fc in PyEval_EvalFrameEx (throwflag=0,
>>> > f=Frame 0x7ffff72c6bb8, for file /usr/lib/python3/dist-packages/ceph_argparse.py, line 1458, in run (self=<RadosThread(args=(), kwargs={}, func=<method at remote 0x7ffff7634dc8>, exception=None, _target=None, _name='Thread-3', _args=(...), _kwargs={}, _daemonic=True, _ident=140737077307136, _tstate_lock=<_thread.lock at remote 0x7fffed63af58>, _started=<Event(_cond=<Condition(_lock=<_thread.lock at remote 0x7fffed63ad00>, acquire=<built-in method acquire of _thread.lock object at remote 0x7fffed63ad00>, release=<built-in method release of _thread.lock object at remote 0x7fffed63ad00>, _waiters=<collections.deque at remote 0x7fffed6d5a70>) at remote 0x7fffed533470>, _flag=True) at remote 0x7fffed533438>, _is_stopped=False, _initialized=True, _stderr=<_io.TextIOWrapper at remote 0x7ffff7629708>) at remote 0x7fffed533358>)) at ../Python/ceval.c:547
>>> > #28 function_code_fastcall (globals=<optimized out>, nargs=<optimized out>, args=<optimized out>, co=<optimized out>) at ../Objects/call.c:283
>>> > --Type <RET> for more, q to quit, c to continue without paging--c
>>> > #29 _PyFunction_FastCallKeywords (func=<optimized out>, stack=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>) at ../Objects/call.c:408
>>> > #30 0x000000000054e7e0 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=0x7fffe77fdb30) at ../Python/ceval.c:4616
>>> > #31 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at ../Python/ceval.c:3110
>>> > #32 0x00000000005d91fc in PyEval_EvalFrameEx (throwflag=0, f=Frame 0x7fffe8000b38, for file /usr/lib/python3.7/threading.py, line 917, in _bootstrap_inner (self=<RadosThread(args=(), kwargs={}, func=<method at remote 0x7ffff7634dc8>, exception=None, _target=None, _name='Thread-3', _args=(...), _kwargs={}, _daemonic=True, _ident=140737077307136, _tstate_lock=<_thread.lock at remote 0x7fffed63af58>, _started=<Event(_cond=<Condition(_lock=<_thread.lock at remote 0x7fffed63ad00>, acquire=<built-in method acquire of _thread.lock object at remote 0x7fffed63ad00>, release=<built-in method release of _thread.lock object at remote 0x7fffed63ad00>, _waiters=<collections.deque at remote 0x7fffed6d5a70>) at remote 0x7fffed533470>, _flag=True) at remote 0x7fffed533438>, _is_stopped=False, _initialized=True, _stderr=<_io.TextIOWrapper at remote 0x7ffff7629708>) at remote 0x7fffed533358>)) at ../Python/ceval.c:547
>>> > #33 function_code_fastcall (globals=<optimized out>, nargs=<optimized out>, args=<optimized out>, co=<optimized out>) at ../Objects/call.c:283
>>> > #34 _PyFunction_FastCallKeywords (func=<optimized out>, stack=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>) at ../Objects/call.c:408
>>> > #35 0x000000000054e7e0 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=0x7fffe77fdcc0) at ../Python/ceval.c:4616
>>> > #36 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at ../Python/ceval.c:3110
>>> > #37 0x00000000005da536 in PyEval_EvalFrameEx (throwflag=0, f=Frame 0x7fffed538048, for file /usr/lib/python3.7/threading.py, line 885, in _bootstrap (self=<RadosThread(args=(), kwargs={}, func=<method at remote 0x7ffff7634dc8>, exception=None, _target=None, _name='Thread-3', _args=(...), _kwargs={}, _daemonic=True, _ident=140737077307136, _tstate_lock=<_thread.lock at remote 0x7fffed63af58>, _started=<Event(_cond=<Condition(_lock=<_thread.lock at remote 0x7fffed63ad00>, acquire=<built-in method acquire of _thread.lock object at remote 0x7fffed63ad00>, release=<built-in method release of _thread.lock object at remote 0x7fffed63ad00>, _waiters=<collections.deque at remote 0x7fffed6d5a70>) at remote 0x7fffed533470>, _flag=True) at remote 0x7fffed533438>, _is_stopped=False, _initialized=True, _stderr=<_io.TextIOWrapper at remote 0x7ffff7629708>) at remote 0x7fffed533358>)) at ../Python/ceval.c:547
>>> > #38 function_code_fastcall (globals=<optimized out>, nargs=<optimized out>, args=<optimized out>, co=<optimized out>) at ../Objects/call.c:283
>>> > #39 _PyFunction_FastCallDict (func=<optimized out>, args=<optimized out>, nargs=<optimized out>, kwargs=<optimized out>) at ../Objects/call.c:322
>>> > #40 0x00000000004d97e2 in _PyObject_FastCallDict (kwargs=0x0, nargs=1, args=0x7fffe77fde00, callable=<function at remote 0x7ffff73ead90>) at ../Objects/call.c:98
>>> > #41 _PyObject_Call_Prepend (kwargs=0x0, args=<optimized out>, obj=<RadosThread(args=(), kwargs={}, func=<method at remote 0x7ffff7634dc8>, exception=None, _target=None, _name='Thread-3', _args=(...), _kwargs={}, _daemonic=True, _ident=140737077307136, _tstate_lock=<_thread.lock at remote 0x7fffed63af58>, _started=<Event(_cond=<Condition(_lock=<_thread.lock at remote 0x7fffed63ad00>, acquire=<built-in method acquire of _thread.lock object at remote 0x7fffed63ad00>, release=<built-in method release of _thread.lock object at remote 0x7fffed63ad00>, _waiters=<collections.deque at remote 0x7fffed6d5a70>) at remote 0x7fffed533470>, _flag=True) at remote 0x7fffed533438>, _is_stopped=False, _initialized=True, _stderr=<_io.TextIOWrapper at remote 0x7ffff7629708>) at remote 0x7fffed533358>, callable=<function at remote 0x7ffff73ead90>) at ../Objects/call.c:904
>>> > #42 method_call (method=<optimized out>, args=<optimized out>, kwargs=<optimized out>, method=<optimized out>, args=<optimized out>, kwargs=<optimized out>) at ../Objects/classobject.c:309
>>> > #43 0x00000000005dc4f6 in PyObject_Call (callable=<method at remote 0x7ffff7634b08>, args=<optimized out>, kwargs=<optimized out>) at ../Objects/call.c:245
>>> > #44 0x0000000000617b63 in t_bootstrap (boot_raw=boot_raw@entry=0x7fffed63ae40) at ../Modules/_threadmodule.c:994
>>> > #45 0x000000000062dfe4 in pythread_wrapper (arg=<optimized out>) at ../Python/thread_pthread.h:174
>>> > #46 0x00007ffff7f6efa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
>>> > #47 0x00007ffff7ab54cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>>> > ```
>>> > _______________________________________________
>>> > Dev mailing list -- dev@xxxxxxx
>>> > To unsubscribe send an email to dev-leave@xxxxxxx
>>>
>>>
>>>
>>> --
>>> Cheers,
>>> Brad
>>>
>>
>> --
>> Regards
>> Kefu Chai
--
Cheers,
Brad