I have now started to iterate over all OSDs in the tree, and some of the
OSDs are completely unresponsive:

[18:27:18] black1.place6:~# for osd in $(ceph osd tree | grep osd. | awk '{ print $4 }'); do echo $osd; ceph tell $osd injectargs '--osd-max-backfills 1'; done
osd.20
osd.56
osd.62
osd.63
^CTraceback (most recent call last):
  File "/usr/bin/ceph", line 1266, in <module>
    retval = main()
  File "/usr/bin/ceph", line 1182, in main
    prefix='get_command_descriptions')
  File "/usr/lib/python3/dist-packages/ceph_argparse.py", line 1459, in json_command
    inbuf, timeout, verbose)
  File "/usr/lib/python3/dist-packages/ceph_argparse.py", line 1329, in send_command_retry
    return send_command(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/ceph_argparse.py", line 1361, in send_command
    cluster.osd_command, osdid, cmd, inbuf, timeout=timeout)
  File "/usr/lib/python3/dist-packages/ceph_argparse.py", line 1311, in run_in_thread
    t.join(timeout=timeout)
  File "/usr/lib/python3.7/threading.py", line 1036, in join
    self._wait_for_tstate_lock(timeout=max(timeout, 0))
  File "/usr/lib/python3.7/threading.py", line 1048, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
osd.64
osd.65

What is the best way to figure out why osd.63 does not react to the tell
command?

Best regards,

Nico


Nico Schottelius <nico.schottelius@xxxxxxxxxxx> writes:

> Hello Stefan,
>
> Stefan Kooman <stefan@xxxxxx> writes:
>
>> Hi,
>>
>>> However, as soon as we issue either of the above tell commands, it just
>>> hangs. Furthermore, when ceph tell hangs, PGs also become stuck in
>>> "Activating" and "Peering" states.
>>>
>>> The two seem related: as soon as we stop ceph tell (Ctrl-C it), the PGs
>>> become peered/active a few minutes later.
>>>
>>> We can also reproduce this problem with very busy OSDs that have been
>>> moved to another host - they do not react to the ceph tell commands
>>> either.
>>
>> Does this also happen when you issue an OSD-specific "tell", i.e.
>> ceph tell osd.13 injectargs '--osd-max-backfills 4'?
>>
>> Does this also happen when you loop over the OSDs one by one?
>
> It does hang for some of them, but if I "ping" / select specific OSDs,
> this does not happen.
>
>>> Has anyone seen this before, and/or do you have a hint on how to debug
>>> ceph tell, given that it is not a daemon of its own?
>>
>> IIRC I have seen this, but not in combination with PGs peering /
>> activating. Has the config change become effective on all OSDs? Verify
>> with ceph daemon osd.13 config get osd_max_backfills (for all OSDs).
>
> Just checked - most OSDs did not apply the new setting; setting it
> explicitly on them works, however.
>
> Best regards,
>
> Nico

-- 
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
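
For reference, a minimal sketch of the same loop with a per-OSD timeout, so
an unresponsive OSD is skipped instead of blocking the whole run. It assumes
GNU coreutils "timeout" is available on the admin host; the 10-second limit
is an arbitrary choice:

    # set osd_max_backfills on every OSD, giving each one at most 10 seconds
    for osd in $(ceph osd tree | grep osd. | awk '{ print $4 }'); do
        echo "$osd"
        timeout 10 ceph tell "$osd" injectargs '--osd-max-backfills 1' \
            || echo "$osd did not respond within 10 seconds"
    done

    # verify the value really applied; ceph daemon talks to the local admin
    # socket, so run this on the host where the OSD in question lives
    ceph daemon osd.63 config get osd_max_backfills

Because ceph daemon goes through the local admin socket rather than the
network path that ceph tell uses, comparing the two can help narrow down
whether osd.63 itself is wedged or only the tell path to it is.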