Good morning,

you might have seen my previous mails; I wanted to discuss some findings from the last day+night about what happened and why it happened here. As the system behaved inexplicably for us, we are now looking for someone to analyse the root cause on a consultancy basis - if you are interested & available, please drop me a PM.

What happened
-------------

Yesterday at about 1520 we moved 4 SSDs from server15 to server8. These SSDs are part of 3 SSD-only pools. Soon after that the cluster began to exhibit slow I/O on clients and we saw many PGs in the peering state. Additionally we began to see slow ops on unrelated OSDs, i.e. OSDs that have the device class "hdd-big" set and are only selected by the "hdd" pool. Excerpt from ceph -s:

    167 slow ops, oldest one blocked for 958 sec, daemons
    [osd.0,osd.11,osd.12,osd.14,osd.15,osd.16,osd.19,osd.2,osd.22,osd.25]...
    have slow ops.

All PGs that were hanging in the peering state seemed to be related to the pools backed by the moved SSDs. However, the question remains why other OSDs began to be affected.

Then we noticed that the newly started OSDs on server8 were consuming 100% CPU, the associated disks had a queue depth of 1 and the disks were 80-99% busy. This is a change we had already noticed some weeks ago in the Nautilus OSD behaviour: while in Luminous OSDs would start right away, in Nautilus OSDs often take seconds to minutes to start, accompanied by high CPU usage. Before yesterday, however, the maximum startup time was in the range of minutes.

When we noticed the blocked/slow IOPS, we wanted to ensure that the clients were affected as little as possible. We tried to submit these two changes to all OSDs:

    ceph tell osd.* injectargs '--osd-recovery-max-active 1'
    ceph tell osd.* injectargs '--osd-max-backfills 1'

However, at this point ceph tell would hang. We then switched to addressing each OSD individually using a for loop:

    for osd in $(ceph osd tree | grep osd. | awk '{ print $4 }'); do
        echo $osd
        ( ceph tell $osd injectargs '--osd-max-backfills 1' & )
    done

However, this resulted in many hanging ceph tell processes, and some of them began to report the following error:

    2020-09-22 15:59:25.407 7fbc0359e700  0 --1- [2a0a:e5c0:2:1:20d:b9ff:fe48:3bd4]:0/1280010984 >> v1:[2a0a:e5c0:2:1:21b:21ff:febb:68f0]:6856/4078 conn(0x7fbbec02edc0 0x7fbbec02f5d0 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER

This message would be repeated many times per second per ceph tell process until we killed it with SIGINT.

After this failed, we noticed that the remaining HDDs in server15 were at 100% I/O utilisation with a read rate of only about 0.6-2 MB/s. These low rates would very well explain the slow I/O on the clients - but it made us wonder why the rates were suddenly so low. We restarted one of the OSDs on this server, but the effect was that the OSD now used 100% CPU in addition to fully utilising the disk.

We began to suspect a hardware problem and moved all HDDs to server8. Unfortunately, all the newly started HDD OSDs behaved exactly the same: very high CPU usage (60-100%), very high disk utilisation (about 90%+) and very low transfer rates (~1 MB/s on average). After a few hours (!) some of the OSDs normalised to 20-50 MB/s read rates, and after about 8 hours all of them had normalised. What is curious is that during the slow phase of many hours the average queue depth of the disks consistently stayed at 1, while afterwards it usually averages around 10-15. Given the full utilisation, we suspect that the Nautilus OSDs cause severe disk seeking.
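For reference, the queue depth, utilisation and transfer figures above are the ones iostat reports; something along these lines is enough to reproduce the observation (the device name is a placeholder):

    # extended device statistics, 5 second interval
    # aqu-sz/avgqu-sz = average queue depth, %util = share of time the
    # device was busy, rkB/s = read rate
    iostat -x 5 /dev/sdX

As an aside on the hanging ceph tell: a sketch of how the same two settings could also be applied through the local admin socket on each OSD host, which talks to the OSD over a unix socket and therefore does not depend on the messenger path that was returning BADAUTHORIZER (default socket paths assumed, untested in this incident):

    # run on every OSD host; only touches the OSDs running locally
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        echo "$sock"
        ceph daemon "$sock" config set osd_max_backfills 1
        ceph daemon "$sock" config set osd_recovery_max_active 1
    done

Bounding each call, e.g. timeout 30 ceph tell "$osd" injectargs '--osd-max-backfills 1', would at least have kept the loop from piling up hanging processes.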
Our expectation (and so far the behaviour with Luminous) was that if we have a single host failure, setting max backfills and max recovery to 1 and keeping osd_recovery_op_priority low (usually at 2 in our clusters) has some, but only a minor, impact on client I/O. This time, purely moving 4 OSDs with existing data made the cluster practically unusable, something we never experienced with a Luminous setup before. While I do not want to rule out potential mistakes or incorrect designs on our side, the effects we saw are not what we expected.

While all of this was happening we measured the network bandwidth on all nodes without finding any congestion, and we monitored the monitors, which showed some CPU usage (spikes up to 50% of one core), but nothing high.

I was wondering whether anyone on the list has an insight on these questions:

- Why were OSDs slow that are used in a different pool?
- What changed in the OSDs from Luminous to Nautilus to make the startup phase slow?
- What are Nautilus OSDs doing at startup and why do they consume a lot of CPU?
- What are the OSDs doing to severely impact the I/O? Is our seek theory correct?
- Why does / can ceph tell hang in Nautilus? We never experienced this problem in Luminous.
- Where does the BADAUTHORIZER message come from and what is the fix for it?
- How can we debug / ensure that ceph tell does not hang?
- Did code in the OSDs change that de-prioritises command traffic?

Best regards,

Nico

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx