Hey Andreas,

thanks for the insights. Maybe a bit more background: we are running a
variety of pools; the majority of the data is stored on the "hdd" and
"ssd" pools, which use the "ssd" and "hdd-big" (as in 3.5") device
classes.

Andreas John <aj@xxxxxxxxxxx> writes:

> On 22.09.20 22:09, Nico Schottelius wrote:
> [...]
>> All nodes are connected with 2x 10 Gbit/s bonded/LACP, so I'd expect at
>>
>> The disks in question are 3.5"/10TB/6 Gbit/s SATA disks connected to an
>> H800 controller - so generally speaking I do not see a reasonable
>> bottleneck here.
>
> Yes, I should! I saw in your mail:
>
> 1.) 1532 slow requests are blocked > 32 sec
>     789 slow ops, oldest one blocked for 1949 sec, daemons
>     [osd.12,osd.14,osd.2,osd.20,osd.23,osd.25,osd.3,osd.33,osd.35,osd.50]...
>     have slow ops.
>
> A request that is blocked for > 32 sec is odd! Same goes for 1949 sec.
> In my experience, they will never finish. Sometimes they go away with
> osd restarts. Are those OSDs the ones you relocated?

We tried restarting some of the osds, but the slow ops come back soon
after the restart. And this is the most puzzling part: the move of the
osds only affected PGs that belong to the "ssd" pool. While data was
rebalancing, one hdd osd crashed and was restarted, but what we see at
the moment is slow ops on a lot of osds:

REQUEST_SLOW 4560 slow requests are blocked > 32 sec
    1262 ops are blocked > 2097.15 sec
    1121 ops are blocked > 1048.58 sec
    602 ops are blocked > 524.288 sec
    849 ops are blocked > 262.144 sec
    407 ops are blocked > 131.072 sec
    175 ops are blocked > 65.536 sec
    144 ops are blocked > 32.768 sec
    osd.82 has blocked requests > 131.072 sec
    osds 1,9,11,19,28,44,45,48,58,72,73,84 have blocked requests > 262.144 sec
    osds 2,4,21,22,27,29,31,34,61 have blocked requests > 524.288 sec
    osds 15,20,32,52,55,62,71,74,79,83 have blocked requests > 1048.58 sec
    osds 5,6,7,12,14,16,18,25,33,35,47,50,51,69 have blocked requests > 2097.15 sec
REQUEST_STUCK 1228 stuck requests are blocked > 4096 sec
    330 ops are blocked > 8388.61 sec
    898 ops are blocked > 4194.3 sec
    osds 3,23,56,59,60 have stuck requests > 4194.3 sec
    osds 30,46,49,63,64,65,66,68,70,75,85 have stuck requests > 8388.61 sec
SLOW_OPS 2360 slow ops, oldest one blocked for 6517 sec, daemons
    [osd.0,osd.1,osd.11,osd.12,osd.14,osd.15,osd.16,osd.18,osd.19,osd.2]...
    have slow ops.

We have checked DNS, MTU and network congestion via prometheus, and on
the network side nothing seems to be wrong (a few further checks we can
still run are sketched below).

> 2.) client: 91 MiB/s rd, 28 MiB/s wr, 1.76k op/s rd, 686 op/s wr
>     recovery: 67 MiB/s, 17 objects/s
>
> 67 MB/sec is slower than a single rotational disk can deliver. Even 67
> + 91 MB/s is not much, especially not for an 85 OSD @ 10G cluster. The
> ~2500 IOPS client I/O will translate to 7500 "net" IOPS with pool size
> 3, maybe that is the limit.
>
> But I guess you already know that. But before tuning, you should
> probably listen to Frank's advice about the placements (see other
> post). As soon as the unknown OSDs come back, the speed will probably
> go up due to parallelism.

I am not sure whether, after the rebalance has already been running for
some hours, this is a good idea at the moment.
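To rule out individual OSDs and the heartbeat network more directly,
these are the kinds of checks we can still run (a rough sketch; osd.82
is just one of the affected ids from the output above, and
"dump_osd_network" is, if I am not mistaken, only available from
14.2.5 on):

    # Inspect in-flight and recently completed slow ops on one OSD
    # (run on the host that carries osd.82):
    ceph daemon osd.82 ops
    ceph daemon osd.82 dump_historic_slow_ops

    # Heartbeat ping times as seen by the OSD itself
    # (threshold 0 should list all peers, not only the slow ones):
    ceph daemon osd.82 dump_osd_network 0

    # Cluster-wide summary of where the blocked requests sit:
    ceph health detail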
What really looks wrong are the extremely long peering and activation
times:

  data:
    pools:   12 pools, 3000 pgs
    objects: 35.03M objects, 133 TiB
    usage:   394 TiB used, 163 TiB / 557 TiB avail
    pgs:     5.667% pgs unknown
             24.967% pgs not active
             1365063/105076392 objects degraded (1.299%)
             252605/105076392 objects misplaced (0.240%)
             1955 active+clean
             608  peering
             170  unknown
             59   activating
             57   active+remapped+backfill_wait
             35   activating+undersized
             32   active+undersized+degraded
             20   stale+peering
             17   activating+undersized+degraded
             9    active+remapped+backfilling
             6    stale+active+clean
             5    active+recovery_wait
             4    active+undersized
             4    activating+degraded
             4    active+clean+scrubbing+deep
             4    stale+activating
             3    active+recovery_wait+degraded
             3    active+undersized+degraded+remapped+backfill_wait
             2    remapped+peering
             1    active+recovery_wait+undersized+degraded
             1    active+undersized+degraded+remapped+backfilling
             1    active+remapped+backfill_toofull

  io:
    client:   34 MiB/s rd, 3.6 MiB/s wr, 1.08k op/s rd, 324 op/s wr
    recovery: 82 MiB/s, 20 objects/s

Still debugging. It is impressive how the very simple task of moving 4
SSDs has caused (and keeps causing) such problems; I suspect that
something else must be wrong here. We upgraded from luminous via mimic
to nautilus some months ago, so I will triple-check whether any change
from that upgrade could cause these effects; a sketch of the checks I
have in mind is in the PS below.

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
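PS: A rough sketch of the checks I am planning to run (the pg id 2.7ff
below is only a placeholder for one of the stuck PGs):

    # Confirm that every daemon really runs nautilus and that the
    # cluster has been switched over after the upgrade:
    ceph versions
    ceph osd dump | grep require_osd_release

    # Re-check the crush / device-class layout after the SSD move:
    ceph osd crush tree --show-shadow

    # Ask a stuck PG what it is waiting for (look at "blocked_by"):
    ceph pg dump_stuck inactive
    ceph pg 2.7ff query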