Hey Andreas,

thanks for the insights. Maybe a bit more background: we are running a
variety of pools; the majority of the data is stored on the "hdd" and
"ssd" pools, which use the "ssd" and "hdd-big" (as in 3.5") device
classes.

Andreas John <aj@xxxxxxxxxxx> writes:

> On 22.09.20 22:09, Nico Schottelius wrote:
> [...]
>> All nodes are connected with 2x 10 Gbit/s bonded/LACP, so I'd expect at
>>
>> The disks in question are 3.5"/10TB/6 Gbit/s SATA disks connected to an
>> H800 controller - so generally speaking I do not see a reasonable
>> bottleneck here.
>
> Yes, I should! I saw in your mail:
>
> 1.) 1532 slow requests are blocked > 32 sec
>     789 slow ops, oldest one blocked for 1949 sec, daemons
>     [osd.12,osd.14,osd.2,osd.20,osd.23,osd.25,osd.3,osd.33,osd.35,osd.50]...
>     have slow ops.
>
> A request that is blocked for > 32 sec is odd! Same goes for 1949 sec.
> In my experience, they will never finish. Sometimes they go away with
> osd restarts. Are those OSDs the ones you relocated?

We tried restarting some of the osds, but the slow ops come back soon
after the restart. And this is the most puzzling part: the move of the
osds only affected PGs that belong to the "ssd" pool. While data was
rebalancing, one hdd osd crashed and was restarted, but what we see at
the moment is slow ops on a lot of osds:

REQUEST_SLOW 4560 slow requests are blocked > 32 sec
    1262 ops are blocked > 2097.15 sec
    1121 ops are blocked > 1048.58 sec
    602 ops are blocked > 524.288 sec
    849 ops are blocked > 262.144 sec
    407 ops are blocked > 131.072 sec
    175 ops are blocked > 65.536 sec
    144 ops are blocked > 32.768 sec
    osd.82 has blocked requests > 131.072 sec
    osds 1,9,11,19,28,44,45,48,58,72,73,84 have blocked requests > 262.144 sec
    osds 2,4,21,22,27,29,31,34,61 have blocked requests > 524.288 sec
    osds 15,20,32,52,55,62,71,74,79,83 have blocked requests > 1048.58 sec
    osds 5,6,7,12,14,16,18,25,33,35,47,50,51,69 have blocked requests > 2097.15 sec
REQUEST_STUCK 1228 stuck requests are blocked > 4096 sec
    330 ops are blocked > 8388.61 sec
    898 ops are blocked > 4194.3 sec
    osds 3,23,56,59,60 have stuck requests > 4194.3 sec
    osds 30,46,49,63,64,65,66,68,70,75,85 have stuck requests > 8388.61 sec
SLOW_OPS 2360 slow ops, oldest one blocked for 6517 sec, daemons
    [osd.0,osd.1,osd.11,osd.12,osd.14,osd.15,osd.16,osd.18,osd.19,osd.2]...
    have slow ops.

We have checked DNS, MTU and network congestion via prometheus, and on
the network side nothing seems to be wrong (a few further checks we can
still run are sketched below).

> 2.) client: 91 MiB/s rd, 28 MiB/s wr, 1.76k op/s rd, 686 op/s wr
>     recovery: 67 MiB/s, 17 objects/s
>
> 67 MB/sec is slower than a single rotational disk can deliver. Even 67
> + 91 MB/s is not much, especially not for an 85 OSD @ 10G cluster. The
> ~2500 IOPS client I/O will translate to 7500 "net" IOPS with pool size
> 3, maybe that is the limit.
>
> But I guess you already know that. But before tuning, you should
> probably listen to Frank's advice about the placements (see other
> post). As soon as the unknown OSDs come back, the speed will probably
> go up due to parallelism.

I am not sure whether, after the rebalance has already been running for
some hours, this is a good idea at the moment.
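To rule out individual OSDs and the heartbeat network more directly,
these are the kinds of checks we can still run (a rough sketch; osd.82
is just one of the affected ids from the output above, and
"dump_osd_network" is, if I am not mistaken, only available from
14.2.5 on):

    # Inspect in-flight and recently completed slow ops on one OSD
    # (run on the host that carries osd.82):
    ceph daemon osd.82 ops
    ceph daemon osd.82 dump_historic_slow_ops

    # Heartbeat ping times as seen by the OSD itself
    # (threshold 0 should list all peers, not only the slow ones):
    ceph daemon osd.82 dump_osd_network 0

    # Cluster-wide summary of where the blocked requests sit:
    ceph health detail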
What really looks wrong are the extremely long peering and activation
times:

  data:
    pools:   12 pools, 3000 pgs
    objects: 35.03M objects, 133 TiB
    usage:   394 TiB used, 163 TiB / 557 TiB avail
    pgs:     5.667% pgs unknown
             24.967% pgs not active
             1365063/105076392 objects degraded (1.299%)
             252605/105076392 objects misplaced (0.240%)
             1955 active+clean
             608  peering
             170  unknown
             59   activating
             57   active+remapped+backfill_wait
             35   activating+undersized
             32   active+undersized+degraded
             20   stale+peering
             17   activating+undersized+degraded
             9    active+remapped+backfilling
             6    stale+active+clean
             5    active+recovery_wait
             4    active+undersized
             4    activating+degraded
             4    active+clean+scrubbing+deep
             4    stale+activating
             3    active+recovery_wait+degraded
             3    active+undersized+degraded+remapped+backfill_wait
             2    remapped+peering
             1    active+recovery_wait+undersized+degraded
             1    active+undersized+degraded+remapped+backfilling
             1    active+remapped+backfill_toofull

  io:
    client:   34 MiB/s rd, 3.6 MiB/s wr, 1.08k op/s rd, 324 op/s wr
    recovery: 82 MiB/s, 20 objects/s

Still debugging. It is impressive how the very simple task of moving 4
SSDs has caused (and keeps causing) such problems; I suspect that
something else must be wrong here. We upgraded from luminous via mimic
to nautilus some months ago, so I will triple-check whether any change
from that upgrade could cause these effects; a sketch of the checks I
have in mind is in the PS below.

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
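PS: A rough sketch of the checks I am planning to run (the pg id 2.7ff
below is only a placeholder for one of the stuck PGs):

    # Confirm that every daemon really runs nautilus and that the
    # cluster has been switched over after the upgrade:
    ceph versions
    ceph osd dump | grep require_osd_release

    # Re-check the crush / device-class layout after the SSD move:
    ceph osd crush tree --show-shadow

    # Ask a stuck PG what it is waiting for (look at "blocked_by"):
    ceph pg dump_stuck inactive
    ceph pg 2.7ff query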