Re: [EXTERNAL] Re: How to Speed Up Draining OSDs?

My pool size is indeed 3. Operator error 🙂

Thanks again,
Alex

________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Monday, October 21, 2024 3:08 PM
To: Alex Hussein-Kershaw (HE/HIM) <alexhus@xxxxxxxxxxxxx>
Cc: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject: Re: [EXTERNAL]  Re: How to Speed Up Draining OSDs?

If your pool size is three then no, you can't get down to two OSDs.
You can check (and paste) the 'ceph osd pool ls detail' output to see
the current value. (I wouldn't recommend switching to size 2 except in
test clusters.)
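
For reference, a minimal sketch of that check, plus how one would drop
a pool to size 2 on a throwaway test cluster (the pool name is a
placeholder):

  $ ceph osd pool ls detail          # look for "size 3 min_size 2" on each pool
  $ ceph osd pool set <pool> size 2  # test clusters only, not recommended otherwise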

Zitat von "Alex Hussein-Kershaw (HE/HIM)" <alexhus@xxxxxxxxxxxxx>:

> Hi Eugen,
>
> Thanks for the suggestion. I've repeated my attempt with the wpq
> scheduler (I ran "ceph config set osd osd_op_queue wpq" and
> restarted all the OSDs).
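>
> For what it's worth, a rough sketch of how one might verify the
> switch (assuming cephadm-deployed OSDs; the daemon name is a
> placeholder):
>
>   $ ceph config get osd osd_op_queue    # should now report "wpq"
>   $ ceph orch daemon restart osd.2      # repeat for each OSD daemon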
>
> That still seems to be either slow or stuck in the draining state: 10
> minutes have elapsed draining just a few MiB of data.
>
> $ ceph orch osd rm status ; date
> OSD  HOST         STATE     PGS  REPLACE  FORCE  ZAP   DRAIN STARTED AT
> 2    raynor-sc-2  draining  117  False    False  True  2024-10-21 13:48:52.559054
>
> Mon Oct 21 13:59:33 UTC 2024
>
> $ ceph osd df
> ID  CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP     META     AVAIL   %USE  VAR   PGS  STATUS
>  0    hdd  0.01459   1.00000  15 GiB   64 MiB  22 MiB      0 B   42 MiB  15 GiB  0.42  1.14  117      up
>  2    hdd        0   1.00000  15 GiB   52 MiB  22 MiB    2 KiB   30 MiB  15 GiB  0.34  0.93  117      up
>  3    hdd  0.01459   1.00000  15 GiB   52 MiB  21 MiB    7 KiB   32 MiB  15 GiB  0.34  0.93  117      up
>                        TOTAL  45 GiB  169 MiB  66 MiB  9.5 KiB  104 MiB  45 GiB  0.37
> MIN/MAX VAR: 0.93/1.14  STDDEV: 0.04
>
> $ ceph -s
>   cluster:
>     id:     e773d9c2-6d8d-4413-8e8f-e38f248f5959
>     health: HEALTH_OK
>
>   services:
>     mon: 2 daemons, quorum raynor-sc-1,raynor-sc-3 (age 7m)
>     mgr: raynor-sc-1.hjpano(active, since 10m), standbys: raynor-sc-3.grmovv
>     mds: 1/1 daemons up, 1 standby
>     osd: 3 osds: 3 up (since 10m), 3 in (since 74m); 117 remapped pgs
>     rgw: 2 daemons active (2 hosts, 1 zones)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   9 pools, 117 pgs
>     objects: 250 objects, 479 KiB
>     usage:   181 MiB used, 45 GiB / 45 GiB avail
>     pgs:     250/750 objects misplaced (33.333%)
>              117 active+clean+remapped
>
> Interesting that the cluster thinks 33% of objects are misplaced;
> that seems to imply it's stuck rather than just slow. I wonder if
> it's actually possible to drop below 3 OSDs in this manner?
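>
> (Rough arithmetic, assuming every pool has size 3: 250 objects x 3
> replicas = 750 copies. With OSD 2's crush weight set to 0, each PG
> still wants 3 distinct OSDs but only 2 remain, so the 250 copies
> sitting on OSD 2 have nowhere to move and stay "misplaced":
> 250/750 = 33.3%.)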
>
> Thanks,
> Alex
> ________________________________
> From: Eugen Block <eblock@xxxxxx>
> Sent: Monday, October 21, 2024 2:20 PM
> To: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
> Subject: [EXTERNAL]  Re: How to Speed Up Draining OSDs?
>
> Hi,
>
> for a production cluster I'd recommend sticking with wpq at the
> moment, where you can apply the "legacy" recovery settings. If you're
> willing to help the devs figure out how to get to the bottom of this,
> I'm sure they would highly appreciate it. But I currently know too
> little about mclock to know the right knobs. So far I've only tried a
> few different settings and none helped significantly.
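>
> For context, the "legacy" knobs people typically raise under wpq look
> roughly like this (the values are illustrative, not recommendations):
>
>   $ ceph config set osd osd_max_backfills 4
>   $ ceph config set osd osd_recovery_max_active 8
>   $ ceph config set osd osd_recovery_sleep_hdd 0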
>
> I would expect that there are existing tracker issues, since this
> topic comes up every other week or so. If not, I'd suggest creating one.
>
> Thanks,
> Eugen
>
> Zitat von "Alex Hussein-Kershaw (HE/HIM)" <alexhus@xxxxxxxxxxxxx>:
>
>> Hi Folks,
>>
>> I'm trying to scale in a Ceph cluster. It's running 19.2.0 and is
>> cephadm-managed. It's just a test system, so it has basically no
>> data and only 3 OSDs.
>>
>> As part of the scale-in, I run "ceph orch host drain <hostname>
>> --zap-osd-devices" as per the host management page of the Ceph
>> documentation
>> <https://docs.ceph.com/en/reef/cephadm/host-management/#removing-hosts>.
>> That starts off the OSD draining.
>>
>> However, that drain seems to take an enormous amount of time. My OSD
>> holds less than 100 MiB of raw data, and after letting it run for 2
>> hours over lunch it still had not finished, so I cancelled it.
>>
>> I'm not sure how this scales, but I'm assuming at least linearly
>> with data stored, which seems like bad news for doing this on real
>> systems, which may have several TBs per OSD.
>>
>> I had a look at the recovery profiles documented in the mClock Config
>> Reference
>> <https://docs.ceph.com/en/reef/rados/configuration/mclock-config-ref/>,
>> which seemed to indicate I could speed this up (but my impression was
>> that I might get a speed-up of maybe 2x, which seems like it would
>> still take an age).
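>>
>> (If I've read the docs right, the speed-up amounts to switching the
>> mClock profile to favour recovery, roughly:
>>
>>   $ ceph config set osd osd_mclock_profile high_recovery_ops
>>
>> which prioritises recovery/backfill over client I/O.)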
>>
>> On the other hand, just switching off the host running the OSD and
>> doing an offline host removal ("ceph orch host rm <hostname>
>> --offline") seems much easier, with the trade-off that the cluster
>> recovers after the loss of the OSD rather than pre-emptively. The big
>> risk of that seems to be mitigated by running "ceph orch host
>> ok-to-stop <hostname>" to check that I won't cause any PGs to go
>> offline beforehand.
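>>
>> Roughly, the sequence I have in mind (the host name is a placeholder,
>> and I believe the docs pair --offline with --force for hosts that are
>> already down):
>>
>>   $ ceph orch host ok-to-stop <hostname>   # confirm no PGs would go offline
>>   # power off the host, then:
>>   $ ceph orch host rm <hostname> --offline --force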
>>
>> Are there any tricks here that I'm missing?
>>
>> Thanks,
>> Alex
>>
>
>



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



