Re: OSD stuck during a two-OSD drain

Hi Nicola,

There have been reports of similar PG/OSD behaviour with a single PG
remaining when draining. You can try repeering the PG, moving it to a
different OSD with upmap, or restarting the primary OSD of the PG in
question to get things moving. Also check whether it is making any progress
at all - easiest with the JJ balancer (showremapped). Example commands
below.
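
For example, assuming the stuck PG were 2.1a, its primary osd.12, and the
move were from osd.31 to osd.45 (all placeholder IDs - check ceph pg ls
remapped for the real ones), it would look roughly like:

# ceph pg repeer 2.1a
# ceph osd pg-upmap-items 2.1a 31 45
# ceph orch daemon restart osd.12
# ./placementoptimizer.py showremapped

The last one is TheJJ's ceph-balancer script
(https://github.com/TheJJ/ceph-balancer).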

Some might find it risky, but we usually trust the failure domain
(disk/host/rack) and simply pull the OSD when swapping disks. It saves us
time.
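
If you go that route, a rough sketch of the sequence (osd.31 here is just a
placeholder; removing it with --replace keeps the OSD id for the new disk):

# ceph orch daemon stop osd.31
# ceph osd destroy 31 --yes-i-really-mean-it

Then pull the drive and let cephadm redeploy onto the replacement; the
surviving shards in the other failure domains cover the data in the
meantime.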


Best,
Laimis J.

On Fri, Dec 20, 2024, 09:51 Nicola Mori <mori@xxxxxxxxxx> wrote:

> Dear Ceph users,
>
> I'm upgrading some disks of my cluster (Squid 19.2.0 managed by cephadm,
> in which basically I have only a 6+2 EC pool over 12 hosts). To speed up
> the operations I issued a ceph orch osd rm --replace for two OSDs in two
> different hosts; the drain started for both and for one OSD finished
> smoothly and it is now in destroyed state. But for the second OSD it
> stopped with a single PG remaining to be moved away before the OSD is
> completely drained:
>
> # ceph orch osd rm status
> OSD  HOST     STATE     PGS  REPLACE  FORCE  ZAP    DRAIN STARTED AT
>
> 31   rokanan  draining    1  True     False  False  2024-12-19
> 08:57:36.458704+00:00
>
> and there is no backfill activity going on, even if the PG is labeled
> as backfilling:
>
> # ceph -s
>    cluster:
>      id:     b1029256-7bb3-11ec-a8ce-ac1f6b627b45
>      health: HEALTH_WARN
>              52 pgs not deep-scrubbed in time
>              (muted: OSD_SLOW_PING_TIME_BACK OSD_SLOW_PING_TIME_FRONT)
>
>    services:
>      mon: 5 daemons, quorum bofur,fili,aka,bifur,romolo (age 7d)
>      mgr: fili.olevnm(active, since 18h), standbys: bofur.tklnrn,
> bifur.htimkf
>      mds: 2/2 daemons up, 1 standby
>      osd: 124 osds: 123 up (since 4h), 122 in (since 22h); 1 remapped pgs
>
>    data:
>      volumes: 1/1 healthy
>      pools:   3 pools, 529 pgs
>      objects: 27.11M objects, 78 TiB
>      usage:   104 TiB used, 162 TiB / 266 TiB avail
>      pgs:     53120/216457202 objects misplaced (0.025%)
>               302 active+clean
>               178 active+clean+scrubbing
>               48  active+clean+scrubbing+deep
>               1   active+remapped+backfilling
>
>
> Is all of the above normal? My guess is that only one destroyed OSD at a
> time can exist in the cluster, and that after replacing its disk and
> recreating it, the drain of the second one will resume and finish. Is this
> plausible?
> Thanks,
>
> Nicola
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


