Re: Is it normal for an orch osd rm drain to take so long?

Hi,

It would be good to have the full "ceph osd df" output. Does iostat show the
backing device performing I/O? What does ceph -s report for the cluster
state? Also, could you check the logs on that OSD and see if anything looks
abnormal?
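
Something like the following should cover it (just a sketch; adjust the host
and daemon name, with osd.14 being the OSD in question):

    iostat -x 5                    # on the OSD host; watch the backing device for I/O
    ceph -s                        # overall cluster state, recovery/backfill activity
    cephadm logs --name osd.14     # journal for that OSD daemon, run on its host
    ceph pg ls-by-osd 14           # which PG(s) are still mapped to osd.14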

David

On Thu, Dec 2, 2021 at 1:20 PM Zach Heise (SSCC) <heise@xxxxxxxxxxxx> wrote:

> Good morning David,
>
> Unless you want to see the data for the other 31 OSDs as well, osd.14 is
> showing:
> ID  CLASS  WEIGHT   REWEIGHT  SIZE  RAW USE  DATA  OMAP  META  AVAIL  %USE  VAR  PGS  STATUS
> 14  hdd    2.72899         0   0 B      0 B   0 B   0 B   0 B    0 B     0    0    1  up
>
> Zach
>
> On 2021-12-01 5:20 PM, David Orman wrote:
>
> What's "ceph osd df" show?
>
> On Wed, Dec 1, 2021 at 2:20 PM Zach Heise (SSCC) <heise@xxxxxxxxxxxx>
> wrote:
>
>> I wanted to swap out an existing OSD, preserve its ID, and then remove the
>> HDD that held it (osd.14 in this case) and give ID 14 to a new SSD taking
>> its place in the same node. This is my first time doing this, so I'm not
>> sure what to expect.
>>
>> I followed the instructions here
>> <https://docs.ceph.com/en/latest/cephadm/services/osd/#remove-an-osd>,
>> using the --replace flag.
>>
>> However, I'm a bit concerned that the operation is taking so long in my
>> test cluster. Out of 70TB in the cluster, only 40GB were in use. This is a
>> relatively large OSD in comparison to others in the cluster (2.7TB versus
>> 300GB for most other OSDs) and yet it's been 36 hours with the following
>> status:
>>
>> ceph04.ssc.wisc.edu> ceph orch osd rm status
>> OSD_ID  HOST                 STATE     PG_COUNT  REPLACE  FORCE  DRAIN_STARTED_AT
>> 14      ceph04.ssc.wisc.edu  draining  1         True     True   2021-11-30 15:22:23.469150+00:00
>>
>>
>> Another note: I don't know why it shows FORCE = True; the command I ran was
>> just "ceph orch osd rm 14 --replace", without specifying --force. Hopefully
>> not a big deal, but still strange.
>>
>> At this point, is there any way to tell whether it's still actually doing
>> something, or whether it is hung? If it is hung, what would be the
>> 'recommended' way to proceed? I know I could just manually eject the HDD
>> from the chassis, run "ceph osd crush remove osd.14", and then manually
>> delete the auth keys, etc., but the documentation seems to state that this
>> shouldn't be necessary if a Ceph OSD replacement goes properly.
>>
>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


