I wanted to swap out an existing OSD, preserve its ID, remove the HDD backing it (osd.14 in this case), and give the ID of 14 to a new SSD taking its place in the same node. This is my first time ever doing this, so I'm not sure what to expect.
I followed the instructions here,
using the --replace flag.
However, I'm a bit concerned that the operation is taking so long on my test cluster. Out of 70TB in the cluster, only 40GB are in use. This OSD is relatively large compared to the others (2.7TB versus ~300GB for most), and yet it's been 36 hours with the following status:
ceph04.ssc.wisc.edu> ceph orch osd rm status
OSD_ID  HOST                 STATE     PG_COUNT  REPLACE  FORCE  DRAIN_STARTED_AT
14      ceph04.ssc.wisc.edu  draining  1         True     True   2021-11-30 15:22:23.469150+00:00
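For what it's worth, I assume commands along these lines would show whether that last PG is actually moving, though I'm not certain this is the recommended way to watch a drain:

    # List the PG(s) still mapped to osd.14, i.e. what the drain is waiting on
    ceph pg ls-by-osd 14

    # Ask whether osd.14 can be destroyed without risking data
    ceph osd safe-to-destroy 14

    # Check overall recovery/backfill activity
    ceph -s

If ceph pg ls-by-osd keeps showing the same PG and ceph -s shows no recovery activity, I'd guess that points to it being hung rather than just slow, but I'd welcome correction on that.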
Another note: I don't know why FORCE shows as True; the command I ran was just "ceph orch osd rm 14 --replace", without specifying --force. Hopefully not a big deal, but still strange.
At this point, is there any way to tell whether it's still actually doing something, or whether it's hung? If it is hung, what would be the 'recommended' way to proceed? I know I could just manually eject the HDD from the chassis, run "ceph osd crush remove osd.14", manually delete the auth keys, etc., but the documentation seems to state that this shouldn't be necessary if a Ceph OSD replacement goes properly.
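For reference, my understanding of that manual route (which I'd rather avoid) is a sketch like the following; please correct me if the steps or order are off:

    # Remove the OSD from the CRUSH map
    ceph osd crush remove osd.14

    # Delete its authentication key
    ceph auth del osd.14

    # Remove the OSD record itself
    ceph osd rm 14

I believe "ceph osd purge 14 --yes-i-really-mean-it" combines those steps into one, but again, the docs suggest none of this should be required when --replace completes normally.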