Re: Is it normal for an orch osd rm drain to take so long?

Can do

ceph -s:
  cluster:
    id:     <fsid>
    health: HEALTH_OK
 
  services:
    mon: 4 daemons, quorum ceph05,ceph04,ceph01,ceph03 (age 4d)
    mgr: ceph01.fblojp(active, since 25h), standbys: ceph03.futetp
    mds: 1/1 daemons up, 1 standby
    osd: 32 osds: 32 up (since 9d), 31 in (since 2d); 15 remapped pgs
 
  data:
    volumes: 1/1 healthy
    pools:   6 pools, 161 pgs
    objects: 2.23k objects, 8.1 GiB
    usage:   34 GiB used, 66 TiB / 66 TiB avail
    pgs:     278/6682 objects misplaced (4.160%)
             146 active+clean
             15  active+clean+remapped

full ceph osd df:

ceph04.ssc.wisc.edu> ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA      OMAP    META     AVAIL    %USE  VAR    PGS  STATUS
 1    hdd  0.27280   1.00000  279 GiB  1.2 GiB   889 MiB  15 KiB  289 MiB  278 GiB  0.41   8.15   18      up
27    hdd  0.27280   1.00000  279 GiB  1.3 GiB   955 MiB  17 KiB  394 MiB  278 GiB  0.47   9.33   23      up
29    hdd  0.27280   1.00000  279 GiB  1.1 GiB   873 MiB   3 KiB  253 MiB  278 GiB  0.39   7.78   19      up
31    hdd  0.27280   1.00000  279 GiB  110 MiB    22 MiB  10 KiB   88 MiB  279 GiB  0.04   0.76   18      up
28    hdd  0.81870   1.00000  838 GiB  3.3 GiB   2.4 GiB  26 MiB  890 MiB  835 GiB  0.39   7.80   58      up
33    ssd  0.36389   1.00000  373 GiB  781 MiB   750 MiB     0 B   31 MiB  372 GiB  0.20   4.05   23      up
 4    hdd  0.27280   1.00000  279 GiB  1.8 GiB   1.4 GiB   6 KiB  440 MiB  278 GiB  0.64  12.62   30      up
32    ssd  0.36389   1.00000  373 GiB  541 MiB   511 MiB     0 B   31 MiB  372 GiB  0.14   2.81   30      up
 0    hdd  2.72899   1.00000  2.7 TiB  975 MiB   673 MiB   5 KiB  302 MiB  2.7 TiB  0.03   0.67   20      up
 2    hdd  2.72899   1.00000  2.7 TiB  1.9 GiB   1.2 GiB   3 KiB  700 MiB  2.7 TiB  0.07   1.36   17      up
 3    hdd  2.72899   1.00000  2.7 TiB  1.4 GiB  1022 MiB   6 KiB  389 MiB  2.7 TiB  0.05   0.98   20      up
 6    hdd  2.72899   1.00000  2.7 TiB  109 MiB    20 MiB   2 KiB   89 MiB  2.7 TiB  0.00   0.08    6      up
 7    hdd  2.72899   1.00000  2.7 TiB  126 MiB    30 MiB   3 KiB   96 MiB  2.7 TiB  0.00   0.09   13      up
 8    hdd  2.72899   1.00000  2.7 TiB  2.4 GiB   1.8 GiB  26 MiB  595 MiB  2.7 TiB  0.09   1.71   17      up
 9    hdd  2.72899   1.00000  2.7 TiB  1.4 GiB   1.0 GiB   3 KiB  422 MiB  2.7 TiB  0.05   1.02   20      up
10    hdd  2.72899   1.00000  2.7 TiB  832 MiB   582 MiB   5 KiB  250 MiB  2.7 TiB  0.03   0.58   11      up
11    hdd  2.72899   1.00000  2.7 TiB  763 MiB   511 MiB   6 KiB  252 MiB  2.7 TiB  0.03   0.53   17      up
12    hdd  2.72899   1.00000  2.7 TiB  1.1 GiB   824 MiB   4 KiB  290 MiB  2.7 TiB  0.04   0.77   12      up
13    hdd  2.72899   1.00000  2.7 TiB  1.1 GiB   807 MiB   4 KiB  352 MiB  2.7 TiB  0.04   0.80   12      up
14    hdd  2.72899         0      0 B      0 B       0 B     0 B      0 B      0 B     0      0    1      up
15    hdd  2.72899   1.00000  2.7 TiB  728 MiB   481 MiB   3 KiB  247 MiB  2.7 TiB  0.03   0.50   11      up
16    hdd  2.72899   1.00000  2.7 TiB  1.1 GiB   835 MiB  10 KiB  322 MiB  2.7 TiB  0.04   0.80   21      up
17    hdd  2.72899   1.00000  2.7 TiB  1.1 GiB   829 MiB   4 KiB  295 MiB  2.7 TiB  0.04   0.78   16      up
18    hdd  2.72899   1.00000  2.7 TiB  1.7 GiB   1.2 GiB   7 KiB  531 MiB  2.7 TiB  0.06   1.19   16      up
19    hdd  2.72899   1.00000  2.7 TiB  1.0 GiB   728 MiB   4 KiB  322 MiB  2.7 TiB  0.04   0.73   15      up
20    hdd  2.72899   1.00000  2.7 TiB  1.1 GiB   762 MiB  10 KiB  389 MiB  2.7 TiB  0.04   0.80    8      up
21    hdd  2.72899   1.00000  2.7 TiB  155 MiB    24 MiB  26 MiB  106 MiB  2.7 TiB  0.01   0.11   14      up
22    hdd  2.72899   1.00000  2.7 TiB  1.9 GiB   1.4 GiB   3 KiB  538 MiB  2.7 TiB  0.07   1.33   13      up
23    hdd  2.72899   1.00000  2.7 TiB  101 MiB    20 MiB   2 KiB   82 MiB  2.7 TiB  0.00   0.07   12      up
24    hdd  2.72899   1.00000  2.7 TiB  547 MiB   406 MiB  15 KiB  142 MiB  2.7 TiB  0.02   0.38   12      up
25    hdd  2.72899   1.00000  2.7 TiB  1.3 GiB   938 MiB   4 KiB  408 MiB  2.7 TiB  0.05   0.93   14      up
26    hdd  2.72899   1.00000  2.7 TiB  1.1 GiB   827 MiB   4 KiB  284 MiB  2.7 TiB  0.04   0.77   10      up
                       TOTAL   66 TiB   34 GiB    24 GiB  77 MiB  9.6 GiB   66 TiB  0.05                    
MIN/MAX VAR: 0.07/12.62  STDDEV: 0.17


ceph04.ssc.wisc.edu> iostat
Linux 5.4.151-1.el8.elrepo.x86_64 (ceph04.ssc.wisc.edu)         12/02/21        _x86_64_        (24 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.23    0.01    0.30    0.05    0.00   99.41
Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdj              15.71       189.15      1087.20  899361787 5169458151
sdb              15.79       198.68      1087.20  944678058 5169458151
sda               0.16        18.13         1.40   86216232    6675304
sdg               0.19        20.00         3.23   95095360   15357700
sdh               0.18        19.60         2.17   93197576   10306048
sdm               0.17        19.59         1.62   93166944    7715396
sdk               0.17        19.17         0.98   91160628    4653560
sdn               0.18        20.91         0.48   99404428    2294820
sdf               0.16        17.77         0.03   84473116     148432
sdl               0.17        19.59         1.29   93146740    6140328
sdd               0.17        19.19         1.85   91257788    8816344
sdi               0.17        18.87         1.07   89719756    5068364
sdc               0.18        19.94         3.76   94810840   17890524
sde               0.16        17.64         0.04   83896044     170476
md127             0.02         2.51         0.00   11919016          0
md126             0.02         2.51         0.00   11919427       3106
md125            10.63         2.99      1090.95   14231587 5187271176
md124             0.02         2.09         0.00    9935088          1
dm-0              0.09         1.58         2.17    7504152   10306048
dm-1              0.01         0.04         0.04     187900     170476
dm-3              0.03         0.38         0.48    1802652    2294820
dm-2              0.01         0.02         0.03     103212     148432
dm-5              0.07         1.02         1.62    4826480    7715396
dm-4              0.05         0.71         1.07    3364572    5068364
dm-8              0.06         1.15         1.29    5468036    6140328
dm-6              0.05         0.87         0.98    4143684    4653560
dm-7              0.09         1.73         1.85    8211404    8816344
dm-9              0.15         2.61         3.76   12426216   17890524
dm-10             0.13         2.12         3.23   10063696   15357700
dm-11             0.07         0.95         1.40    4493496    6675304

/dev/sdn is where osd.14 is running. So no, it doesn't look like much activity is occurring on that disk, compared with sdj and sdb, which hold the boot disks in RAID1.
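If it helps, extended stats for just that device can be watched with something like this (5-second sampling interval):

iostat -x sdn 5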

Lastly, I apologize, but I'm not sure how to find the logs for one specific OSD.

Zach


On 2021-12-02 2:52 PM, David Orman wrote:
Hi,

It would be good to have the full output. Does iostat show the backing device performing I/O? What does ceph -s show for cluster state? Also, can you check the logs on that OSD and see if anything looks abnormal?
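For a cephadm deployment, the per-daemon log usually lives in journald on the host running the OSD, so something along these lines should reach it:

cephadm logs --name osd.14
# or, equivalently, with your cluster's fsid filled in:
journalctl -u ceph-<fsid>@osd.14

ceph pg ls-by-osd 14 should also show which PG is still mapped to it.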

David

On Thu, Dec 2, 2021 at 1:20 PM Zach Heise (SSCC) <heise@xxxxxxxxxxxx> wrote:

Good morning David,

Unless you need to see the data for the other 31 OSDs as well, here is what osd.14 is showing:

ID  CLASS  WEIGHT   REWEIGHT  SIZE  RAW USE  DATA  OMAP  META  AVAIL  %USE  VAR  PGS  STATUS
14    hdd  2.72899         0   0 B      0 B   0 B   0 B   0 B    0 B     0    0    1      up

Zach

On 2021-12-01 5:20 PM, David Orman wrote:
What's "ceph osd df" show?

On Wed, Dec 1, 2021 at 2:20 PM Zach Heise (SSCC) <heise@xxxxxxxxxxxx> wrote:

I wanted to swap out an existing OSD, preserve its ID, remove the HDD behind it (osd.14 in this case), and give the ID of 14 to a new SSD taking its place in the same node. This is my first time doing this, so I'm not sure what to expect.

I followed the instructions here, using the --replace flag.

However, I'm a bit concerned that the operation is taking so long in my test cluster: out of 70TB in the cluster, only 40GB are in use. This is a relatively large OSD compared to most others in the cluster (2.7TB versus 300GB), and yet it's been 36 hours with the following status:

ceph04.ssc.wisc.edu> ceph orch osd rm status
OSD_ID  HOST                 STATE     PG_COUNT  REPLACE  FORCE  DRAIN_STARTED_AT                  
14      ceph04.ssc.wisc.edu  draining  1         True     True   2021-11-30 15:22:23.469150+00:00

Another note: I don't know why it shows force = true; the command I ran was just ceph orch osd rm 14 --replace, without specifying --force. Hopefully not a big deal, but still strange.

At this point, is there any way to tell whether it's still actually doing something, or whether it has hung? If it has hung, what would be the recommended way to proceed? I know I could manually eject the HDD from the chassis, run ceph osd crush remove osd.14, and then manually delete the auth keys and so on, but the documentation seems to say that none of that should be necessary if an OSD replacement goes properly.
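For reference, the manual fallback I have in mind is roughly this sequence (as I understand it from the docs):

ceph osd crush remove osd.14   # remove it from the CRUSH map
ceph auth del osd.14           # delete its auth key
ceph osd rm 14                 # remove the OSD entry itself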


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
