Can do
ceph -s:
  cluster:
    id:     <fsid>
    health: HEALTH_OK

  services:
    mon: 4 daemons, quorum ceph05,ceph04,ceph01,ceph03 (age 4d)
    mgr: ceph01.fblojp(active, since 25h), standbys: ceph03.futetp
    mds: 1/1 daemons up, 1 standby
    osd: 32 osds: 32 up (since 9d), 31 in (since 2d); 15 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 161 pgs
    objects: 2.23k objects, 8.1 GiB
    usage:   34 GiB used, 66 TiB / 66 TiB avail
    pgs:     278/6682 objects misplaced (4.160%)
             146 active+clean
             15  active+clean+remapped
full ceph osd df:
ceph04.ssc.wisc.edu> ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
1 hdd 0.27280 1.00000 279 GiB 1.2 GiB 889 MiB 15 KiB 289 MiB 278 GiB 0.41 8.15 18 up
27 hdd 0.27280 1.00000 279 GiB 1.3 GiB 955 MiB 17 KiB 394 MiB 278 GiB 0.47 9.33 23 up
29 hdd 0.27280 1.00000 279 GiB 1.1 GiB 873 MiB 3 KiB 253 MiB 278 GiB 0.39 7.78 19 up
31 hdd 0.27280 1.00000 279 GiB 110 MiB 22 MiB 10 KiB 88 MiB 279 GiB 0.04 0.76 18 up
28 hdd 0.81870 1.00000 838 GiB 3.3 GiB 2.4 GiB 26 MiB 890 MiB 835 GiB 0.39 7.80 58 up
33 ssd 0.36389 1.00000 373 GiB 781 MiB 750 MiB 0 B 31 MiB 372 GiB 0.20 4.05 23 up
4 hdd 0.27280 1.00000 279 GiB 1.8 GiB 1.4 GiB 6 KiB 440 MiB 278 GiB 0.64 12.62 30 up
32 ssd 0.36389 1.00000 373 GiB 541 MiB 511 MiB 0 B 31 MiB 372 GiB 0.14 2.81 30 up
0 hdd 2.72899 1.00000 2.7 TiB 975 MiB 673 MiB 5 KiB 302 MiB 2.7 TiB 0.03 0.67 20 up
2 hdd 2.72899 1.00000 2.7 TiB 1.9 GiB 1.2 GiB 3 KiB 700 MiB 2.7 TiB 0.07 1.36 17 up
3 hdd 2.72899 1.00000 2.7 TiB 1.4 GiB 1022 MiB 6 KiB 389 MiB 2.7 TiB 0.05 0.98 20 up
6 hdd 2.72899 1.00000 2.7 TiB 109 MiB 20 MiB 2 KiB 89 MiB 2.7 TiB 0.00 0.08 6 up
7 hdd 2.72899 1.00000 2.7 TiB 126 MiB 30 MiB 3 KiB 96 MiB 2.7 TiB 0.00 0.09 13 up
8 hdd 2.72899 1.00000 2.7 TiB 2.4 GiB 1.8 GiB 26 MiB 595 MiB 2.7 TiB 0.09 1.71 17 up
9 hdd 2.72899 1.00000 2.7 TiB 1.4 GiB 1.0 GiB 3 KiB 422 MiB 2.7 TiB 0.05 1.02 20 up
10 hdd 2.72899 1.00000 2.7 TiB 832 MiB 582 MiB 5 KiB 250 MiB 2.7 TiB 0.03 0.58 11 up
11 hdd 2.72899 1.00000 2.7 TiB 763 MiB 511 MiB 6 KiB 252 MiB 2.7 TiB 0.03 0.53 17 up
12 hdd 2.72899 1.00000 2.7 TiB 1.1 GiB 824 MiB 4 KiB 290 MiB 2.7 TiB 0.04 0.77 12 up
13 hdd 2.72899 1.00000 2.7 TiB 1.1 GiB 807 MiB 4 KiB 352 MiB 2.7 TiB 0.04 0.80 12 up
14 hdd 2.72899 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 1 up
15 hdd 2.72899 1.00000 2.7 TiB 728 MiB 481 MiB 3 KiB 247 MiB 2.7 TiB 0.03 0.50 11 up
16 hdd 2.72899 1.00000 2.7 TiB 1.1 GiB 835 MiB 10 KiB 322 MiB 2.7 TiB 0.04 0.80 21 up
17 hdd 2.72899 1.00000 2.7 TiB 1.1 GiB 829 MiB 4 KiB 295 MiB 2.7 TiB 0.04 0.78 16 up
18 hdd 2.72899 1.00000 2.7 TiB 1.7 GiB 1.2 GiB 7 KiB 531 MiB 2.7 TiB 0.06 1.19 16 up
19 hdd 2.72899 1.00000 2.7 TiB 1.0 GiB 728 MiB 4 KiB 322 MiB 2.7 TiB 0.04 0.73 15 up
20 hdd 2.72899 1.00000 2.7 TiB 1.1 GiB 762 MiB 10 KiB 389 MiB 2.7 TiB 0.04 0.80 8 up
21 hdd 2.72899 1.00000 2.7 TiB 155 MiB 24 MiB 26 MiB 106 MiB 2.7 TiB 0.01 0.11 14 up
22 hdd 2.72899 1.00000 2.7 TiB 1.9 GiB 1.4 GiB 3 KiB 538 MiB 2.7 TiB 0.07 1.33 13 up
23 hdd 2.72899 1.00000 2.7 TiB 101 MiB 20 MiB 2 KiB 82 MiB 2.7 TiB 0.00 0.07 12 up
24 hdd 2.72899 1.00000 2.7 TiB 547 MiB 406 MiB 15 KiB 142 MiB 2.7 TiB 0.02 0.38 12 up
25 hdd 2.72899 1.00000 2.7 TiB 1.3 GiB 938 MiB 4 KiB 408 MiB 2.7 TiB 0.05 0.93 14 up
26 hdd 2.72899 1.00000 2.7 TiB 1.1 GiB 827 MiB 4 KiB 284 MiB 2.7 TiB 0.04 0.77 10 up
TOTAL 66 TiB 34 GiB 24 GiB 77 MiB 9.6 GiB 66 TiB 0.05
MIN/MAX VAR: 0.07/12.62 STDDEV: 0.17
ceph04.ssc.wisc.edu> iostat
Linux 5.4.151-1.el8.elrepo.x86_64 (ceph04.ssc.wisc.edu) 12/02/21 _x86_64_ (24 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0.23 0.01 0.30 0.05 0.00 99.41
Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sdj 15.71 189.15 1087.20 899361787 5169458151
sdb 15.79 198.68 1087.20 944678058 5169458151
sda 0.16 18.13 1.40 86216232 6675304
sdg 0.19 20.00 3.23 95095360 15357700
sdh 0.18 19.60 2.17 93197576 10306048
sdm 0.17 19.59 1.62 93166944 7715396
sdk 0.17 19.17 0.98 91160628 4653560
sdn 0.18 20.91 0.48 99404428 2294820
sdf 0.16 17.77 0.03 84473116 148432
sdl 0.17 19.59 1.29 93146740 6140328
sdd 0.17 19.19 1.85 91257788 8816344
sdi 0.17 18.87 1.07 89719756 5068364
sdc 0.18 19.94 3.76 94810840 17890524
sde 0.16 17.64 0.04 83896044 170476
md127 0.02 2.51 0.00 11919016 0
md126 0.02 2.51 0.00 11919427 3106
md125 10.63 2.99 1090.95 14231587 5187271176
md124 0.02 2.09 0.00 9935088 1
dm-0 0.09 1.58 2.17 7504152 10306048
dm-1 0.01 0.04 0.04 187900 170476
dm-3 0.03 0.38 0.48 1802652 2294820
dm-2 0.01 0.02 0.03 103212 148432
dm-5 0.07 1.02 1.62 4826480 7715396
dm-4 0.05 0.71 1.07 3364572 5068364
dm-8 0.06 1.15 1.29 5468036 6140328
dm-6 0.05 0.87 0.98 4143684 4653560
dm-7 0.09 1.73 1.85 8211404 8816344
dm-9 0.15 2.61 3.76 12426216 17890524
dm-10 0.13 2.12 3.23 10063696 15357700
dm-11 0.07 0.95 1.40 4493496 6675304
/dev/sdn is where osd.14 is running. So no, it doesn't look like much activity is occurring on that disk compared to sdj and sdb, which hold the boot drives in RAID1.
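For anyone checking my work: I believe the OSD-to-device mapping can be confirmed with something like the following (going from memory, so the exact output fields may vary by release):

ceph04.ssc.wisc.edu> ceph osd metadata 14 | grep -E '"devices"|dev_node'
ceph04.ssc.wisc.edu> ceph device ls-by-daemon osd.14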
Lastly, I apologize, but I'm not sure how to find the logs for one specific OSD?
Hi,
It would be good to have the full output. Does iostat show the backing device performing I/O? Additionally, what does ceph -s show for cluster state? Also, can you check the logs on that OSD, and see if anything looks abnormal?
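On a cephadm deployment, something like this should pull recent logs for just that daemon (I'm going from memory, so adjust names as needed -- the systemd unit name includes your cluster fsid):

cephadm logs --name osd.14
# or directly via systemd on the host running osd.14:
journalctl -u ceph-<fsid>@osd.14 --since "1 day ago"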
David
On Thu, Dec 2, 2021 at 1:20 PM Zach Heise (SSCC) <heise@xxxxxxxxxxxx> wrote:
Good morning David,
Assuming you don't need the data for the other 31 OSDs, here's what osd.14 is showing:
ID  CLASS  WEIGHT   REWEIGHT  SIZE  RAW USE  DATA  OMAP  META  AVAIL  %USE  VAR  PGS  STATUS
14  hdd    2.72899  0         0 B   0 B      0 B   0 B   0 B   0 B    0     0    1    up
Zach
On 2021-12-01 5:20 PM, David Orman wrote:
What's "ceph osd df" show?
On Wed, Dec 1, 2021 at 2:20 PM Zach Heise (SSCC) <heise@xxxxxxxxxxxx> wrote:
I wanted to swap out an existing OSD while preserving its ID: remove the HDD behind osd.14 and give the ID of 14 to a new SSD taking its place in the same node. This is my first time doing this, so I'm not sure what to expect.
I followed the instructions here, using the --replace flag.
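For reference, my understanding of the documented flow (paraphrasing from memory -- the zap step and the device path below are just placeholders for whatever the new drive ends up as):

ceph orch osd rm 14 --replace
# physically swap the drive once the drain completes, then, only if the
# replacement drive has leftover data on it:
ceph orch device zap ceph04.ssc.wisc.edu /dev/sdX --force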
However, I'm a bit concerned that the operation is taking so long in my test cluster. Out of 70TB in the cluster, only 40GB were in use. This is a relatively large OSD in comparison to others in the cluster (2.7TB versus 300GB for most other OSDs) and yet it's been 36 hours with the following status:
ceph04.ssc.wisc.edu> ceph orch osd rm status
OSD_ID  HOST                 STATE     PG_COUNT  REPLACE  FORCE  DRAIN_STARTED_AT
14      ceph04.ssc.wisc.edu  draining  1         True     True   2021-11-30 15:22:23.469150+00:00

Another note: I don't know why it has "force = true" set; the command I ran was just ceph orch osd rm 14 --replace, without specifying --force. Hopefully not a big deal, but still strange.
At this point, is there any way to tell if it's still actually doing something, or whether it has hung? If it is hung, what would be the 'recommended' way to proceed? I know I could just manually eject the HDD from the chassis, run "ceph osd crush remove osd.14", and then manually delete the auth keys, etc., but the documentation seems to state that this shouldn't be necessary if an OSD replacement goes properly.
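If it's useful, these are the checks I assume would show whether the drain is actually making progress (not sure they're the right ones, so corrections welcome):

ceph orch osd rm status
ceph osd safe-to-destroy 14     # reports whether osd.14 can be removed without data loss
ceph pg ls-by-osd 14            # any PGs still mapped to osd.14?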
_______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx