OK, the OSD is backfilled again. It is in and up, but it is no longer using the NVMe WAL/DB.
And it looks like the LVM group of the old OSD is still on the NVMe drive. I suspect this because the two NVMe drives still have 9 LVM groups each: 18 groups in total, but only 17 OSDs are using the NVMe (as shown in the dashboard).
Do you have a hint on how to fix this?
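My current idea would be to track down the leftover DB LV of the old OSD and remove it roughly like this (the VG/LV names are placeholders; I would of course double-check them against the real output before destroying anything):
root@ceph-a1-06:/# lvs -o lv_name,vg_name,lv_tags | grep ceph   # LVs with their ceph tags (osd id, type=db)
root@ceph-a1-06:/# ceph-volume lvm list                         # same information from ceph-volume's point of view
root@ceph-a1-06:/# ceph-volume lvm zap --destroy ceph-db-vg-PLACEHOLDER/osd-db-PLACEHOLDER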
Best
Ken
On 30.01.23 16:50, mailing-lists wrote:
Oh wait,
I might have been too impatient:
1/30/23 4:43:07 PM[INF]Deploying daemon osd.232 on ceph-a1-06
1/30/23 4:42:26 PM[INF]Found osd claims for drivegroup
dashboard-admin-1661788934732 -> {'ceph-a1-06': ['232']}
1/30/23 4:42:26 PM[INF]Found osd claims -> {'ceph-a1-06': ['232']}
1/30/23 4:42:19 PM[INF]Found osd claims -> {'ceph-a1-06': ['232']}
1/30/23 4:41:01 PM[INF]Found osd claims for drivegroup
dashboard-admin-1661788934732 -> {'ceph-a1-06': ['232']}
1/30/23 4:41:01 PM[INF]Found osd claims -> {'ceph-a1-06': ['232']}
1/30/23 4:41:01 PM[INF]Found osd claims -> {'ceph-a1-06': ['232']}
1/30/23 4:41:00 PM[INF]Found osd claims -> {'ceph-a1-06': ['232']}
1/30/23 4:39:34 PM[INF]Found osd claims for drivegroup
dashboard-admin-1661788934732 -> {'ceph-a1-06': ['232']}
1/30/23 4:39:34 PM[INF]Found osd claims -> {'ceph-a1-06': ['232']}
1/30/23 4:39:34 PM[INF]Found osd claims -> {'ceph-a1-06': ['232']}
It doesn't show the NVMe as WAL/DB yet, though, but I will let it settle into a clean state before I do anything further.
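Once it has settled, I will check whether the rebuilt OSD really got a dedicated DB device, roughly like this (osd.232 as the example; the metadata field names may vary a bit between releases):
root@ceph-a2-01:/# ceph osd metadata 232 | grep -i db
root@ceph-a1-06:/# ceph-volume lvm list    # should list a separate [db] entry for osd.232 if the NVMe is used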
On 30.01.23 16:42, mailing-lists wrote:
root@ceph-a2-01:/# ceph osd destroy 232 --yes-i-really-mean-it
destroyed osd.232
OSD 232 now shows as destroyed and out in the dashboard.
root@ceph-a1-06:/# ceph-volume lvm zap /dev/sdm
--> Zapping: /dev/sdm
--> --destroy was not specified, but zapping a whole device will
remove the partition table
Running command: /usr/bin/dd if=/dev/zero of=/dev/sdm bs=1M count=10
conv=fsync
stderr: 10+0 records in
10+0 records out
stderr: 10485760 bytes (10 MB, 10 MiB) copied, 0.0675647 s, 155 MB/s
--> Zapping successful for: <Raw Device: /dev/sdm>
root@ceph-a2-01:/# ceph orch device ls
ceph-a1-06 /dev/sdm hdd TOSHIBA_X_X 16.0T 21m ago *locked*
It shows as locked and is not automatically added now, which is good, I think? Otherwise it would probably become a new osd.307.
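To see why it is flagged, the wide/refreshed device listing should include the reject reasons (assuming a reasonably recent cephadm; adjust the flags if yours differs):
root@ceph-a2-01:/# ceph orch device ls ceph-a1-06 --wide --refresh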
root@ceph-a2-01:/# ceph orch osd rm status
No OSD remove/replace operations reported
root@ceph-a2-01:/# ceph orch osd rm 232 --replace
Unable to find OSDs: ['232']
Unfortunately it is still not replacing.
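If I understand it correctly, "ceph orch osd rm" only acts on OSDs the orchestrator still knows a daemon for, so after the earlier --replace/destroy there may be nothing left for it to act on. Checking what state the id is actually in should be possible with:
root@ceph-a2-01:/# ceph osd tree | grep -w 232     # does the id still exist, and is it marked destroyed?
root@ceph-a2-01:/# ceph osd dump | grep osd.232    # per-OSD state flags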
It is so weird; I tried exactly this procedure in my virtual Ceph environment and it just worked. The real cluster is acting up now. -.-
Do you have more hints for me?
Thank you for your help so far!
Best
Ken
On 30.01.23 15:46, David Orman wrote:
The 'down' status is why it's not being replaced; 'destroyed' is the state that would allow the replacement. I'm not sure why --replace led to that scenario, but you will probably need to mark it destroyed for it to be replaced.
https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd
has instructions on the non-orch way of doing that. You only need steps 1 and 2.
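Paraphrasing the first two steps from that page for osd.232 / /dev/sdm (re-check against the docs for your release before running anything):
ceph osd destroy 232 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sdm
The prepare step on that page (ceph-volume lvm prepare --osd-id 232 --data /dev/sdm) is the part the orchestrator should take care of for you once the disk is clean.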
You should look through your logs to see why the OSD was marked down rather than destroyed. Obviously, make sure you understand the ramifications before running any commands. :)
David
On Mon, Jan 30, 2023, at 04:24, mailing-lists wrote:
# ceph orch osd rm status
No OSD remove/replace operations reported
# ceph orch osd rm 232 --replace
Unable to find OSDs: ['232']
It is not finding 232 anymore. It is still shown as down and out in the Ceph dashboard.
pgs: 3236 active+clean
This is the new disk, shown as locked (because it has not been zapped yet).
# ceph orch device ls
ceph-a1-06 /dev/sdm hdd TOSHIBA_X_X 16.0T 9m ago locked
Best
Ken
On 29.01.23 18:19, David Orman wrote:
What does "ceph orch osd rm status" show before you try the zap? Is
your cluster still backfilling to the other OSDs for the PGs that
were
on the failed disk?
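In other words, before wiping and re-using the disk it's worth confirming the data from the failed OSD has fully migrated, e.g.:
ceph -s                          # look for backfill/degraded PG states
ceph osd safe-to-destroy 232     # reports whether removing the id would reduce data durability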
David
On Fri, Jan 27, 2023, at 03:25, mailing-lists wrote:
Dear Ceph-Users,
I am struggling to replace a disk. My Ceph cluster is not replacing the old OSD even though I did:
ceph orch osd rm 232 --replace
OSD 232 is still shown in the OSD list, but the new HDD gets placed as a new OSD. This wouldn't bother me much if the new OSD were also placed on the BlueStore DB / NVMe, but it isn't.
My steps:
"ceph orch osd rm 232 --replace"
Remove the failed HDD.
Add the new one.
Convert the disk within the server's BIOS so that the node has direct access to it.
It shows up as /dev/sdt.
Enter maintenance mode.
Reboot the server.
The drive is now /dev/sdm (which the old drive had).
"ceph orch device zap node-x /dev/sdm"
A new OSD is placed on the cluster.
Can you give me a hint as to where I took a wrong turn? Why is the disk not being used as OSD 232?
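For context, the OSD service spec (drivegroup) behind this can be dumped with "ceph orch ls osd --export". The sketch below only shows what such a spec roughly looks like, with placeholder filters, not my actual dashboard-admin spec:
service_type: osd
service_id: dashboard-admin-1661788934732
placement:
  host_pattern: '*'              # placeholder
spec:
  data_devices:
    rotational: 1                # HDDs become data devices
  db_devices:
    rotational: 0                # NVMe/SSD carry the BlueStore DB/WAL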
Best
Ken
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx