Re: Replacing OSD with containerized deployment


 



What does your OSD service specification look like? Did your DB/WAL device show as having free space prior to the OSD creation?
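For reference, both of those can usually be checked with something along these lines (output and columns vary a bit between releases):

  # dump the OSD service spec(s) cephadm is applying
  ceph orch ls osd --export

  # device inventory as the orchestrator sees it (size, availability, reject reasons)
  ceph orch device ls --wide

  # on the OSD host: remaining free space in the NVMe volume groups
  vgs
  lvs -o lv_name,vg_name,lv_size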

On Tue, Jan 31, 2023, at 04:01, mailing-lists wrote:
> OK, the OSD is filled again. It is in and up, but it is not using the NVMe 
> WAL/DB anymore.
>
> It also looks like the LVM volume group of the old OSD is still on the NVMe 
> drive. I suspect this because the two NVMe drives still have 9 LVM groups 
> each: 18 groups in total, but only 17 OSDs are using the NVMe (as shown in 
> the dashboard).
>
>
> Do you have a hint on how to fix this?
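One way to confirm and clean that up, roughly (the LV/VG names below are placeholders, and the remove commands are destructive, so double-check which LV belongs to which OSD first):

  # map the LVs on the host to OSD ids (ceph-volume tags them with ceph.osd_id)
  cephadm ceph-volume lvm list
  lvs -o lv_name,vg_name,lv_tags | grep osd_id

  # a DB LV on the NVMe that no existing OSD claims would be the leftover; then e.g.
  lvremove <vg_name>/<leftover_db_lv>
  vgremove <vg_name>            # only if the VG contains nothing else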
>
>
>
> Best
>
> Ken
>
>
>
> On 30.01.23 16:50, mailing-lists wrote:
>> Oh wait,
>>
>> I might have been too impatient:
>>
>>
>> 1/30/23 4:43:07 PM[INF]Deploying daemon osd.232 on ceph-a1-06
>>
>> 1/30/23 4:42:26 PM[INF]Found osd claims for drivegroup 
>> dashboard-admin-1661788934732 -> {'ceph-a1-06': ['232']}
>>
>> 1/30/23 4:42:26 PM[INF]Found osd claims -> {'ceph-a1-06': ['232']}
>>
>> 1/30/23 4:42:19 PM[INF]Found osd claims -> {'ceph-a1-06': ['232']}
>>
>> 1/30/23 4:41:01 PM[INF]Found osd claims for drivegroup 
>> dashboard-admin-1661788934732 -> {'ceph-a1-06': ['232']}
>>
>> 1/30/23 4:41:01 PM[INF]Found osd claims -> {'ceph-a1-06': ['232']}
>>
>> 1/30/23 4:41:01 PM[INF]Found osd claims -> {'ceph-a1-06': ['232']}
>>
>> 1/30/23 4:41:00 PM[INF]Found osd claims -> {'ceph-a1-06': ['232']}
>>
>> 1/30/23 4:39:34 PM[INF]Found osd claims for drivegroup 
>> dashboard-admin-1661788934732 -> {'ceph-a1-06': ['232']}
>>
>> 1/30/23 4:39:34 PM[INF]Found osd claims -> {'ceph-a1-06': ['232']}
>>
>> 1/30/23 4:39:34 PM[INF]Found osd claims -> {'ceph-a1-06': ['232']}
>>
>>
>>
>> It doesn't show the NVMe as WAL/DB yet, but I will let the cluster reach 
>> a clean state before I do anything further.
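Whether the redeployed OSD really got a DB on the NVMe can be checked with something like this (field names differ slightly between releases):

  ceph osd metadata 232 | grep -i bluefs

  # or, on the OSD host, list what ceph-volume knows about the OSD's devices
  cephadm ceph-volume lvm list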
>>
>>
>> On 30.01.23 16:42, mailing-lists wrote:
>>> root@ceph-a2-01:/# ceph osd destroy 232 --yes-i-really-mean-it
>>> destroyed osd.232
>>>
>>>
>>> OSD 232 shows now as destroyed and out in the dashboard.
>>>
>>>
>>> root@ceph-a1-06:/# ceph-volume lvm zap /dev/sdm
>>> --> Zapping: /dev/sdm
>>> --> --destroy was not specified, but zapping a whole device will 
>>> remove the partition table
>>> Running command: /usr/bin/dd if=/dev/zero of=/dev/sdm bs=1M count=10 
>>> conv=fsync
>>>  stderr: 10+0 records in
>>> 10+0 records out
>>>  stderr: 10485760 bytes (10 MB, 10 MiB) copied, 0.0675647 s, 155 MB/s
>>> --> Zapping successful for: <Raw Device: /dev/sdm>
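As a side note: zapping only /dev/sdm does not touch the DB LV on the NVMe. Per the ceph-volume docs (as far as they can be read that way), zapping by OSD id with --destroy is meant to clean up all devices that belonged to the OSD, along the lines of:

  ceph-volume lvm zap --destroy --osd-id 232

That is destructive, so it only makes sense before the id gets reused.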
>>>
>>>
>>> root@ceph-a2-01:/# ceph orch device ls
>>>
>>> ceph-a1-06  /dev/sdm      hdd   TOSHIBA_X_X 16.0T 21m ago *locked*
>>>
>>>
>>> It shows as locked and is not automatically added now, which is good, I 
>>> think? Otherwise it would probably have become a new OSD 307.
>>>
>>>
>>> root@ceph-a2-01:/# ceph orch osd rm status
>>> No OSD remove/replace operations reported
>>>
>>> root@ceph-a2-01:/# ceph orch osd rm 232 --replace
>>> Unable to find OSDs: ['232']
>>>
>>>
>>> Unfortunately, it is still not being replaced.
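As far as the orchestrator behaviour goes, "ceph orch osd rm" appears to only queue OSDs that still have a running daemon; once the OSD is destroyed and the data device zapped, the OSD service spec is supposed to claim the freed id and redeploy on its own. A few things that can help verify or nudge that (a sketch):

  # force a fresh device inventory and watch what cephadm is doing
  ceph orch device ls --refresh
  ceph log last 50 info cephadm

  # make sure the OSD spec is still managed and matches the host/device
  ceph orch ls osd --export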
>>>
>>>
>>> It is so weird; I tried exactly this procedure in my virtual Ceph 
>>> environment and it just worked. The real cluster is acting up now. -.-
>>>
>>>
>>> Do you have more hints for me?
>>>
>>> Thank you for your help so far!
>>>
>>>
>>> Best
>>>
>>> Ken
>>>
>>>
>>> On 30.01.23 15:46, David Orman wrote:
>>>> The 'down' status is why it's not being replaced; 'destroyed' is what 
>>>> would allow the replacement. I'm not sure why --replace led to that 
>>>> scenario, but you will probably need to mark it destroyed for it to be 
>>>> replaced.
>>>>
>>>> https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd 
>>>> has instructions on the non-orch way of doing that. You only need steps 1 and 2.
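For completeness, the relevant part of that procedure boils down to roughly:

  # 1) wait until the OSD is safe to destroy
  while ! ceph osd safe-to-destroy 232 ; do sleep 10 ; done

  # 2) mark it destroyed, keeping the id for reuse
  ceph osd destroy 232 --yes-i-really-mean-it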
>>>>
>>>> You should look through your logs to see why the OSD was marked down 
>>>> and not destroyed. Obviously, make sure you understand the ramifications 
>>>> before running any commands. :)
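The cephadm side of those logs can be pulled with, for example:

  ceph log last 100 info cephadm

  # and on the OSD host, the container logs of the daemon in question
  cephadm logs --name osd.232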
>>>>
>>>> David
>>>>
>>>> On Mon, Jan 30, 2023, at 04:24, mailing-lists wrote:
>>>>> # ceph orch osd rm status
>>>>> No OSD remove/replace operations reported
>>>>> # ceph orch osd rm 232 --replace
>>>>> Unable to find OSDs: ['232']
>>>>>
>>>>> It is not finding 232 anymore. It is still shown as down and out in 
>>>>> the Ceph dashboard.
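Whether 232 still exists in the cluster map at all (rather than only lingering in the dashboard) can be checked with e.g.:

  ceph osd tree down
  ceph osd dump | grep '^osd.232 '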
>>>>>
>>>>>
>>>>>       pgs:     3236 active+clean
>>>>>
>>>>>
>>>>> This is the new disk, shown as locked (because it is not zapped at the moment).
>>>>>
>>>>> # ceph orch device ls
>>>>>
>>>>> ceph-a1-06  /dev/sdm      hdd   TOSHIBA_X_X 16.0T 9m ago
>>>>> locked
>>>>>
>>>>>
>>>>> Best
>>>>>
>>>>> Ken
>>>>>
>>>>>
>>>>> On 29.01.23 18:19, David Orman wrote:
>>>>>> What does "ceph orch osd rm status" show before you try the zap? Is
>>>>>> your cluster still backfilling to the other OSDs for the PGs that 
>>>>>> were
>>>>>> on the failed disk?
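The backfill state is easy to check, for example:

  ceph -s
  ceph pg stat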
>>>>>>
>>>>>> David
>>>>>>
>>>>>> On Fri, Jan 27, 2023, at 03:25, mailing-lists wrote:
>>>>>>> Dear Ceph-Users,
>>>>>>>
>>>>>>> I am struggling to replace a disk. My Ceph cluster is not replacing 
>>>>>>> the old OSD even though I ran:
>>>>>>>
>>>>>>> ceph orch osd rm 232 --replace
>>>>>>>
>>>>>>> OSD 232 is still shown in the OSD list, but the new HDD gets placed 
>>>>>>> as a new OSD. This wouldn't bother me much if the new OSD were also 
>>>>>>> placed on the BlueStore DB / NVMe device, but it isn't.
>>>>>>>
>>>>>>>
>>>>>>> My steps:
>>>>>>>
>>>>>>> "ceph orch osd rm 232 --replace"
>>>>>>>
>>>>>>> Remove the failed HDD.
>>>>>>>
>>>>>>> Add the new one.
>>>>>>>
>>>>>>> Convert the disk within the server's BIOS so that the node has 
>>>>>>> direct access to it.
>>>>>>>
>>>>>>> It shows up as /dev/sdt.
>>>>>>>
>>>>>>> Enter maintenance mode.
>>>>>>>
>>>>>>> Reboot the server.
>>>>>>>
>>>>>>> The drive is now /dev/sdm (the name the old drive had).
>>>>>>>
>>>>>>> "ceph orch device zap node-x /dev/sdm"
>>>>>>>
>>>>>>> A new OSD is placed on the cluster.
>>>>>>>
>>>>>>>
>>>>>>> Can you give me a hint as to where I took a wrong turn? Why is the 
>>>>>>> disk not being used as OSD 232?
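For comparison, the replacement flow described in the cephadm docs looks roughly like this (a sketch; it assumes the OSD spec that created osd.232 also matches the replacement drive and the NVMe DB device, and that there is still free space in the DB volume group):

  ceph orch osd rm 232 --replace     # drain, then mark the OSD 'destroyed', keeping the id
  ceph orch osd rm status            # wait until the drain has finished

  # physically swap the drive, then wipe it if it carries old data
  ceph orch device zap node-x /dev/sdm --force

  # cephadm should then redeploy an OSD on the device, reusing id 232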
>>>>>>>
>>>>>>>
>>>>>>> Best
>>>>>>>
>>>>>>> Ken
>>>>>>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



