Re: zap an osd and it appears again

Hi all,

We got hit by the same bug while doing some testing with cephadm on a test cluster.

The installed Ceph version is 16.2.7; we run the cephadm orchestrator but no dashboard.

We tried to remove an OSD using ceph orch osd rm 2 --zap.

The OSD was drained normally, but right after the disk was zapped the orchestrator added it back to the cluster:

2022-04-26T09:29:31.740508+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 5143 : cephadm [INF] osd.2 crush weight is 0.4882965087890625
2022-04-26T09:29:32.678558+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 5145 : cephadm [INF] osd.2 weight is now 0.0
2022-04-26T09:41:42.479593+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 5548 : cephadm [INF] osd.2 now down
2022-04-26T09:41:42.479832+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 5549 : cephadm [INF] Removing daemon osd.2 from ip-10-12-0-98.eu-central-1.compute.internal
2022-04-26T09:41:44.650366+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 5551 : cephadm [INF] Removing key for osd.2
2022-04-26T09:41:44.661482+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 5552 : cephadm [INF] Successfully removed osd.2 on ip-10-12-0-98.eu-central-1.compute.internal
2022-04-26T09:41:44.675287+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 5553 : cephadm [INF] Successfully purged osd.2 on ip-10-12-0-98.eu-central-1.compute.internal
2022-04-26T09:41:44.675430+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 5554 : cephadm [INF] Zapping devices for osd.2 on ip-10-12-0-98.eu-central-1.compute.internal
2022-04-26T09:41:46.629752+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 5556 : cephadm [INF] Successfully zapped devices for osd.2 on ip-10-12-0-98.eu-central-1.compute.internal
2022-04-26T09:42:03.331285+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 5565 : cephadm [INF] Deploying daemon osd.2 on ip-10-12-0-98.eu-central-1.compute.internal
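
The immediate redeploy suggests that an active, managed OSD service spec (a drivegroup matching all available devices) picked the freshly zapped disk up again. As a rough way to check this, assuming the usual cephadm commands, something like:

# list the OSD service specs the orchestrator currently manages
ceph orch ls osd
# dump the full spec, including placement and the unmanaged flag
ceph orch ls osd --export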


We then did the same with --replace added to the command, i.e.: ceph orch osd rm 2 --replace --zap

This is no better: the OSD was removed and then re-added almost instantly. Here is the cephadm log:

2022-04-26T09:55:21.478379+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 5969 : cephadm [INF] osd.2 now out
2022-04-26T09:55:30.327466+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 5982 : cephadm [INF] osd.2 now down
2022-04-26T09:55:30.327611+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 5983 : cephadm [INF] Removing daemon osd.2 from ip-10-12-0-98.eu-central-1.compute.internal
2022-04-26T09:55:33.099252+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 5986 : cephadm [INF] Removing key for osd.2
2022-04-26T09:55:33.117638+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 5987 : cephadm [INF] Successfully removed osd.2 on ip-10-12-0-98.eu-central-1.compute.internal
2022-04-26T09:55:33.133074+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 5988 : cephadm [INF] Successfully destroyed old osd.2 on ip-10-12-0-98.eu-central-1.compute.internal; ready for replacement
2022-04-26T09:55:33.133133+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 5989 : cephadm [INF] Zapping devices for osd.2 on ip-10-12-0-98.eu-central-1.compute.internal
2022-04-26T09:55:35.432259+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 5991 : cephadm [INF] Successfully zapped devices for osd.2 on ip-10-12-0-98.eu-central-1.compute.internal
2022-04-26T09:55:35.448361+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 5992 : cephadm [INF] Found osd claims -> {'ip-10-12-0-98': ['2']}
2022-04-26T09:55:35.448466+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 5993 : cephadm [INF] Found osd claims for drivegroup default_drives -> {'ip-10-12-0-98': ['2']}
2022-04-26T09:55:54.573100+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 6004 : cephadm [INF] Deploying daemon osd.2 on ip-10-12-0-98.eu-central-1.compute.internal
2022-04-26T09:56:04.147451+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 6009 : cephadm [INF] Detected new or changed devices on ip-10-12-0-98.eu-central-1.compute.internal
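
For what it's worth, the "Found osd claims" lines look consistent with how --replace is meant to behave: the OSD id is marked destroyed and kept reserved so the drivegroup can re-use it, which would explain why osd.2 comes straight back on the same host. If ceph osd tree still accepts a state filter (as it does in the versions we know), the reserved id can be seen with:

# list OSDs currently flagged as destroyed / awaiting replacement
ceph osd tree destroyed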


We tried one last time with --force: ceph orch osd rm 2 --replace --zap --force. We still see the same behavior.


2022-04-26T12:00:13.208680+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 9737 : cephadm [INF] osd.2 crush weight is 0.4882965087890625
2022-04-26T12:00:13.885870+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 9738 : cephadm [INF] osd.2 weight is now 0.0
2022-04-26T12:12:16.651546+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 10127 : cephadm [INF] osd.2 now down
2022-04-26T12:12:16.651984+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 10128 : cephadm [INF] Removing daemon osd.2 from ip-10-12-0-98.eu-central-1.compute.internal
2022-04-26T12:12:18.775694+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 10130 : cephadm [INF] Removing key for osd.2
2022-04-26T12:12:18.785669+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 10131 : cephadm [INF] Successfully removed osd.2 on ip-10-12-0-98.eu-central-1.compute.internal
2022-04-26T12:12:18.799167+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 10132 : cephadm [INF] Successfully purged osd.2 on ip-10-12-0-98.eu-central-1.compute.internal
2022-04-26T12:12:18.799305+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 10133 : cephadm [INF] Zapping devices for osd.2 on ip-10-12-0-98.eu-central-1.compute.internal
2022-04-26T12:12:20.668898+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 10135 : cephadm [INF] Successfully zapped devices for osd.2 on ip-10-12-0-98.eu-central-1.compute.internal
2022-04-26T12:12:37.813769+0000 mgr.ip-10-12-0-209.eu-central-1.compute.internal.pjsjcm (mgr.14116) 10144 : cephadm [INF] Deploying daemon osd.2 on ip-10-12-0-98.eu-central-1.compute.internal


This is quite problematic: when we zap a disk, we expect the orchestrator to at least leave the disk empty until it is unplugged.
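
A workaround that might help, sketched here but not validated by us: export the OSD service spec, mark it unmanaged, re-apply it, and only then remove and zap the OSD, so the orchestrator no longer matches the freed device. Note that this targets the specific drivegroup seen in our logs (default_drives) rather than creating a separate all-available-devices service:

# export the current OSD spec (the drivegroup in our logs is called default_drives)
ceph orch ls osd --export > osd-spec.yaml
# edit osd-spec.yaml and add the line:  unmanaged: true
ceph orch apply -i osd-spec.yaml
# now removing and zapping should leave the disk empty
ceph orch osd rm 2 --zap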

Luis Domingues
Proton AG


------- Original Message -------
On Thursday, March 31st, 2022 at 09:32, Dhairya Parmar <dparmar@xxxxxxxxxx> wrote:


> Can you try using the --force option with your command?
>
> On Thu, Mar 31, 2022 at 1:25 AM Alfredo Rezinovsky alfrenovsky@xxxxxxxxx
>
> wrote:
>
> > I want to create osds manually
> >
> > If I zap the osd 0 with:
> >
> > ceph orch osd rm 0 --zap
> >
> > as soon as the dev is available the orchestrator creates it again
> >
> > If I use:
> >
> > ceph orch apply osd --all-available-devices --unmanaged=true
> >
> > and then zap osd.0, it also appears again.
> >
> > Is there a real way to disable the orch apply persistency, or to disable
> > it temporarily?
> >
> > --
> > Alfrenovsky
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


