Re: Cephadm cannot acquire lock

Hi David,

It looks like we are affected by the same bug, thanks for the hint.

We're running Pacific 16.2.0, and I'm looking forward to upgrading to the latest Pacific release, but the last upgrade I tried was not successful. In hindsight, the same bug was causing the problem.

Now, my (naïve) upgrade strategy would be to launch the upgrade process with the orchestrator and kill the stuck cephadm process whenever it shows up. Assuming that the devices are not going to change, I think it's going to work.

Thanks again, kind regards.

On 02/09/2021 15:03, David Orman wrote:
It may be this:

https://tracker.ceph.com/issues/50526
https://github.com/alfredodeza/remoto/issues/62

Which we resolved with: https://github.com/alfredodeza/remoto/pull/63

What version of ceph are you running, and is it impacted by the above?

David

On Thu, Sep 2, 2021 at 9:53 AM fcid <fcid@xxxxxxxxxxx> wrote:
Hi Sebastian,

Following your suggestion, I've found this process:

/usr/bin/python3
/var/lib/ceph/<FSID>/cephadm.f77d9d71514a634758d4ad41ab6eef36d25386c99d8b365310ad41f9b74d5ce6
--image
ceph/ceph@sha256:9b04c0f15704c49591640a37c7adfd40ffad0a4b42fecb950c3407687cb4f29a
ceph-volume --fsid <FSID> -- lvm list --format json

That process had been running for more than 12 hours, so I killed it,
and then cephadm could acquire the lock. Shortly afterwards the process
started again, and I can see that it is running on all the nodes (we
have 3 nodes). I tried executing the same command from the command line
on all the nodes, and it works fine. Here is the output:
https://pastebin.com/v58Nyxdx
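
For anyone who wants to spot this kind of stuck process without reading the `ps` output by eye, here is a minimal sketch. It assumes a procps-style `ps` that supports the `etimes` column (elapsed seconds); the function names and the one-hour threshold are my own choices for illustration, not anything taken from cephadm.

```python
import subprocess


def find_long_running(ps_output, needle="ceph-volume", min_seconds=3600):
    """Return (pid, elapsed_seconds, command) for long-running cephadm calls.

    ps_output is the text printed by `ps -eo pid,etimes,args`.
    needle and min_seconds are illustrative defaults, not cephadm settings.
    """
    matches = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        # Skip the header line and anything that is not "pid elapsed args".
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        pid, elapsed, args = int(parts[0]), int(parts[1]), parts[2]
        if "cephadm" in args and needle in args and elapsed >= min_seconds:
            matches.append((pid, elapsed, args))
    return matches


def list_processes():
    """Run ps and return its output (assumes procps `ps` with `etimes`)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,etimes,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `find_long_running(list_processes())` on each node would list the candidates. The sketch only reports them; whether to kill one is still a manual decision.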

What could be causing this process to get stuck when it is launched by the
orchestrator, given that launching it from the command line works fine?

Thank you, kind regards.

On 02/09/2021 05:19, Sebastian Wagner wrote:
On 31.08.21 at 04:05, fcid wrote:
Hi ceph community,

I'm having some trouble trying to delete an OSD.

I've been using cephadm in one of our clusters and it works fine,
but lately, after an OSD failure, I cannot delete the OSD using the
orchestrator. Since the orchestrator is not working (for some unknown
reason), I tried to delete the OSD manually using the following command:

ceph osd purge <id> --yes-i-really-mean-it

This command removed the OSD from the CRUSH map, but then the warning
CEPHADM_FAILED_DAEMON appeared. So the next step is to delete the daemon
on the server that used to host the failed OSD. The command I used
here was the following:

cephadm rm-daemon --name osd.<id> --fsid <FSID>

But this command does not work because, according to the log, cephadm
cannot acquire the lock:

2021-08-30 21:50:09,712 DEBUG Lock 139899822730784 not acquired on /run/cephadm/$FSID.lock, waiting 0.05 seconds ...
2021-08-30 21:50:09,762 DEBUG Acquiring lock 139899822730784 on /run/cephadm/$FSID.lock
2021-08-30 21:50:09,763 DEBUG Lock 139899822730784 not acquired on /run/cephadm/$FSID.lock, waiting 0.05 seconds ...

The file /run/cephadm/$FSID.lock does exist. Can I safely remove it?
What should I check before doing so?
Yes, provided you're sure that no other cephadm process is stuck
(check with `ps`).
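
As a complement to checking `ps`, one can also probe whether something still holds the lock before touching the file. The sketch below is only that, a sketch: it assumes the lock is an flock-style advisory lock, which I have not confirmed for every cephadm version, and `lock_is_held` is a hypothetical helper name. It tries a non-blocking exclusive lock and releases it immediately if it succeeds.

```python
import errno
import fcntl
import os


def lock_is_held(path):
    """Return True if some other open file holds an flock on `path`.

    Assumption: the lock file is protected with flock-style advisory
    locking. If it is not, this check tells you nothing.
    """
    # A read-only descriptor is enough for flock and avoids modifying the file.
    fd = os.open(path, os.O_RDONLY)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno in (errno.EAGAIN, errno.EACCES):
                return True  # somebody else holds the lock
            raise
        # We got the lock, so nobody held it; give it back right away.
        fcntl.flock(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)
```

If this returns True for the lock file, some process still has it locked, and finding and dealing with that process is probably safer than deleting the file.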

I'd really appreciate any hint you can give on this matter.

Thanks! Regards.

--
AltaVoz <https://www.altavoz.net/>
Fernando Cid
Ingeniero de Operaciones
www.altavoz.net <https://www.altavoz.net/>
Ubicación AltaVoz
Viña del Mar: 2 Poniente 355 of 53
<https://www.altavoz.net/altavoz/contacto> | +56 32 276 8060
<tel:+56322768060>
Santiago: Antonio Bellet 292 of 701
<https://www.altavoz.net/altavoz/contacto> | +56 2 2585 4264
<tel:+562225854264>

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx