I'm not familiar with docker yet, but apparently the cleanup doesn't
work? Would something like this work?

docker network disconnect -f host $container

Maybe it's the same as 'docker network prune', I don't know; I also
don't have a docker environment available, and podman seems to work
slightly differently.
Quoting Ml Ml <mliebherr99@xxxxxxxxxxxxxx>:
Hello Eugen,
cephadm ls for OSD.41:
{
    "style": "cephadm:v1",
    "name": "osd.41",
    "fsid": "5436dd5d-83d4-4dc8-a93b-60ab5db145df",
    "systemd_unit": "ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41",
    "enabled": true,
    "state": "error",
    "container_id": null,
    "container_image_name": "docker.io/ceph/ceph:v15.2.5",
    "container_image_id": null,
    "version": null,
    "started": null,
    "created": "2020-07-28T12:42:17.292765",
    "deployed": "2020-10-21T11:29:36.284462",
    "configured": "2020-10-21T11:29:47.032038"
},
root@ceph06:~# systemctl start
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service
Job for ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service
failed because the control process exited with error code.
See "systemctl status
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service" and
"journalctl -xe" for details.
● ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service - Ceph
osd.41 for 5436dd5d-83d4-4dc8-a93b-60ab5db145df
Loaded: loaded
(/etc/systemd/system/ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@.service;
enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Mon 2020-11-02 10:56:50
CET; 9min ago
Process: 430022 ExecStartPre=/usr/bin/docker rm
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df-osd.41 (code=exited,
status=1/FAILURE)
Process: 430040 ExecStart=/bin/bash
/var/lib/ceph/5436dd5d-83d4-4dc8-a93b-60ab5db145df/osd.41/unit.run
(code=exited, status=125)
Process: 430159 ExecStopPost=/bin/bash
/var/lib/ceph/5436dd5d-83d4-4dc8-a93b-60ab5db145df/osd.41/unit.poststop
(code=exited, status=0/SUCCESS)
Main PID: 430040 (code=exited, status=125)
Tasks: 51 (limit: 9830)
Memory: 31.0M
CGroup:
/system.slice/system-ceph\x2d5436dd5d\x2d83d4\x2d4dc8\x2da93b\x2d60ab5db145df.slice/ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service
├─224974 /bin/bash
/var/lib/ceph/5436dd5d-83d4-4dc8-a93b-60ab5db145df/osd.41/unit.run
└─225079 /usr/bin/docker run --rm --net=host --ipc=host
--privileged --group-add=disk --name
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df-osd.41 -e
CONTAINER_IMAGE=docker.io/ceph/ceph:v15.2.5 -e NODE_NAME......
Nov 02 10:56:50 ceph06 systemd[1]: Failed to start Ceph osd.41 for
5436dd5d-83d4-4dc8-a93b-60ab5db145df.
Nov 02 11:01:21 ceph06 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service: Start
request repeated too quickly.
Nov 02 11:01:21 ceph06 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service: Failed with
result 'exit-code'.
Nov 02 11:01:21 ceph06 systemd[1]: Failed to start Ceph osd.41 for
5436dd5d-83d4-4dc8-a93b-60ab5db145df.
Nov 02 11:01:49 ceph06 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service: Start
request repeated too quickly.
Nov 02 11:01:49 ceph06 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service: Failed with
result 'exit-code'.
Nov 02 11:01:49 ceph06 systemd[1]: Failed to start Ceph osd.41 for
5436dd5d-83d4-4dc8-a93b-60ab5db145df.
Nov 02 11:05:34 ceph06 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service: Start
request repeated too quickly.
Nov 02 11:05:34 ceph06 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service: Failed with
result 'exit-code'.
Nov 02 11:05:34 ceph06 systemd[1]: Failed to start Ceph osd.41 for
5436dd5d-83d4-4dc8-a93b-60ab5db145df.
If I run it manually, I get:
root@ceph06:~# /usr/bin/docker run --rm --net=host --ipc=host
--privileged --group-add=disk --name
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df-osd.41 -e
CONTAINER_IMAGE=docker.io/ceph/ceph:v15.2.5 -e NODE_NAME=ceph06 -v
/var/run/ceph/5436dd5d-83d4-4dc8-a93b-60ab5db145df:/var/run/ceph:z -v
/var/log/ceph/5436dd5d-83d4-4dc8-a93b-60ab5db145df:/var/log/ceph:z -v
/var/lib/ceph/5436dd5d-83d4-4dc8-a93b-60ab5db145df/crash:/var/lib/ceph/crash:z
-v
/var/lib/ceph/5436dd5d-83d4-4dc8-a93b-60ab5db145df/osd.41:/var/lib/ceph/osd/ceph-41:z
-v
/var/lib/ceph/5436dd5d-83d4-4dc8-a93b-60ab5db145df/osd.41/config:/etc/ceph/ceph.conf:z
-v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm
-v /run/lock/lvm:/run/lock/lvm --entrypoint /usr/bin/ceph-osd
docker.io/ceph/ceph:v15.2.5 -n osd.41 -f --setuser ceph --setgroup
ceph --default-log-to-file=false --default-log-to-stderr=true
--default-log-stderr-prefix=debug
/usr/bin/docker: Error response from daemon: endpoint with name
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df-osd.41 already exists in
network host.
Can you see a container ID error here?
The cluster ID is: 5436dd5d-83d4-4dc8-a93b-60ab5db145df
On Mon, Nov 2, 2020 at 10:03 AM Eugen Block <eblock@xxxxxx> wrote:
Hi,
are you sure it's the right container ID you're using for the restart?
I noticed that 'cephadm ls' shows older containers after a daemon had
to be recreated (a MGR in my case). Maybe you're trying to restart a
daemon that was already removed?
Regards,
Eugen
Quoting Ml Ml <mliebherr99@xxxxxxxxxxxxxx>:
> Hello List,
> sometimes some OSDs get taken out for some reason (I am still looking
> for the reason, and I guess it's due to some overload). However, when I
> try to restart them I get:
>
> Nov 02 08:05:26 ceph05 bash[9811]: Error: No such container:
> ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df-osd.47
> Nov 02 08:05:29 ceph05 bash[9811]: /usr/bin/docker: Error response
> from daemon: endpoint with name
> ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df-osd.47 already exists in
> network host.
> Nov 02 08:05:29 ceph05 systemd[1]:
> ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.47.service: Main process
> exited, code=exited, status=125/n/a
> Nov 02 08:05:34 ceph05 systemd[1]:
> ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.47.service: Failed with
> result 'exit-code'.
> Nov 02 08:05:44 ceph05 systemd[1]:
> ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.47.service: Service
> RestartSec=10s expired, scheduling restart.
> Nov 02 08:05:44 ceph05 systemd[1]:
> ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.47.service: Scheduled
> restart job, restart counter is at 5.
> Nov 02 08:05:44 ceph05 systemd[1]: Stopped Ceph osd.47 for
> 5436dd5d-83d4-4dc8-a93b-60ab5db145df.
> Nov 02 08:05:44 ceph05 systemd[1]:
> ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.47.service: Start
> request repeated too quickly.
> Nov 02 08:05:44 ceph05 systemd[1]:
> ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.47.service: Failed with
> result 'exit-code'.
> Nov 02 08:05:44 ceph05 systemd[1]: Failed to start Ceph osd.47 for
> 5436dd5d-83d4-4dc8-a93b-60ab5db145df.
>
> I need to reboot the whole host to get the OSD back in again. As far as I
> can see this is some docker problem?
>
> root@ceph05:~# docker ps | grep osd.47 => no hit
> root@ceph05:~# docker network prune => does not solve the problem
> Any hint on that?
>
> Thanks,
> Michael
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx