Re: OSD stuck down

Hi, I have seen this behaviour when the OSD host's cluster network interface
was down while the public interface was up; the "Falling back to public
interface" message in your osd.34 log points in that direction. I suggest
checking the network interfaces and the connectivity on the cluster network.
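
For example, something along these lines (standard iproute2 and ceph CLI
commands; the values in angle brackets are just placeholders to replace with
your own addresses):

# ip -br addr show
# ceph config get osd cluster_network
# ceph config get osd public_network
# ping -I <cluster-net IP of balin> <cluster-net IP of another OSD host>

The first command shows whether both NICs are up with the expected
addresses, the config queries confirm which subnets the OSDs expect, and the
ping verifies that balin can actually reach the other OSD hosts over the
cluster network.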

Regards!

On Thu, Jun 15, 2023 at 11:08 AM Nicola Mori <mori@xxxxxxxxxx> wrote:

> I have restarted all the monitors and managers, but the OSD still
> remains down. However, I found that cephadm actually sees it running:
>
> # ceph orch ps | grep osd.34
> osd.34                     balin                 running (14m)   108s ago   8M    75.3M     793M  17.2.6   b1a23658afad  5b9dbea262c7
>
> # ceph osd tree | grep 34
>   34    hdd    1.81940          osd.34      down         0  1.00000
>
>
> I really need help with this since I don't know what else to check.
> Thanks in advance,
>
> Nicola
>
>
> On 13/06/23 08:35, Nicola Mori wrote:
> > Dear Ceph users,
> >
> > after a host reboot one of the OSDs is now stuck down (and out). I tried
> > several times to restart it and even to reboot the host, but it still
> > remains down.
> >
> > # ceph -s
> >    cluster:
> >      id:     b1029256-7bb3-11ec-a8ce-ac1f6b627b45
> >      health: HEALTH_WARN
> >              4 OSD(s) have spurious read errors
> >              (muted: OSD_SLOW_PING_TIME_BACK OSD_SLOW_PING_TIME_FRONT)
> >
> >    services:
> >      mon: 5 daemons, quorum bofur,balin,aka,romolo,dwalin (age 16h)
> >      mgr: bofur.tklnrn(active, since 16h), standbys: aka.wzystq, balin.hvunfe
> >      mds: 2/2 daemons up, 1 standby
> >      osd: 104 osds: 103 up (since 16h), 103 in (since 13h); 4 remapped pgs
> >
> >    data:
> >      volumes: 1/1 healthy
> >      pools:   3 pools, 529 pgs
> >      objects: 18.85M objects, 41 TiB
> >      usage:   56 TiB used, 139 TiB / 195 TiB avail
> >      pgs:     68130/150150628 objects misplaced (0.045%)
> >               522 active+clean
> >               4   active+remapped+backfilling
> >               3   active+clean+scrubbing+deep
> >
> >    io:
> >      recovery: 46 MiB/s, 21 objects/s
> >
> >
> >
> > The host is reachable (its other OSDs are in) and from the systemd logs
> > of the OSD I don't see anything wrong:
> >
> > $ sudo systemctl status ceph-b1029256-7bb3-11ec-a8ce-ac1f6b627b45@osd.34
> > ● ceph-b1029256-7bb3-11ec-a8ce-ac1f6b627b45@osd.34.service - Ceph osd.34 for b1029256-7bb3-11ec-a8ce-ac1f6b627b45
> >     Loaded: loaded (/etc/systemd/system/ceph-b1029256-7bb3-11ec-a8ce-ac1f6b627b45@.service; enabled; vendor preset: disabled)
> >     Active: active (running) since Mon 2023-06-12 17:00:25 CEST; 15h ago
> >   Main PID: 36286 (bash)
> >      Tasks: 11 (limit: 152154)
> >     Memory: 20.0M
> >     CGroup: /system.slice/system-ceph\x2db1029256\x2d7bb3\x2d11ec\x2da8ce\x2dac1f6b627b45.slice/ceph-b1029256-7bb3-11ec-a8ce-ac1f6b627b45@osd.34.service
> >             ├─36286 /bin/bash /var/lib/ceph/b1029256-7bb3-11ec-a8ce-ac1f6b627b45/osd.34/unit.run
> >             └─36657 /usr/bin/docker run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint /usr/bin/ceph-osd --privileged --group-add=disk --init --name ceph-b1029256-7bb3-11ec-a8ce-ac1f6b627b45-osd-34 --pids-limit=0 -e CONTAINER_IMAGE=snack14/ceph-wizard@sha>
> >
> > Jun 12 17:00:25 balin systemd[1]: Started Ceph osd.34 for b1029256-7bb3-11ec-a8ce-ac1f6b627b45.
> > Jun 12 17:00:27 balin bash[36306]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-34
> > Jun 12 17:00:27 balin bash[36306]: Running command: /usr/bin/ceph-bluestore-tool prime-osd-dir --path /var/lib/ceph/osd/ceph-34 --no-mon-config --dev /dev/mapper/ceph--9a4c3927--d3da--4b49--80fe--6cdc00c7897c-osd--block--36d2f793--e5c7--4247--a314--bcc40389d50d
> > Jun 12 17:00:27 balin bash[36306]: Running command: /usr/bin/chown -h ceph:ceph /dev/mapper/ceph--9a4c3927--d3da--4b49--80fe--6cdc00c7897c-osd--block--36d2f793--e5c7--4247--a314--bcc40389d50d
> > Jun 12 17:00:27 balin bash[36306]: Running command: /usr/bin/chown -R ceph:ceph /dev/dm-6
> > Jun 12 17:00:27 balin bash[36306]: Running command: /usr/bin/ln -s /dev/mapper/ceph--9a4c3927--d3da--4b49--80fe--6cdc00c7897c-osd--block--36d2f793--e5c7--4247--a314--bcc40389d50d /var/lib/ceph/osd/ceph-34/block
> > Jun 12 17:00:27 balin bash[36306]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-34
> > Jun 12 17:00:27 balin bash[36306]: --> ceph-volume raw activate successful for osd ID: 34
> > Jun 12 17:00:29 balin bash[36657]: debug 2023-06-12T15:00:29.066+0000 7f818e356540 -1 Falling back to public interface
> >
> >
> > I'd need some help to understand how to fix this.
> > Thank you,
> >
> > Nicola
>
> --
> Nicola Mori, Ph.D.
> INFN sezione di Firenze
> Via Bruno Rossi 1, 50019 Sesto F.no (Italy)
> +390554572660
> mori@xxxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



