Hi,
did you check the MON logs? They should contain some information about
why the OSD was marked down and out. You could also just try to mark it
in yourself and see if anything changes:

$ ceph osd in 34
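For the MON logs, something like this should work (mon.balin is just an
example, pick any of your MONs from 'ceph mon dump'):

$ cephadm logs --name mon.balin | grep osd.34

The cluster log should also show when and why osd.34 was marked down, e.g.:

$ ceph log last 1000 info cluster | grep osd.34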
I would also take another look at the OSD logs:

$ cephadm logs --name osd.34
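The cluster-side view might help to narrow it down as well, e.g. to confirm
how osd.34 is currently flagged and whether anything else stands out:

$ ceph osd tree down
$ ceph osd dump | grep '^osd.34 '
$ ceph health detail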
Quoting Nicola Mori <mori@xxxxxxxxxx>:
Dear Ceph users,
after a host reboot one of the OSDs is now stuck down (and out). I
tried several times to restart it and even to reboot the host, but
it still remains down.
# ceph -s
  cluster:
    id:     b1029256-7bb3-11ec-a8ce-ac1f6b627b45
    health: HEALTH_WARN
            4 OSD(s) have spurious read errors
            (muted: OSD_SLOW_PING_TIME_BACK OSD_SLOW_PING_TIME_FRONT)

  services:
    mon: 5 daemons, quorum bofur,balin,aka,romolo,dwalin (age 16h)
    mgr: bofur.tklnrn(active, since 16h), standbys: aka.wzystq, balin.hvunfe
    mds: 2/2 daemons up, 1 standby
    osd: 104 osds: 103 up (since 16h), 103 in (since 13h); 4 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   3 pools, 529 pgs
    objects: 18.85M objects, 41 TiB
    usage:   56 TiB used, 139 TiB / 195 TiB avail
    pgs:     68130/150150628 objects misplaced (0.045%)
             522 active+clean
             4   active+remapped+backfilling
             3   active+clean+scrubbing+deep

  io:
    recovery: 46 MiB/s, 21 objects/s
The host is reachable (its other OSDs are in), and I don't see anything
wrong in the systemd logs of the OSD:
$ sudo systemctl status ceph-b1029256-7bb3-11ec-a8ce-ac1f6b627b45@osd.34
● ceph-b1029256-7bb3-11ec-a8ce-ac1f6b627b45@osd.34.service - Ceph osd.34 for b1029256-7bb3-11ec-a8ce-ac1f6b627b45
   Loaded: loaded (/etc/systemd/system/ceph-b1029256-7bb3-11ec-a8ce-ac1f6b627b45@.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2023-06-12 17:00:25 CEST; 15h ago
 Main PID: 36286 (bash)
    Tasks: 11 (limit: 152154)
   Memory: 20.0M
   CGroup: /system.slice/system-ceph\x2db1029256\x2d7bb3\x2d11ec\x2da8ce\x2dac1f6b627b45.slice/ceph-b1029256-7bb3-11ec-a8ce-ac1f6b627b45@osd.34.service
           ├─36286 /bin/bash /var/lib/ceph/b1029256-7bb3-11ec-a8ce-ac1f6b627b45/osd.34/unit.run
           └─36657 /usr/bin/docker run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint /usr/bin/ceph-osd --privileged --group-add=disk --init --name ceph-b1029256-7bb3-11ec-a8ce-ac1f6b627b45-osd-34 --pids-limit=0 -e CONTAINER_IMAGE=snack14/ceph-wizard@sha>

Jun 12 17:00:25 balin systemd[1]: Started Ceph osd.34 for b1029256-7bb3-11ec-a8ce-ac1f6b627b45.
Jun 12 17:00:27 balin bash[36306]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-34
Jun 12 17:00:27 balin bash[36306]: Running command: /usr/bin/ceph-bluestore-tool prime-osd-dir --path /var/lib/ceph/osd/ceph-34 --no-mon-config --dev /dev/mapper/ceph--9a4c3927--d3da--4b49--80fe--6cdc00c7897c-osd--block--36d2f793--e5c7--4247--a314--bcc40389d50d
Jun 12 17:00:27 balin bash[36306]: Running command: /usr/bin/chown -h ceph:ceph /dev/mapper/ceph--9a4c3927--d3da--4b49--80fe--6cdc00c7897c-osd--block--36d2f793--e5c7--4247--a314--bcc40389d50d
Jun 12 17:00:27 balin bash[36306]: Running command: /usr/bin/chown -R ceph:ceph /dev/dm-6
Jun 12 17:00:27 balin bash[36306]: Running command: /usr/bin/ln -s /dev/mapper/ceph--9a4c3927--d3da--4b49--80fe--6cdc00c7897c-osd--block--36d2f793--e5c7--4247--a314--bcc40389d50d /var/lib/ceph/osd/ceph-34/block
Jun 12 17:00:27 balin bash[36306]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-34
Jun 12 17:00:27 balin bash[36306]: --> ceph-volume raw activate successful for osd ID: 34
Jun 12 17:00:29 balin bash[36657]: debug 2023-06-12T15:00:29.066+0000 7f818e356540 -1 Falling back to public interface
I'd appreciate some help understanding how to fix this.
Thank you,
Nicola
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx