Hello,
most probably my Ceph cluster is damaged and cannot be repaired.
Nevertheless, it would be very nice to understand in a bit more detail
why.
In "short" words:
0. Cluster properties:
- Very small / near-minimal: 1 host for all services
- CentOS 8, running as a VM
- octopus/15.2.13
- OSDs: 3x HDD (5.5 TB), 2x SSD (256 GB, for metadata)
- RADOS content: 1x CephFS, 1x RBD
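For context, in cephadm terms this layout would roughly correspond to an
OSD service spec like the following (just a sketch; the service_id and
file name are made up, and the exact spec format may differ per release):
# hypothetical spec: HDDs as data devices, SSDs for the bluestore DB
cat > osd_spec.yml <<'EOF'
service_type: osd
service_id: hdd_data_ssd_db
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
EOF
ceph orch apply -i osd_spec.yml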
1. The original problem, which led to the need for an update:
- ceph-fuse and also the CephFS kernel driver were not running stably
(stable for 30 s - 3 m, then the problem: caja stalling)
- the error should be fixed by a newer release (see
https://docs.ceph.com/en/latest/releases/nautilus/ ):
"cephfs: client: reset requested_max_size if file write is not wanted
(pr#34767, “Yan, Zheng”)"
- so I tried to update to pacific/16.2.7
2. The update:
- the first attempt with cephadm failed because there was only 1x mgr
(just for info: pacific and newer stop the update right at the start
if mgr < 2; the documented way is sketched below)
- the download of the new podman image did not start at all (0% progress
in ceph -s)
- HERE my fault begins:
-> I tried to update the podman images myself
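For reference, the documented cephadm path would have been roughly the
following (a sketch, assuming the orchestrator still responds and that
cephadm can co-locate a second mgr on the single host):
# scale the mgr service to two daemons so the upgrade check passes
ceph orch apply mgr 2
# start the orchestrated upgrade to the target image
ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.7
# watch the progress
ceph orch upgrade status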
3. The problematic manual update:
- obtaining podman image: podman pull quay.io/ceph/ceph:v16.2.7
- changing all image versions in /var/lib/ceph/<fsid>/osd.<0..4>/... :
# old and new container image references
imgold=docker.io/ceph/ceph:v15
imgnew=quay.io/ceph/ceph:v16.2.7
# repeated for each OSD service
servicename=osd.<0..4>
# rewrite the image reference in the cephadm unit files
sed -i "s|$imgold|$imgnew|g" /var/lib/ceph/<fsid>/$servicename/unit.image
sed -i "s|$imgold|$imgnew|g" /var/lib/ceph/<fsid>/$servicename/unit.run
sed -i "s|$imgold|$imgnew|g" /var/lib/ceph/<fsid>/$servicename/unit.poststop
- fortunately it worked for one boot:
"ceph versions" showed only v16.2.7
- after the next boot, none of the "ceph" commands responded any more
- trying to revert to v15 by copying /var/lib/ceph/<fsid> back from a
backup
- HERE the next fault appeared: I used cp instead of rsync
-> compare https://tracker.ceph.com/issues/17722 (cp broke the file
ownership)
- then trying to revert to v15 by rsync'ing /var/lib/ceph/<fsid> from the
backup with ownership preserved (see the sketch below)
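An ownership-preserving restore would look roughly like this (a sketch;
the backup location /backup is just an example path):
# -a preserves ownership, permissions, symlinks and timestamps;
# --numeric-ids keeps the raw uid/gid instead of remapping by name
rsync -a --numeric-ids /backup/var/lib/ceph/<fsid>/ /var/lib/ceph/<fsid>/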
4. The actual behaviour:
[ *same* behaviour if the CentOS 8 VM is restored from backup ]
(an earlier test showed that reverting the VM led to the cluster being
healthy again after some time)
- ceph -s is running (all services except the OSDs work again in v15)
- all 5 OSD services are down
- here is an example from osd.0
journalctl -u ceph-<fsid>@osd.0.service:
Jan 30 06:51:26 lager bash[52907]: Running command: /usr/bin/chown -R
ceph:ceph /var/lib/ceph/osd/ceph-0
Jan 30 06:51:26 lager bash[52907]: Running command:
/usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev
/dev/ceph-665a13db-ebe8-458b-ab4d-0f2b138106f8/osd-block-6b1e367b-44dd-415b-bc93-04e234a59d9e
--path /var/lib/ceph/osd/ceph-0 --no-mon-config
Jan 30 06:51:26 lager bash[52907]: Running command: /usr/bin/ln -snf
/dev/ceph-665a13db-ebe8-458b-ab4d-0f2b138106f8/osd-block-6b1e367b-44dd-415b-bc93-04e234a59d9e
/var/lib/ceph/osd/ceph-0/block
Jan 30 06:51:26 lager bash[52907]: Running command: /usr/bin/chown -h
ceph:ceph /var/lib/ceph/osd/ceph-0/block
Jan 30 06:51:26 lager bash[52907]: Running command: /usr/bin/chown -R
ceph:ceph /dev/dm-5
Jan 30 06:51:26 lager bash[52907]: Running command: /usr/bin/chown -R
ceph:ceph /var/lib/ceph/osd/ceph-0
Jan 30 06:51:26 lager bash[52907]: --> ceph-volume lvm activate
successful for osd ID: 0
Jan 30 06:51:26 lager bash[52907]:
6e0819695af4f188abebc7a5ce576473a195cb9abefb042cd6df2529853effd8
Jan 30 06:51:26 lager systemd[1]: Started Ceph osd.0 for <fsid>.
Jan 30 06:51:27 lager systemd[1]: ceph-<fsid>@osd.0.service: Main
process exited, code=exited, status=1/FAILURE
Jan 30 06:51:28 lager bash[53527]: Error: Failed to evict container:
"": Failed to find container "ceph-<fsid>-osd.0-deactivate" in state: no
container with name or ID ceph-<fsid>-osd.0-deactivate found: no such
container
Jan 30 06:51:28 lager bash[53527]: Error: no container with ID or
name "ceph-<fsid>-osd.0-deactivate" found: no such container
Jan 30 06:51:28 lager systemd[1]: ceph-<fsid>@osd.0.service: Failed
with result 'exit-code'.
Jan 30 06:51:38 lager systemd[1]: ceph-<fsid>@osd.0.service: Service
RestartSec=10s expired, scheduling restart.
Jan 30 06:51:38 lager systemd[1]: ceph-<fsid>@osd.0.service:
Scheduled restart job, restart counter is at 5.
Jan 30 06:51:38 lager systemd[1]: Stopped Ceph osd.0 for <fsid>.
Jan 30 06:51:38 lager systemd[1]: ceph-<fsid>@osd.0.service: Start
request repeated too quickly.
Jan 30 06:51:38 lager systemd[1]: ceph-<fsid>@osd.0.service: Failed
with result 'exit-code'.
Jan 30 06:51:38 lager systemd[1]: Failed to start Ceph osd.0 for <fsid>.
- this looks similar to:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/KFJDV2JJRUOQSJHRKLEIFQB7THUXDS54/
but in my case it is: ceph-<fsid>@osd.0.service: Main process exited,
code=exited, status=1/FAILURE
- I tried to remove the container (maybe the worst part of my actions) at
some stage (the ID may differ), hoping it would be recreated:
podman rm
6e0819695af4f188abebc7a5ce576473a195cb9abefb042cd6df2529853effd8
- a strange mismatch between the container and OSD status reports:
ceph -s ... "2 osds down", "osd: 5 osds: 1 up"
ceph osd tree ... 4 out of 5 down
systemctl list-units --all --state=failed ... 3 osd services failed
journalctl ... all OSDs failed: "Failed to start Ceph osd.<0..4> for
<fsid>."
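Since systemd only reports status=1/FAILURE and cephadm typically starts
the containers with --rm (so no container logs remain after the exit),
one way to capture the real OSD error might be to run the unit script in
the foreground (a sketch, assuming the standard cephadm layout where the
unit.run from above is what the systemd unit executes):
# stop the flapping unit and clear systemd's restart-rate limiter
systemctl stop ceph-<fsid>@osd.0.service
systemctl reset-failed ceph-<fsid>@osd.0.service
# run the OSD container in the foreground; stderr should show the reason
bash -x /var/lib/ceph/<fsid>/osd.0/unit.run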
5. Questions:
a) Is there any way to fix my broken update and get the OSDs running
again?
b) Are a set of intact OSDs and a backup of the whole VM enough to bring
the cluster up again?
Or in other words: in which places is the data stored that is necessary
for disaster recovery?
(e.g. podman images, config files, local databases, ...)
Thanks a lot!
Best regards
Flo