OSD down after failed update from octopus/15.2.13

Hello,

Most probably my Ceph cluster is damaged beyond repair. Nevertheless, it would be very nice to understand in a bit more detail why.
In "short":

0. Cluster properties:
- Very small / near minimal: 1 host for all services
- CentOS 8, running as a VM
- octopus/15.2.13
- OSDs: 3x HDD (5.5 TB), 2x SSD (256 GB, for metadata)
- RADOS content: 1x CephFS, 1x RBD

1. The original problem, which led to the need for an upgrade:
- Both ceph-fuse and the CephFS kernel driver ran unstably (stable for 30 s to 3 min, then a problem: caja stalling). The bug should have been fixed by an earlier release (see https://docs.ceph.com/en/latest/releases/nautilus/ : "cephfs: client: reset requested_max_size if file write is not wanted (pr#34767, “Yan, Zheng”)")
- So I tried to upgrade to pacific/16.2.7

2. Update:
- The first attempt with cephadm failed because only one mgr was deployed (just for info: pacific and newer stop the upgrade right at the beginning if mgr < 2) - the download of the new podman image did not start at all (0% progress in ceph -s); the orchestrated path is sketched below for comparison
- HERE my mistake begins:
-> I tried to update the podman images myself
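
For reference, the orchestrated upgrade would have looked roughly like this (a sketch, assuming the orchestrator responds, which it did not in my case):
   # satisfy the "at least two mgr daemons" precondition first
   ceph orch apply mgr 2
   # then let cephadm pull the image and upgrade daemon by daemon
   ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.7
   ceph orch upgrade status   # progress also shows up in "ceph -s"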

3. The problematic manual update:
- obtained the podman image: podman pull quay.io/ceph/ceph:v16.2.7
- changed all image references in /var/lib/ceph/<fsid>/osd.<0..4>/... :
  imgold=docker.io/ceph/ceph:v15
  imgnew=quay.io/ceph/ceph:v16.2.7
  for servicename in osd.0 osd.1 osd.2 osd.3 osd.4; do
    for unitfile in unit.image unit.run unit.poststop; do
      sed -i "s|$imgold|$imgnew|g" /var/lib/ceph/<fsid>/$servicename/$unitfile
    done
  done
- at first it worked, for one boot:
 "ceph versions" showed only v16.2.7
- after the next boot, none of the "ceph" commands responded anymore
- tried to revert to v15 by copying /var/lib/ceph/<fsid> back from backup
- HERE the next mistake appeared: cp instead of rsync, which broke the ownership
 -> compare https://tracker.ceph.com/issues/17722 (faulty ownership)
- then tried to revert to v15 by rsync'ing /var/lib/ceph/<fsid> from backup with ownership preserved (sketched below)
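
An ownership-preserving restore would look roughly like this (a sketch; the backup path is illustrative):
   # -a preserves owner/group/permissions/timestamps (when run as root),
   # -H keeps hard links, --numeric-ids avoids any uid/gid remapping
   rsync -aH --numeric-ids /backup/var/lib/ceph/<fsid>/ /var/lib/ceph/<fsid>/
   # verify: the files should again be owned by the ceph uid/gid the containers expect
   ls -ln /var/lib/ceph/<fsid>/osd.0/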

4. The current behaviour:
[ the behaviour is the *same* if the CentOS 8 VM is restored from backup ]
 (an earlier test had shown that reverting the VM led to the cluster becoming healthy again after some time)
- ceph -s responds again (all services except the OSDs are working again on v15)
- all 5 OSD services are down
- here is an example from osd.0
 journalctl -u ceph-<fsid>@osd.0.service:
  Jan 30 06:51:26 lager bash[52907]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0
  Jan 30 06:51:26 lager bash[52907]: Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-665a13db-ebe8-458b-ab4d-0f2b138106f8/osd-block-6b1e367b-44dd-415b-bc93-04e234a59d9e --path /var/lib/ceph/osd/ceph-0 --no-mon-config
  Jan 30 06:51:26 lager bash[52907]: Running command: /usr/bin/ln -snf /dev/ceph-665a13db-ebe8-458b-ab4d-0f2b138106f8/osd-block-6b1e367b-44dd-415b-bc93-04e234a59d9e /var/lib/ceph/osd/ceph-0/block
  Jan 30 06:51:26 lager bash[52907]: Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-0/block
  Jan 30 06:51:26 lager bash[52907]: Running command: /usr/bin/chown -R ceph:ceph /dev/dm-5
  Jan 30 06:51:26 lager bash[52907]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0
  Jan 30 06:51:26 lager bash[52907]: --> ceph-volume lvm activate successful for osd ID: 0
  Jan 30 06:51:26 lager bash[52907]: 6e0819695af4f188abebc7a5ce576473a195cb9abefb042cd6df2529853effd8
  Jan 30 06:51:26 lager systemd[1]: Started Ceph osd.0 for <fsid>.
  Jan 30 06:51:27 lager systemd[1]: ceph-<fsid>@osd.0.service: Main process exited, code=exited, status=1/FAILURE
  Jan 30 06:51:28 lager bash[53527]: Error: Failed to evict container: "": Failed to find container "ceph-<fsid>-osd.0-deactivate" in state: no container with name or ID ceph-<fsid>-osd.0-deactivate found: no such container
  Jan 30 06:51:28 lager bash[53527]: Error: no container with ID or name "ceph-<fsid>-osd.0-deactivate" found: no such container
  Jan 30 06:51:28 lager systemd[1]: ceph-<fsid>@osd.0.service: Failed with result 'exit-code'.
  Jan 30 06:51:38 lager systemd[1]: ceph-<fsid>@osd.0.service: Service RestartSec=10s expired, scheduling restart.
  Jan 30 06:51:38 lager systemd[1]: ceph-<fsid>@osd.0.service: Scheduled restart job, restart counter is at 5.
  Jan 30 06:51:38 lager systemd[1]: Stopped Ceph osd.0 for <fsid>.
  Jan 30 06:51:38 lager systemd[1]: ceph-<fsid>@osd.0.service: Start request repeated too quickly.
  Jan 30 06:51:38 lager systemd[1]: ceph-<fsid>@osd.0.service: Failed with result 'exit-code'.
  Jan 30 06:51:38 lager systemd[1]: Failed to start Ceph osd.0 for <fsid>.
- this looks similar to https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/KFJDV2JJRUOQSJHRKLEIFQB7THUXDS54/
  but in my case: ceph-<fsid>@osd.0.service: Main process exited, code=exited, status=1/FAILURE
- at some stage I tried to remove the container (maybe the worst of my actions), hoping it would be recreated (the ID may have been different):
   podman rm 6e0819695af4f188abebc7a5ce576473a195cb9abefb042cd6df2529853effd8
- a strange mismatch between container and OSD status (the commands behind these views are sketched below):
 ceph -s ... "2 osds down", "osd: 5 osds: 1 up"
 ceph osd tree ... 4 out of 5 down
 systemctl list-units --all --state=failed ... 3 OSD services failed
 journalctl ... all OSDs failed: "Failed to start Ceph osd.<0..4> for <fsid>."
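
To cross-check the different views of OSD state, roughly these commands can be compared (illustrative; cephadm logs is just a wrapper around journalctl for a single daemon):
   ceph osd tree                               # the cluster map's view
   systemctl list-units 'ceph-*@osd.*' --all   # systemd's view on the host
   podman ps -a                                # the container runtime's view
   cephadm logs --fsid <fsid> --name osd.0     # per-daemon journal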

5. Questions:
a) Is there any way to fix my botched upgrade and get the OSDs running again?
b) Are a set of intact OSDs and a backup of the whole VM enough to bring the cluster up again? Or in other words: in which places is the data stored that is necessary for disaster recovery?
    (e.g. podman images, config files, local databases, ...; my current guess is sketched below)
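
My current understanding of where a cephadm host keeps the relevant state (please correct me if this is wrong):
   /etc/ceph/                          # ceph.conf and client keyrings
   /var/lib/ceph/<fsid>/               # per-daemon dirs: mon store, unit.* files, keyrings
   LVM volumes (ceph-*/osd-block-*)    # the actual BlueStore OSD data
   podman images                       # replaceable, can be re-pulled from the registry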


Thanks a lot!

Best regards
Flo

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



