Hello,
most probably my Ceph cluster is damaged and cannot be repaired.
Nevertheless, it would be very nice to understand in a bit more detail
why.
In "short" words:
0. Cluster properties:
- Very small / near-minimal: 1 host for all services
- CentOS 8, running as a VM
- octopus/15.2.13
- OSDs: 3x HDD (5.5 TB), 2x SSD (256 GB, for metadata)
- RADOS content: 1x CephFS, 1x RBD
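For context, in cephadm terms this layout would roughly correspond to an
OSD service spec like the following (just a sketch; the service_id and
file name are made up, and the exact spec format may differ per release):
# hypothetical spec: HDDs as data devices, SSDs for the bluestore DB
cat > osd_spec.yml <<'EOF'
service_type: osd
service_id: hdd_data_ssd_db
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
EOF
ceph orch apply -i osd_spec.yml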
1. The original problem, which led to the need for an update:
- ceph-fuse and also the CephFS kernel driver were not running stably
(stable for 30 s - 3 m, then the problem: caja stalling)
- the error should be fixed by a newer release (see
https://docs.ceph.com/en/latest/releases/nautilus/ ):
"cephfs: client: reset requested_max_size if file write is not wanted
(pr#34767, “Yan, Zheng”)"
- so I tried to update to pacific/16.2.7
2. The update:
- the first attempt with cephadm failed because there was only 1x mgr
(just for info: pacific and newer stop the update right at the start
if mgr < 2; the documented way is sketched below)
- the download of the new podman image did not start at all (0% progress
in ceph -s)
- HERE my fault begins:
-> I tried to update the podman images myself
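For reference, the documented cephadm path would have been roughly the
following (a sketch, assuming the orchestrator still responds and that
cephadm can co-locate a second mgr on the single host):
# scale the mgr service to two daemons so the upgrade check passes
ceph orch apply mgr 2
# start the orchestrated upgrade to the target image
ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.7
# watch the progress
ceph orch upgrade status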
3. The problematic manual update:
- obtaining podman image: podman pull quay.io/ceph/ceph:v16.2.7
- changing all image versions in /var/lib/ceph/<fsid>/osd.<0..4>/... :
# old and new container image references
imgold=docker.io/ceph/ceph:v15
imgnew=quay.io/ceph/ceph:v16.2.7
# repeated for each OSD service
servicename=osd.<0..4>
# rewrite the image reference in the cephadm unit files
sed -i "s|$imgold|$imgnew|g" /var/lib/ceph/<fsid>/$servicename/unit.image
sed -i "s|$imgold|$imgnew|g" /var/lib/ceph/<fsid>/$servicename/unit.run
sed -i "s|$imgold|$imgnew|g" /var/lib/ceph/<fsid>/$servicename/unit.poststop
- fortunately it worked for one boot:
"ceph versions" showed only v16.2.7
- after the next boot, none of the "ceph" commands responded any more
- trying to revert to v15 by copying /var/lib/ceph/<fsid> back from a
backup
- HERE the next fault appeared: I used cp instead of rsync
-> compare https://tracker.ceph.com/issues/17722 (cp broke the file
ownership)
- then trying to revert to v15 by rsync'ing /var/lib/ceph/<fsid> from the
backup with ownership preserved (see the sketch below)
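An ownership-preserving restore would look roughly like this (a sketch;
the backup location /backup is just an example path):
# -a preserves ownership, permissions, symlinks and timestamps;
# --numeric-ids keeps the raw uid/gid instead of remapping by name
rsync -a --numeric-ids /backup/var/lib/ceph/<fsid>/ /var/lib/ceph/<fsid>/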
4. The actual behaviour:
[ *same* behaviour if the CentOS 8 VM is restored from backup ]
(an earlier test showed that reverting the VM led to the cluster being
healthy again after some time)
- ceph -s is running (all services except the OSDs work again in v15)
- all 5 OSD services are down
- here is an example from osd.0
journalctl -u ceph-<fsid>@osd.0.service:
Jan 30 06:51:26 lager bash[52907]: Running command: /usr/bin/chown -R
ceph:ceph /var/lib/ceph/osd/ceph-0
Jan 30 06:51:26 lager bash[52907]: Running command:
/usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev
/dev/ceph-665a13db-ebe8-458b-ab4d-0f2b138106f8/osd-block-6b1e367b-44dd-415b-bc93-04e234a59d9e
--path /var/lib/ceph/osd/ceph-0 --no-mon-config
Jan 30 06:51:26 lager bash[52907]: Running command: /usr/bin/ln -snf
/dev/ceph-665a13db-ebe8-458b-ab4d-0f2b138106f8/osd-block-6b1e367b-44dd-415b-bc93-04e234a59d9e
/var/lib/ceph/osd/ceph-0/block
Jan 30 06:51:26 lager bash[52907]: Running command: /usr/bin/chown -h
ceph:ceph /var/lib/ceph/osd/ceph-0/block
Jan 30 06:51:26 lager bash[52907]: Running command: /usr/bin/chown -R
ceph:ceph /dev/dm-5
Jan 30 06:51:26 lager bash[52907]: Running command: /usr/bin/chown -R
ceph:ceph /var/lib/ceph/osd/ceph-0
Jan 30 06:51:26 lager bash[52907]: --> ceph-volume lvm activate
successful for osd ID: 0
Jan 30 06:51:26 lager bash[52907]:
6e0819695af4f188abebc7a5ce576473a195cb9abefb042cd6df2529853effd8
Jan 30 06:51:26 lager systemd[1]: Started Ceph osd.0 for <fsid>.
Jan 30 06:51:27 lager systemd[1]: ceph-<fsid>@osd.0.service: Main
process exited, code=exited, status=1/FAILURE
Jan 30 06:51:28 lager bash[53527]: Error: Failed to evict container:
"": Failed to find container "ceph-<fsid>-osd.0-deactivate" in state: no
container with name or ID ceph-<fsid>-osd.0-deactivate found: no such
container
Jan 30 06:51:28 lager bash[53527]: Error: no container with ID or
name "ceph-<fsid>-osd.0-deactivate" found: no such container
Jan 30 06:51:28 lager systemd[1]: ceph-<fsid>@osd.0.service: Failed
with result 'exit-code'.
Jan 30 06:51:38 lager systemd[1]: ceph-<fsid>@osd.0.service: Service
RestartSec=10s expired, scheduling restart.
Jan 30 06:51:38 lager systemd[1]: ceph-<fsid>@osd.0.service:
Scheduled restart job, restart counter is at 5.
Jan 30 06:51:38 lager systemd[1]: Stopped Ceph osd.0 for <fsid>.
Jan 30 06:51:38 lager systemd[1]: ceph-<fsid>@osd.0.service: Start
request repeated too quickly.
Jan 30 06:51:38 lager systemd[1]: ceph-<fsid>@osd.0.service: Failed
with result 'exit-code'.
Jan 30 06:51:38 lager systemd[1]: Failed to start Ceph osd.0 for <fsid>.
- this looks similar to:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/KFJDV2JJRUOQSJHRKLEIFQB7THUXDS54/
but in my case it is: ceph-<fsid>@osd.0.service: Main process exited,
code=exited, status=1/FAILURE
- I tried to remove the container (maybe the worst part of my actions) at
some stage (the ID may differ), hoping it would be recreated:
podman rm
6e0819695af4f188abebc7a5ce576473a195cb9abefb042cd6df2529853effd8
- a strange mismatch between the container and OSD status reports:
ceph -s ... "2 osds down", "osd: 5 osds: 1 up"
ceph osd tree ... 4 out of 5 down
systemctl list-units --all --state=failed ... 3 osd services failed
journalctl ... all OSDs failed: "Failed to start Ceph osd.<0..4> for
<fsid>."
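Since systemd only reports status=1/FAILURE and cephadm typically starts
the containers with --rm (so no container logs remain after the exit),
one way to capture the real OSD error might be to run the unit script in
the foreground (a sketch, assuming the standard cephadm layout where the
unit.run from above is what the systemd unit executes):
# stop the flapping unit and clear systemd's restart-rate limiter
systemctl stop ceph-<fsid>@osd.0.service
systemctl reset-failed ceph-<fsid>@osd.0.service
# run the OSD container in the foreground; stderr should show the reason
bash -x /var/lib/ceph/<fsid>/osd.0/unit.run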
5. Questions:
a) Is there any way to fix my broken update and get the OSDs running
again?
b) Are a set of intact OSDs and a backup of the whole VM enough to bring
the cluster up again?
Or in other words: in which places is the data stored that is necessary
for disaster recovery?
(e.g. podman images, config files, local databases, ...)
Thanks a lot!
Best regards
Flo