Re: Ceph Luminous - OSD constantly crashing caused by corrupted placement group


 



Looks like something went a little wrong with the snapshot metadata in that PG. If the PG is still going active from the other copies, you're probably best off using ceph-objectstore-tool to remove it on the crashing OSD. You could then either replace it with an export from one of the other nodes, or let Ceph handle the backfilling on its own.
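
The two options above can be sketched roughly as follows. This is a hedged sketch, not a verified procedure: the OSD ids (12 for the crashing OSD, 7 for a healthy peer) and paths are placeholders, the affected OSD must be stopped before ceph-objectstore-tool touches its store, and FileStore OSDs may additionally need --journal-path:

# Stop the crashing OSD first (hypothetical id 12)
systemctl stop ceph-osd@12

# Option A: export the healthy copy of PG 5.9b from another OSD
# (also stopped), remove the corrupted copy, and import the export.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 \
    --pgid 5.9b --op export --file /tmp/pg5.9b.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --pgid 5.9b --op remove
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --pgid 5.9b --op import --file /tmp/pg5.9b.export

# Option B: just remove the corrupted copy and let Ceph backfill
# from the surviving replicas.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --pgid 5.9b --op remove

systemctl start ceph-osd@12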
-Greg

On Tue, May 15, 2018 at 2:13 AM Siegfried Höllrigl <siegfried.hoellrigl@xxxxxxxxxx> wrote:


Hi !

We have upgraded our Ceph cluster (3 MON servers, 9 OSD servers, 190
OSDs total) from 10.2.10 to Ceph 12.2.4 and then to 12.2.5.
(A mixture of Ubuntu 14 and 16 with the Repos from
https://download.ceph.com/debian-luminous/)

Now we have the problem that one OSD is crashing again and again
(approx. once per day). systemd restarts it.

We could now probably identify the problem. It looks like one placement
group (5.9b) causes the crash.
It doesn't seem to matter whether it is running on a FileStore or a
BlueStore OSD.
We could even narrow it down to some RBDs that were in this pool.
They are already deleted, but it looks like some objects are left on
the OSD, and we can't delete them:


rados -p rbd ls > radosrbdls.txt
grep -vE "($(rados -p rbd ls | grep rbd_header | grep -o "\.[0-9a-f]*" |
sed -e :a -e '$!N; s/\n/|/; ta' -e 's/\./\\./g'))" radosrbdls.txt |
grep -E '(rbd_data|journal|rbd_object_map)'
rbd_data.112913b238e1f29.0000000000000e3f
rbd_data.112913b238e1f29.00000000000009d2
rbd_data.112913b238e1f29.0000000000000ba3

rados -p rbd rm rbd_data.112913b238e1f29.0000000000000e3f
error removing rbd>rbd_data.112913b238e1f29.0000000000000e3f: (2) No
such file or directory
rados -p rbd rm rbd_data.112913b238e1f29.00000000000009d2
error removing rbd>rbd_data.112913b238e1f29.00000000000009d2: (2) No
such file or directory
rados -p rbd rm rbd_data.112913b238e1f29.0000000000000ba3
error removing rbd>rbd_data.112913b238e1f29.0000000000000ba3: (2) No
such file or directory
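
Since the RBDs were deleted, a possible explanation is that these leftover rbd_data objects survive only as snapshot clones, in which case a plain "rados rm" on the head object fails with ENOENT. One way to check (a sketch, using one object name from the output above):

# Does the head object exist at all?
rados -p rbd stat rbd_data.112913b238e1f29.0000000000000e3f

# List snapshot clones of the object; if only clones remain,
# removing the head directly will not work.
rados -p rbd listsnaps rbd_data.112913b238e1f29.0000000000000e3f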

In the "current" directory of the OSD there are a lot more files with
this RBD prefix.
Is there any chance to delete this obviously orphaned stuff before the
PG becomes healthy?
(It is running now on only 2 of 3 OSDs.)
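
The degraded PG and its acting set can be inspected with standard commands (nothing cluster-specific assumed):

# Which OSDs currently hold PG 5.9b, and what state is it in?
ceph pg map 5.9b
ceph pg 5.9b query | head -n 40

# Overall health, including the degraded/undersized PG
ceph health detail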

What else could cause such a crash ?


We attach (hopefully all of) the relevant logs.



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
