Re: Ceph Luminous - OSD constantly crashing caused by corrupted placement group


 



Looks like something went a little wrong with the snapshot metadata in that PG. If the PG is still going active from the other copies, you're probably best off using ceph-objectstore-tool to remove it on the crashing OSD. You could then either replace it with an export from one of the other nodes, or let Ceph handle the backfilling on its own.
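
The two options above can be sketched roughly as follows. This is a hedged sketch, not a verified procedure: the OSD ids (12 for the crashing OSD, 7 for a healthy peer) and paths are placeholders, the affected OSD must be stopped before ceph-objectstore-tool touches its store, and FileStore OSDs may additionally need --journal-path:

# Stop the crashing OSD first (hypothetical id 12)
systemctl stop ceph-osd@12

# Option A: export the healthy copy of PG 5.9b from another OSD
# (also stopped), remove the corrupted copy, and import the export.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 \
    --pgid 5.9b --op export --file /tmp/pg5.9b.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --pgid 5.9b --op remove
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --pgid 5.9b --op import --file /tmp/pg5.9b.export

# Option B: just remove the corrupted copy and let Ceph backfill
# from the surviving replicas.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --pgid 5.9b --op remove

systemctl start ceph-osd@12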
-Greg

On Tue, May 15, 2018 at 2:13 AM Siegfried Höllrigl <siegfried.hoellrigl@xxxxxxxxxx> wrote:


Hi !

We have upgraded our Ceph cluster (3 MON servers, 9 OSD servers, 190
OSDs total) from 10.2.10 to Ceph 12.2.4 and then to 12.2.5.
(A mixture of Ubuntu 14 and 16 with the Repos from
https://download.ceph.com/debian-luminous/)

Now we have the problem that one OSD is crashing again and again
(approx. once per day). systemd restarts it.

We could now probably identify the problem. It looks like one placement
group (5.9b) causes the crash.
It doesn't seem to matter whether it is running on a FileStore or a
BlueStore OSD.
We could even narrow it down to some RBDs that were in this pool.
They are already deleted, but it looks like some objects are left on
the OSD, and we can't delete them:


rados -p rbd ls > radosrbdls.txt
grep -vE "($(rados -p rbd ls | grep rbd_header | grep -o "\.[0-9a-f]*" |
sed -e :a -e '$!N; s/\n/|/; ta' -e 's/\./\\./g'))" radosrbdls.txt |
grep -E '(rbd_data|journal|rbd_object_map)'
rbd_data.112913b238e1f29.0000000000000e3f
rbd_data.112913b238e1f29.00000000000009d2
rbd_data.112913b238e1f29.0000000000000ba3

rados -p rbd rm rbd_data.112913b238e1f29.0000000000000e3f
error removing rbd>rbd_data.112913b238e1f29.0000000000000e3f: (2) No
such file or directory
rados -p rbd rm rbd_data.112913b238e1f29.00000000000009d2
error removing rbd>rbd_data.112913b238e1f29.00000000000009d2: (2) No
such file or directory
rados -p rbd rm rbd_data.112913b238e1f29.0000000000000ba3
error removing rbd>rbd_data.112913b238e1f29.0000000000000ba3: (2) No
such file or directory
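
Since the RBDs were deleted, a possible explanation is that these leftover rbd_data objects survive only as snapshot clones, in which case a plain "rados rm" on the head object fails with ENOENT. One way to check (a sketch, using one object name from the output above):

# Does the head object exist at all?
rados -p rbd stat rbd_data.112913b238e1f29.0000000000000e3f

# List snapshot clones of the object; if only clones remain,
# removing the head directly will not work.
rados -p rbd listsnaps rbd_data.112913b238e1f29.0000000000000e3f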

In the "current" directory of the OSD there are a lot more files with
this RBD prefix.
Is there any chance to delete this obviously orphaned stuff before the
PG becomes healthy?
(It is running now on only 2 of 3 OSDs.)
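
The degraded PG and its acting set can be inspected with standard commands (nothing cluster-specific assumed):

# Which OSDs currently hold PG 5.9b, and what state is it in?
ceph pg map 5.9b
ceph pg 5.9b query | head -n 40

# Overall health, including the degraded/undersized PG
ceph health detail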

What else could cause such a crash ?


We attach (hopefully all of) the relevant logs.



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
