Hi,
Is there any chance to recover the other failing OSDs that seem to
hold a chunk of this PG? Do the other OSDs fail with the same error?
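If you haven't checked already, the crash module and the OSD logs
should tell you whether they hit the same assert; roughly something
like this (the crash id and OSD id are placeholders):
# ceph crash ls
# ceph crash info <crash-id>
# journalctl -u ceph-osd@<osd-id> | grep -B5 -A20 ceph_assert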
Quoting Jake Grimmett <jog@xxxxxxxxxxxxxxxxx>:
Dear All,
We are "in a bit of a pickle"...
There was no reply to my message of 23/03/2020 (subject "OSD: FAILED
ceph_assert(clone_size.count(clone))"), so I'm presuming it's not
possible to recover the crashed OSD.
This is bad news, as one PG may be lost (we are using EC 8+2; pg dump
shows [NONE,NONE,NONE,388,125,25,427,226,77,154]).
Without this PG we have 1.8PB of broken cephfs.
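(The shard list above is from "ceph pg dump"; I believe the PG can
also be examined in more detail with "ceph pg query":)
# ceph pg dump pgs | grep ^5.750
# ceph pg 5.750 query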
I could rebuild the cluster from scratch, but this means no user
backups for a couple of weeks.
The cluster has 10 nodes, uses an EC 8+2 pool for cephfs data
(replicated NVMe metadata pool), and is running Nautilus 14.2.8.
Clearly, it would be nicer if we could fix the OSD, but if this
isn't possible, can someone confirm that the right procedure to
recover from a corrupt PG is:
1) Stop all client access
2) Find all files that store data on the bad PG, with:
# cephfs-data-scan pg_files /backup 5.750 2> /dev/null > /root/bad_files
3) Delete all of these bad files - presumably using truncate? Or is
"rm" fine? (see the sketch after this list)
4) Destroy the bad PG
# ceph osd force-create-pg 5.750
5) Copy the missing files back with rsync or similar...
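In practice, for steps 3 and 5 I'm thinking of something like the
following (untested; <source_host> and the paths are placeholders,
and /backup is our cephfs mount point):
For step 3, I'm assuming plain "rm" is enough, since the unlink
itself is a metadata operation and object purging happens in the
background:
# while read -r f; do rm -f -- "$f"; done < /root/bad_files
For step 5, once the PG has been recreated, pull the deleted files
back from the original source, e.g.:
# rsync -av <source_host>:/export/path/ /backup/path/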
a better "recipe" or other advice gratefully received,
best regards,
Jake
****
Note: I am working from home until further notice.
For help, contact unixadmin@xxxxxxxxxxxxxxxxx
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Phone 01223 267019
Mobile 0776 9886539
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx