Hi,
During an OS upgrade from Ubuntu 18.04 to 20.04 we seem to have
triggered a bcache bug on three OSD hosts. These hosts serve a 6+2 EC
pool used by CephFS, so a number of PGs are affected by the bug. We
were able to restart two of the three hosts (and will run extra scrubs
on all PGs), but at least 7 PGs now have unfound objects.
I'm currently trying to find out which files are affected, so that we
can restore them from backup or, for files in dedicated scratch
directories, inform the users about the data corruption.
For most objects this works quite well. The 'parent' xattr attached to
each file's first object contains the file's backtrace, i.e. its
complete path within the filesystem, so locating the files should be
easy with a little help from ceph-dencoder. But some files do not have
the 'parent' xattr:
for pg in $(ceph health detail | grep active\+recovery_unfound | cut -d' ' -f 6); do
  echo $pg
  for obj in $(ceph pg $pg list_unfound | jq -r '.objects | .[] | .oid.oid' | cut -c1-11); do
    rados -p bcf_fs_data_rep getxattr $obj.00000000 parent > $obj.parent
  done
done
which gives:
77.1
error getting xattr bcf_fs_data_rep/1002a143927.00000000/parent: (2) No such file or directory
.....
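For the objects where getxattr does succeed, the saved *.parent files still have to be turned into paths. A minimal sketch, assuming ceph-dencoder and jq are installed and that dump_json prints the backtrace with an "ancestors" array whose first entry is the file's own dentry (the innermost one). The function names are hypothetical helpers, not part of any Ceph tool:

```shell
# Sketch only. Assumes ceph-dencoder (ceph-common) and jq are installed.
# The 'parent' xattr is a binary inode_backtrace_t. In its JSON dump,
# .ancestors[0] is the file's own dentry and the last entry hangs off
# the root directory, so the names are reversed before joining.

# Hypothetical helper: read backtrace JSON on stdin, print the path.
backtrace_json_to_path() {
  jq -r '"/" + (.ancestors | map(.dname) | reverse | join("/"))'
}

# Hypothetical helper: decode one file written by
# 'rados ... getxattr <obj> parent > <obj>.parent'.
decode_parent() {
  ceph-dencoder type inode_backtrace_t import "$1" decode dump_json |
    backtrace_json_to_path
}

# Hypothetical helper: print "<inode-hex><TAB><path>" for every non-empty
# *.parent file in the current directory. The shell redirect creates an
# empty file even when getxattr fails, so empty files are skipped.
decode_all_parents() {
  local f
  for f in *.parent; do
    [ -s "$f" ] || continue
    printf '%s\t%s\n' "${f%.parent}" "$(decode_parent "$f")"
  done
}
```

Running decode_all_parents in the directory used by the loop above would then list one path per recoverable object, while the empty files mark the objects that still need another approach.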
'bcf_fs_data_rep' is the first data pool of the filesystem, which is
where the 'parent' xattr should be stored. But for a number of objects
(58 of 928), the getxattr loop above cannot retrieve it.
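For objects without a 'parent' xattr, one possible fallback is the object name itself: CephFS data objects are named <inode number in hex>.<object index in hex>, so the prefix is the file's inode number, and find -inum on a mounted CephFS can search for it. This is a sketch under assumptions: the mount point argument is a placeholder, it assumes the client reports CephFS inode numbers as st_ino (the usual behaviour of the kernel client), and the helper names are hypothetical. An empty result only means that no linked file with that inode was found under the given directory:

```shell
# Sketch only. Assumption: the CephFS client exposes CephFS inode numbers
# as st_ino, so 'find -inum' can match them.

# Hypothetical helper: turn an object name such as 1002a143927.00000000
# into the decimal inode number (the hex part before the last dot).
oid_to_inode() {
  local oid=$1
  printf '%d\n' "$((16#${oid%.*}))"
}

# Hypothetical helper: search a directory tree (e.g. a CephFS mount point)
# for the file owning the given object; -xdev keeps find on one filesystem.
find_by_oid() {
  local mountpoint=$1 oid=$2
  find "$mountpoint" -xdev -inum "$(oid_to_inode "$oid")" -print
}
```

Walking a large filesystem once per object is slow; for dozens of objects it may be cheaper to combine several -inum tests (joined with -o) in a single find run.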
Questions:
1. If the 'parent' xattr is available neither in the first data pool
nor in the metadata pool, what state might these objects be in? Can
they be deleted using 'mark_unfound_lost delete'?
2. The list_unfound command prints at most 256 objects. How can this
limit be lifted, since some PGs have more unfound objects?
Regards,
Burkhard
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx