Hi Sascha,

On Tue, Dec 13, 2022 at 6:43 PM Sascha Lucas <ceph-users@xxxxxxxxx> wrote:
>
> Hi,
>
> On Mon, 12 Dec 2022, Sascha Lucas wrote:
>
> > On Mon, 12 Dec 2022, Gregory Farnum wrote:
>
> >> Yes, we’d very much like to understand this. What versions of the server
> >> and kernel client are you using? What platform stack — I see it looks like
> >> you are using CephFS through the volumes interface? The simplest
> >> possibility I can think of here is that you are running with a bad kernel
> >> and it used async ops poorly, maybe? But I don’t remember other spontaneous
> >> corruptions of this type anytime recent.
> >
> > Ceph "servers" like MONs, OSDs, MDSs etc. are all 17.2.5/cephadm/podman. The
> > filesystem kernel clients are co-located on the same hosts running the
> > "servers". For some other reason OS is still RHEL 8.5 (yes with community
> > ceph). Kernel is 4.18.0-348.el8.x86_64 from release media. Just one
> > filesystem kernel client is at 4.18.0-348.23.1.el8_5.x86_64 from EOL of 8.5.
> >
> > Are there known issues with these kernel versions?
> >
> >> Have you run a normal forward scrub (which is non-disruptive) to check if
> >> there are other issues?
> >
> > So far I haven't dared, but will do so tomorrow.
>
> Just an update: "scrub / recursive,repair" does not uncover additional
> errors. But also does not fix the single dirfrag error.

File system scrub does not clear entries from the damage list. The damage
type you are running into ("dir_frag") implies that the object for directory
"V_7770505" is lost from the metadata pool, which makes the files under that
directory unavailable.

The good news is that you can regenerate the lost object by scanning the
data pool. This is documented here:

https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#recovery-from-missing-metadata-objects

(You won't need to run the cephfs-table-tool or cephfs-journal-tool commands,
though. Also, this could take time if you have lots of objects in the data
pool.)

Since you mention that you do not see the directory "CV_MAGNETIC" and no
other scrub errors are reported, it's possible that the application using
CephFS removed it because it was no longer needed (the data pool might still
have some leftover objects, though).

> Thanks, Sascha.
>
> [2] https://www.spinics.net/lists/ceph-users/msg53202.html
> [3] https://docs.ceph.com/en/quincy/cephfs/disaster-recovery/#metadata-damage-and-repair

--
Cheers,
Venky
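
P.S. In case it helps, here is a rough sketch of the data-pool scan sequence
from the page linked above, plus the scrub/damage checks you could run
afterwards. The angle-bracket names (<data pool>, <fs_name>, <damage_id>)
are placeholders for your own cluster, and the full preconditions (e.g.
taking the file system offline before scanning) are in the documentation,
so please treat this as an outline rather than a recipe:

  # Regenerate missing directory objects by scanning the data pool
  # (this walks every object, so it can take a while on large pools):
  cephfs-data-scan scan_extents <data pool>
  cephfs-data-scan scan_inodes <data pool>
  cephfs-data-scan scan_links

  # Afterwards, re-run the forward scrub you already used and check the
  # damage list; entries that are actually repaired still have to be
  # cleared from the list explicitly:
  ceph tell mds.<fs_name>:0 scrub start / recursive,repair
  ceph tell mds.<fs_name>:0 damage ls
  ceph tell mds.<fs_name>:0 damage rm <damage_id>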