On 7/25/19 6:49 AM, Sangwhan Moon wrote:
> Hello,
>
> I've inherited a Ceph cluster from someone who has left zero documentation or any handover. A couple of days ago it decided to show the entire company what it is capable of..
>
> The health report looks like this:
>
> [root@host mnt]# ceph -s
>   cluster:
>     id:     809718aa-3eac-4664-b8fa-38c46cdbfdab
>     health: HEALTH_ERR
>             1 MDSs report damaged metadata
>             1 MDSs are read only
>             2 MDSs report slow requests
>             6 MDSs behind on trimming
>             Reduced data availability: 2 pgs stale
>             Degraded data redundancy: 2593/186803520 objects degraded (0.001%), 2 pgs degraded, 2 pgs undersized
>             1 slow requests are blocked > 32 sec. Implicated osds
>             716 stuck requests are blocked > 4096 sec. Implicated osds 25,31,38

I would start here:

>   services:
>     mon: 3 daemons, quorum f,rook-ceph-mon2,rook-ceph-mon0
>     mgr: a(active)
>     mds: ceph-fs-2/2/2 up odd-fs-2/2/2 up {[ceph-fs:0]=ceph-fs-5b997cbf7b-5tjwh=up:active,[ceph-fs:1]=ceph-fs-5b997cbf7b-nstqz=up:active,[user-fs:0]=odd-fs-5668c75f9f-hflps=up:active,[user-fs:1]=odd-fs-5668c75f9f-jf59x=up:active}, 4 up:standby-replay
>     osd: 39 osds: 39 up, 38 in
>
>   data:
>     pools:   5 pools, 706 pgs
>     objects: 91212k objects, 4415 GB
>     usage:   10415 GB used, 13024 GB / 23439 GB avail
>     pgs:     2593/186803520 objects degraded (0.001%)
>              703 active+clean
>              2   stale+active+undersized+degraded

This is a problem! Can you check:

$ ceph pg dump_stuck

The PGs will start with a number like 8.1a where '8' is the pool ID.

Then check:

$ ceph df

To which pools do those PGs belong?

Then check:

$ ceph pg <PGID> query

Somewhere at the bottom it should show why these PGs are not active.

You might even want to try a restart of the OSDs involved with those two PGs.

Wido

>              1   active+clean+scrubbing+deep
>
>   io:
>     client:   168 kB/s rd, 6336 B/s wr, 10 op/s rd, 1 op/s wr
>
> The offending broken MDS entry (damaged metadata) seems to be this:
>
> mds.ceph-fs-5b997cbf7b-5tjwh: [
>     {
>         "damage_type": "dir_frag",
>         "id": 1190692215,
>         "ino": 2199023258131,
>         "frag": "*",
>         "path": "/f/01/59"
>     }
> ]
>
> Does anyone have an idea how I can diagnose and find out what is wrong? For the other issues I'm not even sure what/where I need to look into.
>
> Cheers,
> Sangwhan
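
To make the steps above concrete, something like the sequence below could work as a starting point. The PG ID 8.1a is only a placeholder, 'ceph health detail' and 'ceph pg map' are additions beyond the commands named above (both help locate the PG IDs and their OSDs), and since the mon names suggest this cluster runs under Rook, the "rook-ceph" namespace and the OSD pod name are assumptions you will need to adjust to your deployment:

$ ceph health detail            # lists the stale/degraded PGs by ID
$ ceph pg dump_stuck stale
$ ceph df                       # match the pool ID (the part before the dot) to a pool
$ ceph pg 8.1a query            # placeholder PG ID; look at "recovery_state" near the bottom
$ ceph pg map 8.1a              # shows the up/acting OSDs for that PG

If restarting the involved OSDs is the next step (the health output implicates 25, 31 and 38), under Rook that usually means deleting the OSD pod so its deployment recreates it:

$ kubectl -n rook-ceph get pods | grep osd
$ kubectl -n rook-ceph delete pod <osd-25-pod-name>   # placeholder name, taken from the previous command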