Re: Fixing a HEALTH_ERR situation

I ended up taking Brett's recommendation and ran "ceph osd set noscrub" and "ceph osd set nodeep-scrub", then waited for the running scrubs to finish while watching "ceph -w" to see what it was doing. Eventually, it reported the following:

2019-05-18 16:08:44.032780 mon.gi-cba-01 [ERR] Health check update: Possible data damage: 1 pg inconsistent, 1 pg repair (PG_DAMAGED)

2019-05-18 16:10:13.748132 osd.41 [ERR] 2.798s0 soid 2:19e2f773:::1000255879d.00000000:head : object info inconsistent , attr name mismatch '_layout', attr name mismatch '_parent'

2019-05-18 16:12:24.575444 osd.41 [ERR] 2.798s0 soid 2:19e736e2:::10002558362.00000000:head : object info inconsistent , attr name mismatch '_layout', attr name mismatch '_parent'

2019-05-18 16:19:57.204557 osd.41 [ERR] 2.798s0 soid 2:19f62945:::10002558ed4.00000000:head : object info inconsistent , attr name mismatch '_layout', attr name mismatch '_parent'

2019-05-18 16:23:07.316487 osd.41 [ERR] 2.798s0 soid 2:19fc6ba9:::100025581cc.00000000:head : object info inconsistent , attr name mismatch '_layout', attr name mismatch '_parent'

2019-05-18 16:24:41.494480 osd.41 [ERR] 2.798s0 soid 2:19ffaa2a:::10002555405.00000000:head : object info inconsistent , attr name mismatch '_layout', attr name mismatch '_parent'

2019-05-18 16:24:52.869234 osd.41 [ERR] 2.798s0 repair 0 missing, 5 inconsistent objects

2019-05-18 16:24:52.870018 osd.41 [ERR] 2.798 repair 5 errors, 5 fixed

2019-05-18 16:24:54.047312 mon.gi-cba-01 [WRN] Health check failed: Degraded data redundancy: 5/632305016 objects degraded (0.000%), 1 pg degraded (PG_DEGRADED)

2019-05-18 16:24:54.047359 mon.gi-cba-01 [INF] Health check cleared: OSD_SCRUB_ERRORS (was: 5 scrub errors)

2019-05-18 16:24:54.047383 mon.gi-cba-01 [INF] Health check cleared: PG_DAMAGED (was: Possible data damage: 1 pg inconsistent, 1 pg repair)

2019-05-18 16:24:59.232439 mon.gi-cba-01 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 5/632305016 objects degraded (0.000%), 1 pg degraded)

2019-05-18 17:00:00.000099 mon.gi-cba-01 [WRN] overall HEALTH_WARN noscrub,nodeep-scrub flag(s) set


After that, I ran "ceph osd unset noscrub" and "ceph osd unset nodeep-scrub", and the system was back to HEALTH_OK. It still seems like black magic, but I guess I'm happy now... Thanks!
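For anyone who hits this later, the whole sequence boiled down to something like this (the pg repair itself had already been issued earlier in the thread; the flags just stop new scrubs from hogging the scrub slots until the repair gets its turn):

# ceph osd set noscrub
# ceph osd set nodeep-scrub
# ceph -w
# ceph osd unset noscrub
# ceph osd unset nodeep-scrub

The "ceph -w" step just watches the cluster log until the repair reports "5 errors, 5 fixed" and the health checks clear.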


On Sun, May 19, 2019 at 2:44 AM Paul Emmerich <paul.emmerich@xxxxxxxx> wrote:
Check out the log of the primary OSD in that PG to see what happened during scrubbing.
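Something along these lines should do it (assuming default log locations; osd.41 is just the first entry in the acting set from your health detail):

# ceph pg map 2.798
# less /var/log/ceph/ceph-osd.41.log

The first OSD listed in the acting set is the primary, and its log on that host should show what the scrub flagged. "rados list-inconsistent-obj 2.798 --format=json-pretty" will also dump the recorded inconsistencies if you'd rather not dig through the log.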

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Sun, May 19, 2019 at 12:41 AM Jorge Garcia <jgarcia@xxxxxxxxxxxx> wrote:
I have tried ceph pg repair several times. It reports "instructing pg 2.798s0 on osd.41 to repair", but then nothing happens as far as I can tell. Is there any way of knowing whether it's doing more?
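Is watching the pg state the right way to tell, e.g. something like:

# ceph pg 2.798 query

I'm guessing the scrub/repair state would show up in there somewhere, but I may be looking in the wrong place.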

On Sat, May 18, 2019 at 3:33 PM Brett Chancellor <bchancellor@xxxxxxxxxxxxxx> wrote:
I would try the ceph pg repair. If you see the pg go into deep scrubbing, then back to inconsistent, you probably have a bad drive. Find which of the drives in the pg is bad (pg query, or go to the host and look through dmesg). Take that OSD offline and mark it out. Once backfill is complete, it should clear up.
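Roughly like this, using your pg and its primary as the example (41 is just a placeholder; swap in whichever OSD actually turns out to be bad):

# ceph pg repair 2.798
# ceph pg 2.798 query
# dmesg -T | grep -i error
# systemctl stop ceph-osd@41
# ceph osd out 41

Run the dmesg check on each host that holds a shard of that pg. The stop/out steps are only for the drive you've confirmed is failing; once backfill completes, the inconsistency should clear.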

On Sat, May 18, 2019, 6:05 PM Jorge Garcia <jgarcia@xxxxxxxxxxxx> wrote:
We are testing a Ceph cluster, mostly using CephFS. We are using an erasure-coded pool and have been loading it up with data. Recently, we got a HEALTH_ERR response when querying the ceph status. We stopped all activity to the filesystem and waited to see if the error would go away. It didn't. Then we tried a couple of suggestions from the internet (ceph pg repair, ceph pg scrub, ceph pg deep-scrub), to no avail. I'm not sure how to find out more about what the problem is, or how to repair the filesystem and bring it back to normal health. Any suggestions?
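In case the exact syntax matters, the commands we tried were essentially these, with the pg id taken from the health detail below:

# ceph pg repair 2.798
# ceph pg scrub 2.798
# ceph pg deep-scrub 2.798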

Current status:

# ceph -s

  cluster:

    id:     28ef32f1-4350-491b-9003-b19b9c3a2076

    health: HEALTH_ERR

            5 scrub errors

            Possible data damage: 1 pg inconsistent

 

  services:

    mon: 3 daemons, quorum gi-cba-01,gi-cba-02,gi-cba-03

    mgr: gi-cba-01(active), standbys: gi-cba-02, gi-cba-03

    mds: backups-1/1/1 up  {0=gi-cbmd=up:active}

    osd: 87 osds: 87 up, 87 in

 

  data:

    pools:   2 pools, 4096 pgs

    objects: 90.98 M objects, 134 TiB

    usage:   210 TiB used, 845 TiB / 1.0 PiB avail

    pgs:     4088 active+clean

             5    active+clean+scrubbing+deep

             2    active+clean+scrubbing

             1    active+clean+inconsistent

# ceph health detail

HEALTH_ERR 5 scrub errors; Possible data damage: 1 pg inconsistent

OSD_SCRUB_ERRORS 5 scrub errors

PG_DAMAGED Possible data damage: 1 pg inconsistent

    pg 2.798 is active+clean+inconsistent, acting [41,50,17,2,86,70,61]

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
