Re: Ceph health error (was: Prioritize recovery over backfilling)

Caspar Smit <casparsmit@xxxxxxxxxxx> · Fri, 8 Jun 2018 07:57:42 +0200

Well, I let it run with the nodown flag set and it looked like it would finish, but somewhere along the way it all went wrong:

This is now the state:

    health: HEALTH_ERR
            nodown flag(s) set
            5602396/94833780 objects misplaced (5.908%)
            Reduced data availability: 143 pgs inactive, 142 pgs peering, 7 pgs stale
            Degraded data redundancy: 248859/94833780 objects degraded (0.262%), 194 pgs unclean, 21 pgs degraded, 12 pgs undersized
            11 stuck requests are blocked > 4096 sec

    pgs:     13.965% pgs not active
             248859/94833780 objects degraded (0.262%)
             5602396/94833780 objects misplaced (5.908%)
             830 active+clean
             75  remapped+peering
             66  peering
             26  active+remapped+backfill_wait
             6   active+undersized+degraded+remapped+backfill_wait
             6   active+recovery_wait+degraded+remapped
             3   active+undersized+degraded+remapped+backfilling
             3   stale+active+undersized+degraded+remapped+backfill_wait
             3   stale+active+remapped+backfill_wait
             2   active+recovery_wait+degraded
             2   active+remapped+backfilling
             1   activating+degraded+remapped
             1   stale+remapped+peering
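
As a cross-check on the summary lines, the per-state counts above can be tallied by hand. The sketch below is a minimal illustration, assuming a PG counts as "active" only when `active` is one of its `+`-separated flags (so `activating` and `peering` do not count). That rule is an assumption chosen for this example, not taken from the Ceph source, although it does reproduce the figures reported above.

```python
# PG state counts copied from the "pgs:" section of the status output above.
pg_states = {
    "active+clean": 830,
    "remapped+peering": 75,
    "peering": 66,
    "active+remapped+backfill_wait": 26,
    "active+undersized+degraded+remapped+backfill_wait": 6,
    "active+recovery_wait+degraded+remapped": 6,
    "active+undersized+degraded+remapped+backfilling": 3,
    "stale+active+undersized+degraded+remapped+backfill_wait": 3,
    "stale+active+remapped+backfill_wait": 3,
    "active+recovery_wait+degraded": 2,
    "active+remapped+backfilling": 2,
    "activating+degraded+remapped": 1,
    "stale+remapped+peering": 1,
}


def count_with(flag: str) -> int:
    """Sum the PGs whose state string contains `flag` as a whole '+'-separated token."""
    return sum(n for state, n in pg_states.items() if flag in state.split("+"))


total = sum(pg_states.values())
inactive = total - count_with("active")       # assumption: no 'active' flag means inactive
unclean = total - pg_states["active+clean"]   # everything that is not active+clean
inactive_pct = f"{100 * inactive / total:.3f}"

print(f"total pgs:  {total}")
print(f"inactive:   {inactive} ({inactive_pct}%)")
print(f"peering:    {count_with('peering')}")
print(f"stale:      {count_with('stale')}")
print(f"unclean:    {unclean}")
print(f"degraded:   {count_with('degraded')}")
print(f"undersized: {count_with('undersized')}")
```

Under that assumption the totals agree with the health summary: 143 inactive out of 1024 PGs (13.965%), 142 peering, 7 stale, 194 unclean, 21 degraded and 12 undersized.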

#ceph health detail shows:

REQUEST_STUCK 11 stuck requests are blocked > 4096 sec
    11 ops are blocked > 16777.2 sec
    osds 4,7,23,24 have stuck requests > 16777.2 sec

So what happened, and what should I do now?

Thank you very much for any help.

Kind regards,
Caspar

2018-06-07 13:33 GMT+02:00 Sage Weil <sage@xxxxxxxxxxxx>:
On Wed, 6 Jun 2018, Caspar Smit wrote:

> Hi all,
>
> We have a Luminous 12.2.2 cluster with 3 nodes and I recently added a node
> to it.
>
> osd-max-backfills is at the default of 1, so backfilling didn't go very fast,
> but that doesn't matter.
>
> Once it started backfilling everything looked ok:
>
> ~300 pgs in backfill_wait
> ~10 pgs backfilling (~number of new OSDs)
>
> But I noticed the degraded objects increasing a lot. I presume a pg that is
> in backfill_wait state doesn't accept any new writes anymore? Hence
> increasing the degraded objects?
>
> So far so good, but once in a while I noticed a random OSD flapping (they come
> back up automatically). This isn't because the disk is saturated but a
> driver/controller/kernel incompatibility which 'hangs' the disk for a short
> time (scsi abort_task error in syslog). Investigating further I noticed
> this was already the case before the node expansion.
>
> These flapping OSDs result in lots of pg states which are a bit worrying:
>
>              109 active+remapped+backfill_wait
>              80  active+undersized+degraded+remapped+backfill_wait
>              51  active+recovery_wait+degraded+remapped
>              41  active+recovery_wait+degraded
>              27  active+recovery_wait+undersized+degraded+remapped
>              14  active+undersized+remapped+backfill_wait
>              4   active+undersized+degraded+remapped+backfilling
>
> I think the recovery_wait is more important than the backfill_wait, so I'd
> like to prioritize these because the recovery_wait was triggered by the
> flapping OSDs.

Just a note: this is fixed in mimic.  Previously, we would choose the
highest-priority PG to start recovery on at the time, but once recovery
had started, the appearance of a new PG with a higher priority (e.g.,
because it finished peering after the others) wouldn't preempt/cancel the
other PG's recovery, so you would get behavior like the above.

Mimic implements that preemption, so you should not see behavior like
this.  (If you do, then the function that assigns a priority score to a
PG needs to be tweaked.)

sage
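
The difference between the two behaviors described above can be illustrated with a toy scheduler. This is only a sketch under stated assumptions, not Ceph's implementation: the `PG` class, the `simulate` function, the single recovery slot, the tick-based clock and the priority values are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class PG:
    name: str
    priority: int  # hypothetical score; higher means more urgent
    arrival: int   # tick at which the PG finishes peering and becomes eligible
    work: int      # ticks of recovery work needed (must be >= 1)


def simulate(pgs, preempt):
    """Run one recovery slot tick by tick; return {pg name: tick at which it finished}."""
    remaining = {pg.name: pg.work for pg in pgs}
    finished = {}
    current = None
    tick = 0
    while len(finished) < len(pgs):
        # PGs that have arrived and still have work left.
        ready = [pg for pg in pgs if pg.arrival <= tick and remaining[pg.name] > 0]
        if ready:
            best = max(ready, key=lambda pg: pg.priority)
            slot_free = current is None or remaining[current.name] == 0
            # Without preemption a running PG keeps the slot until it is done.
            if slot_free or (preempt and best.priority > current.priority):
                current = best
            remaining[current.name] -= 1
            if remaining[current.name] == 0:
                finished[current.name] = tick + 1
        tick += 1
    return finished


pgs = [
    PG("backfill_pg", priority=1, arrival=0, work=5),
    PG("recovery_pg", priority=10, arrival=1, work=2),  # peers one tick later
]

print("no preemption:", simulate(pgs, preempt=False))
print("preemption:   ", simulate(pgs, preempt=True))
```

In this toy run, without preemption the higher-priority PG has to wait for the backfill that started first and finishes at tick 7. With preemption it takes over the slot as soon as it becomes eligible and finishes at tick 3, and the backfill completes afterwards.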
