Re: Ceph Health error right after starting balancer

Requests stuck for > 2 hours cannot be attributed to "IO load on the cluster".

Looks like some OSDs really are stuck. Things to try (example commands below):

* run "ceph daemon osd.X dump_blocked_ops" on one of the affected OSDs to see what is stuck
* try restarting the affected OSDs to see if the problem clears up
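
For example, taking osd.62 from the report below (a minimal sketch assuming a
systemd-based deployment; substitute any of the affected OSD ids):

    # on the host that runs the OSD, via its admin socket:
    ceph daemon osd.62 dump_blocked_ops

    # if the ops stay blocked, restart just that one OSD daemon:
    systemctl restart ceph-osd@62

    # then check whether the blocked/stuck counters drop:
    ceph health detail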


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Oct 31, 2019 at 2:27 PM Thomas Schneider <74cmonty@xxxxxxxxx> wrote:
>
> Hi,
>
> after enabling the Ceph balancer (with the command "ceph balancer on"),
> the health status changed to error.
> This is the current output of ceph health detail:
> root@ld3955:~# ceph health detail
> HEALTH_ERR 1438 slow requests are blocked > 32 sec; 861 stuck requests are blocked > 4096 sec; mon ld5505 is low on available space
> REQUEST_SLOW 1438 slow requests are blocked > 32 sec
>     683 ops are blocked > 2097.15 sec
>     436 ops are blocked > 1048.58 sec
>     191 ops are blocked > 524.288 sec
>     78 ops are blocked > 262.144 sec
>     35 ops are blocked > 131.072 sec
>     11 ops are blocked > 65.536 sec
>     4 ops are blocked > 32.768 sec
>     osd.62 has blocked requests > 65.536 sec
>     osds 39,72 have blocked requests > 262.144 sec
>     osds 6,19,67,173,174,187,188,269,434 have blocked requests > 524.288 sec
>     osds 8,16,35,36,37,61,63,64,68,73,75,178,186,271,369,420,429,431,433,436 have blocked requests > 1048.58 sec
>     osds 3,5,7,24,34,38,40,41,59,66,69,74,180,270,370,421,432,435 have blocked requests > 2097.15 sec
> REQUEST_STUCK 861 stuck requests are blocked > 4096 sec
>     25 ops are blocked > 8388.61 sec
>     836 ops are blocked > 4194.3 sec
>     osds 2,28,29,32,60,65,181,185,268,368,423,424,426 have stuck requests > 4194.3 sec
>     osds 0,30,70,71,184 have stuck requests > 8388.61 sec
>
> I understand that when the balancer starts shifting PGs to other OSDs,
> this causes IO load on the cluster.
> However, I don't understand why this is affecting the OSDs so heavily.
> And I don't understand why OSDs of specific types (SSD, NVMe) suffer
> although no balancing is occurring on them.
>
> Regards
> Thomas
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx