Hi,

after enabling the ceph balancer (with the command "ceph balancer on") the health status changed to error.
This is the current output of ceph health detail:

root@ld3955:~# ceph health detail
HEALTH_ERR 1438 slow requests are blocked > 32 sec; 861 stuck requests are blocked > 4096 sec; mon ld5505 is low on available space
REQUEST_SLOW 1438 slow requests are blocked > 32 sec
    683 ops are blocked > 2097.15 sec
    436 ops are blocked > 1048.58 sec
    191 ops are blocked > 524.288 sec
    78 ops are blocked > 262.144 sec
    35 ops are blocked > 131.072 sec
    11 ops are blocked > 65.536 sec
    4 ops are blocked > 32.768 sec
    osd.62 has blocked requests > 65.536 sec
    osds 39,72 have blocked requests > 262.144 sec
    osds 6,19,67,173,174,187,188,269,434 have blocked requests > 524.288 sec
    osds 8,16,35,36,37,61,63,64,68,73,75,178,186,271,369,420,429,431,433,436 have blocked requests > 1048.58 sec
    osds 3,5,7,24,34,38,40,41,59,66,69,74,180,270,370,421,432,435 have blocked requests > 2097.15 sec
REQUEST_STUCK 861 stuck requests are blocked > 4096 sec
    25 ops are blocked > 8388.61 sec
    836 ops are blocked > 4194.3 sec
    osds 2,28,29,32,60,65,181,185,268,368,423,424,426 have stuck requests > 4194.3 sec
    osds 0,30,70,71,184 have stuck requests > 8388.61 sec

I understand that when the balancer starts shifting PGs to other OSDs, this causes additional I/O load on the cluster.
However, I don't understand why this is affecting the OSDs so heavily.
And I don't understand why OSDs of a specific type (SSD, NVMe) suffer although there is no balancing occurring on them.

Regards
Thomas
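
P.S. For reference, this is what I plan to try next to pause the balancer and throttle the recovery traffic while I investigate (the values are only my guesses, not tested recommendations):

    # stop the balancer from scheduling further PG moves
    ceph balancer off
    ceph balancer status

    # reduce backfill/recovery concurrency per OSD so client I/O recovers
    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1

Please tell me if throttling like this is the right approach, or if the blocked requests point to a different problem.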