How to debug IO hanging on a dead OSD?

Hello.

I'm running a small experimental setup: two hosts with a few OSDs each. One OSD was taken down intentionally, but even though the second (alive) OSD is on a different host, I see that all IO (rbd, and even rados get) has been hanging for a long time (more than 30 minutes already).
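
To tie the hang to a specific placement group, the object-to-PG mapping can be checked (a quick sketch; "testobj" is just a placeholder object name, and "ssd" is the pool from the outputs below):

ceph osd map ssd testobj

This prints the PG id and its up/acting OSD sets, so a hung "rados get" can be matched against the stuck PGs shown further down.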

My configuration:

 -9 2.00000 root ssd
-11 1.00000     host ssd-pp7
  9 1.00000         osd.9        down        0 1.00000
-12 1.00000     host ssd-pp11
  1 0.25000         osd.1          up  1.00000 1.00000
  2 0.25000         osd.2          up  1.00000 1.00000
  3 0.25000         osd.3          up  1.00000 1.00000
 11 0.25000         osd.11         up  1.00000 1.00000

The pg map shows that the acting OSD set was moved from '9' to the others.
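
The mapping for a single PG can be confirmed directly (26.0 is one of the stuck PGs from the health output below):

ceph pg map 26.0

which prints its current up and acting OSD sets.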

 ceph health detail
HEALTH_ERR 5 pgs are stuck inactive for more than 300 seconds; 5 pgs degraded; 5 pgs stuck inactive; 8 pgs stuck unclean; 5 pgs undersized; 53 requests are blocked > 32 sec; 2 osds have slow requests; recovery 2538/8200 objects degraded (30.951%); recovery 1562/8200 objects misplaced (19.049%); too few PGs per OSD (1 < min 30)

pg 26.0 is stuck inactive for 1429.756078, current state undersized+degraded+peered, last acting [1]
pg 26.7 is stuck inactive for 1429.751221, current state undersized+degraded+peered, last acting [2]
pg 26.2 is stuck inactive for 1429.749713, current state undersized+degraded+peered, last acting [1]
pg 26.6 is stuck inactive for 1429.763065, current state undersized+degraded+peered, last acting [2]
pg 26.5 is stuck inactive for 1429.754325, current state undersized+degraded+peered, last acting [1]
pg 26.0 is stuck unclean for 1429.756101, current state undersized+degraded+peered, last acting [1]
pg 26.1 is stuck unclean for 1429.778469, current state active+remapped, last acting [11,3]
pg 26.2 is stuck unclean for 1429.749733, current state undersized+degraded+peered, last acting [1]
pg 26.3 is stuck unclean for 1429.796471, current state active+remapped, last acting [1,2]
pg 26.4 is stuck unclean for 1429.762425, current state active+remapped, last acting [1,3]
pg 26.5 is stuck unclean for 1429.754349, current state undersized+degraded+peered, last acting [1]
pg 26.6 is stuck unclean for 1429.763094, current state undersized+degraded+peered, last acting [2]
pg 26.7 is stuck unclean for 1429.751259, current state undersized+degraded+peered, last acting [2]
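
To find out why these PGs are stuck in peered, querying one of them is usually the most telling step:

ceph pg 26.0 query

The output includes the PG's peering history under "recovery_state" and lists any "blocked_by" OSDs.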

root@pp11:~# ceph osd pool stats  ssd
pool ssd id 26
  nothing is going on
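
The blocked requests from the health output can be inspected through the admin socket, run on the host carrying one of the slow OSDs (osd.1 here, one of the acting OSDs above):

ceph daemon osd.1 dump_ops_in_flight

Each listed op shows its age and the event it is currently waiting on, which tells whether it is queued behind peering.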

mons are in quorum (all up)

osd dump:

osd.9 down out weight 0 up_from 1055 up_thru 1085 down_at 1089 last_clean_interval [1017,1052) 78.140.137.210:6800/29731 78.140.137.210:6801/29731 78.140.137.210:6802/29731 78.140.137.210:6803/29731 autoout,exists 2fc49cd5-e48c-4189-a67b-229d09378d1c
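
A state of undersized+degraded+peered means the PG has fewer than min_size replicas available, and a peered PG does not serve IO, which would explain the hang: assuming the CRUSH rule for root ssd replicates across hosts (the default), losing ssd-pp7 leaves at most one replica per PG, so with a replicated pool whose min_size is 2 those PGs cannot go active while the host is down. The settings can be checked with:

ceph osd pool get ssd size
ceph osd pool get ssd min_size

If min_size equals size (e.g. 2/2), one workaround for a two-host test setup is to lower it so that single-replica PGs serve IO again, at the cost of running without redundancy:

ceph osd pool set ssd min_size 1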



What should normally happen in this situation, and why didn't it happen?
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


