How to debug IO hanging on a dead OSD?

Hello.

I'm running a small experimental setup: two hosts with a few OSDs each. One OSD was taken down intentionally, but even though the second (alive) OSD is on a different host, I see that all IO (rbd, and even rados get) has been hanging for a long time (more than 30 minutes already).
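
To tie the hang to a specific placement group, the object-to-PG mapping can be checked (a quick sketch; "testobj" is just a placeholder object name, and "ssd" is the pool from the outputs below):

ceph osd map ssd testobj

This prints the PG id and its up/acting OSD sets, so a hung "rados get" can be matched against the stuck PGs shown further down.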

My configuration:

 -9 2.00000 root ssd
-11 1.00000     host ssd-pp7
  9 1.00000         osd.9        down        0 1.00000
-12 1.00000     host ssd-pp11
  1 0.25000         osd.1          up  1.00000 1.00000
  2 0.25000         osd.2          up  1.00000 1.00000
  3 0.25000         osd.3          up  1.00000 1.00000
 11 0.25000         osd.11         up  1.00000 1.00000

The pg map shows that the acting OSD set was moved from '9' to the others.
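
The mapping for a single PG can be confirmed directly (26.0 is one of the stuck PGs from the health output below):

ceph pg map 26.0

which prints its current up and acting OSD sets.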

 ceph health detail
HEALTH_ERR 5 pgs are stuck inactive for more than 300 seconds; 5 pgs degraded; 5 pgs stuck inactive; 8 pgs stuck unclean; 5 pgs undersized; 53 requests are blocked > 32 sec; 2 osds have slow requests; recovery 2538/8200 objects degraded (30.951%); recovery 1562/8200 objects misplaced (19.049%); too few PGs per OSD (1 < min 30)

pg 26.0 is stuck inactive for 1429.756078, current state undersized+degraded+peered, last acting [1]
pg 26.7 is stuck inactive for 1429.751221, current state undersized+degraded+peered, last acting [2]
pg 26.2 is stuck inactive for 1429.749713, current state undersized+degraded+peered, last acting [1]
pg 26.6 is stuck inactive for 1429.763065, current state undersized+degraded+peered, last acting [2]
pg 26.5 is stuck inactive for 1429.754325, current state undersized+degraded+peered, last acting [1]
pg 26.0 is stuck unclean for 1429.756101, current state undersized+degraded+peered, last acting [1]
pg 26.1 is stuck unclean for 1429.778469, current state active+remapped, last acting [11,3]
pg 26.2 is stuck unclean for 1429.749733, current state undersized+degraded+peered, last acting [1]
pg 26.3 is stuck unclean for 1429.796471, current state active+remapped, last acting [1,2]
pg 26.4 is stuck unclean for 1429.762425, current state active+remapped, last acting [1,3]
pg 26.5 is stuck unclean for 1429.754349, current state undersized+degraded+peered, last acting [1]
pg 26.6 is stuck unclean for 1429.763094, current state undersized+degraded+peered, last acting [2]
pg 26.7 is stuck unclean for 1429.751259, current state undersized+degraded+peered, last acting [2]
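
To find out why these PGs are stuck in peered, querying one of them is usually the most telling step:

ceph pg 26.0 query

The output includes the PG's peering history under "recovery_state" and lists any "blocked_by" OSDs.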

root@pp11:~# ceph osd pool stats  ssd
pool ssd id 26
  nothing is going on
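
The blocked requests from the health output can be inspected through the admin socket, run on the host carrying one of the slow OSDs (osd.1 here, one of the acting OSDs above):

ceph daemon osd.1 dump_ops_in_flight

Each listed op shows its age and the event it is currently waiting on, which tells whether it is queued behind peering.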

mons are in quorum (all up)

osd dump:

osd.9 down out weight 0 up_from 1055 up_thru 1085 down_at 1089 last_clean_interval [1017,1052) 78.140.137.210:6800/29731 78.140.137.210:6801/29731 78.140.137.210:6802/29731 78.140.137.210:6803/29731 autoout,exists 2fc49cd5-e48c-4189-a67b-229d09378d1c
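
A state of undersized+degraded+peered means the PG has fewer than min_size replicas available, and a peered PG does not serve IO, which would explain the hang: assuming the CRUSH rule for root ssd replicates across hosts (the default), losing ssd-pp7 leaves at most one replica per PG, so with a replicated pool whose min_size is 2 those PGs cannot go active while the host is down. The settings can be checked with:

ceph osd pool get ssd size
ceph osd pool get ssd min_size

If min_size equals size (e.g. 2/2), one workaround for a two-host test setup is to lower it so that single-replica PGs serve IO again, at the cost of running without redundancy:

ceph osd pool set ssd min_size 1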



What should normally happen in this situation, and why didn't it happen?
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


