Hi there,

today I had an OSD crash with Ceph 0.87 (giant) which made my whole cluster unusable for 45 minutes.

It began with a disk error:

sd 0:1:2:0: [sdc] CDB: Read(10)Read(10):: 28 28 00 00 0d 15 fe d0 fd 7b e8 f8 00 00 00 00 b0 08 00 00
XFS (sdc1): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.

Then most of the other OSDs noticed that osd.3 was down:

2014-12-16 08:45:15.873478 mon.0 10.67.1.11:6789/0 3361077 : cluster [INF] osd.3 10.67.1.11:6810/713621 failed (42 reports from 35 peers after 23.642482 >= grace 23.348982)

Five minutes later the OSD was marked out:

2014-12-16 08:50:21.095903 mon.0 10.67.1.11:6789/0 3361367 : cluster [INF] osd.3 out (down for 304.581079)

However, from 08:45 until 09:20 I had 1000 slow requests and 107 incomplete PGs, and many requests were not answered:

2014-12-16 08:46:03.029094 mon.0 10.67.1.11:6789/0 3361126 : cluster [INF] pgmap v6930583: 4224 pgs: 4117 active+clean, 107 incomplete; 7647 GB data, 19090 GB used, 67952 GB / 87042 GB avail; 2307 kB/s rd, 2293 kB/s wr, 407 op/s

Recovery to another OSD did not start either. It seems osd.3 thought it was still up while all the other OSDs thought it was down. I found this in the log of osd.3:

ceph-osd.3.log:2014-12-16 08:45:19.319152 7faf81296700 0 log_channel(default) log [WRN] : map e61177 wrongly marked me down
ceph-osd.3.log: -440> 2014-12-16 08:45:19.319152 7faf81296700 0 log_channel(default) log [WRN] : map e61177 wrongly marked me down

Luckily I was able to restart osd.3, after which everything worked again, but I do not understand what happened. The cluster was simply not usable for 45 minutes.

Any ideas?

Thanks
Christoph