Hi,

I have some strange OSD problems. Before the weekend I started some rsync tests over CephFS, on a cache pool with an underlying EC KV pool. Today the cluster is completely degraded:

[root@ceph003 ~]# ceph status
    cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
     health HEALTH_WARN 19 pgs backfill_toofull; 403 pgs degraded; 168 pgs down; 8 pgs incomplete; 168 pgs peering; 61 pgs stale; 403 pgs stuck degraded; 176 pgs stuck inactive; 61 pgs stuck stale; 589 pgs stuck unclean; 403 pgs stuck undersized; 403 pgs undersized; 300 requests are blocked > 32 sec; recovery 15170/27902361 objects degraded (0.054%); 1922/27902361 objects misplaced (0.007%); 1 near full osd(s)
     monmap e1: 3 mons at {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 8, quorum 0,1,2 ceph001,ceph002,ceph003
     mdsmap e5: 1/1/1 up {0=ceph003=up:active}, 2 up:standby
     osdmap e719: 48 osds: 18 up, 18 in
      pgmap v144887: 1344 pgs, 4 pools, 4139 GB data, 2624 kobjects
            2282 GB used, 31397 GB / 33680 GB avail
            15170/27902361 objects degraded (0.054%); 1922/27902361 objects misplaced (0.007%)
                  68 down+remapped+peering
                   1 active
                 754 active+clean
                   1 stale+incomplete
                   1 stale+active+clean+scrubbing
                  14 active+undersized+degraded+remapped
                   7 incomplete
                 100 down+peering
                   9 active+remapped
                  59 stale+active+undersized+degraded
                  19 active+undersized+degraded+remapped+backfill_toofull
                 311 active+undersized+degraded

I tried to figure out what happened in the global logs:

2014-09-13 08:01:19.433313 mon.0 10.141.8.180:6789/0 66076 : [INF] pgmap v65892: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 4159 kB/s wr, 45 op/s
2014-09-13 08:01:20.443019 mon.0 10.141.8.180:6789/0 66078 : [INF] pgmap v65893: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 561 kB/s wr, 11 op/s
2014-09-13 08:01:20.777988 mon.0 10.141.8.180:6789/0 66081 : [INF] osd.19 10.141.8.181:6809/29664 failed (3 reports from 3 peers after 20.000079 >= grace 20.000000)
2014-09-13 08:01:21.455887 mon.0 10.141.8.180:6789/0 66083 : [INF] osdmap e117: 48 osds: 47 up, 48 in
2014-09-13 08:01:21.462084 mon.0 10.141.8.180:6789/0 66084 : [INF] pgmap v65894: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 1353 kB/s wr, 13 op/s
2014-09-13 08:01:21.477007 mon.0 10.141.8.180:6789/0 66085 : [INF] pgmap v65895: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 2300 kB/s wr, 21 op/s
2014-09-13 08:01:22.456055 mon.0 10.141.8.180:6789/0 66086 : [INF] osdmap e118: 48 osds: 47 up, 48 in
2014-09-13 08:01:22.462590 mon.0 10.141.8.180:6789/0 66087 : [INF] pgmap v65896: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 13686 kB/s wr, 5 op/s
2014-09-13 08:01:23.464302 mon.0 10.141.8.180:6789/0 66088 : [INF] pgmap v65897: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 11075 kB/s wr, 4 op/s
2014-09-13 08:01:24.477467 mon.0 10.141.8.180:6789/0 66089 : [INF] pgmap v65898: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 4932 kB/s wr, 38 op/s
2014-09-13 08:01:25.481027 mon.0 10.141.8.180:6789/0 66090 : [INF] pgmap v65899: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 5726 kB/s wr, 64 op/s
2014-09-13 08:01:19.336173 osd.1 10.141.8.180:6803/26712 54442 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.000137 secs
2014-09-13 08:01:19.336341 osd.1 10.141.8.180:6803/26712 54443 : [WRN] slow request 30.000137 seconds old, received at 2014-09-13 08:00:49.335339: osd_op(client.7448.1:17751783 10000203eac.0000000e [write 0~319488 [1@-1],startsync 0~0] 1.b6c3a3a9 snapc 1=[] ondisk+write e116) currently reached pg
2014-09-13 08:01:20.337602 osd.1 10.141.8.180:6803/26712 54444 : [WRN] 7 slow requests, 6 included below; oldest blocked for > 31.001947 secs
2014-09-13 08:01:20.337688 osd.1 10.141.8.180:6803/26712 54445 : [WRN] slow request 30.998110 seconds old, received at 2014-09-13 08:00:49.339176: osd_op(client.7448.1:17751787 10000203eac.0000000e [write 319488~65536 [1@-1],startsync 0~0]

This is happening OSD after OSD. I tried to check the individual logs of the OSDs, but they all stop abruptly (also the logs of the OSDs that are still running):

2014-09-12 14:25:51.205276 7f3517209700 0 log [WRN] : 41 slow requests, 1 included below; oldest blocked for > 38.118088 secs
2014-09-12 14:25:51.205337 7f3517209700 0 log [WRN] : slow request 36.558286 seconds old, received at 2014-09-12 14:25:14.646836: osd_op(client.7448.1:2458392 1000006328f.0000000b [write 3989504~204800 [1@-1],startsync 0~0] 1.9337bf4b snapc 1=[] ondisk+write e116) currently reached pg
2014-09-12 14:25:53.205586 7f3517209700 0 log [WRN] : 30 slow requests, 1 included below; oldest blocked for > 40.118530 secs
2014-09-12 14:25:53.205679 7f3517209700 0 log [WRN] : slow request 30.541026 seconds old, received at 2014-09-12 14:25:22.664538: osd_op(client.7448.1:2460291 100000632b7.00000000 [write 0~691 [1@-1],startsync 0~0] 1.994248a8 snapc 1=[] ondisk+write e116) currently reached pg
2014-09-12 17:52:40.503917 7f34e8ed2700 0 -- 10.141.8.181:6809/29664 >> 10.141.8.181:6847/62389 pipe(0x247ce040 sd=327 :6809 s=0 pgs=0 cs=0 l=1 c=0x1bc8b9c0).accept replacing existing (lossy) channel (new one lossy=1)

I *think* the missing log entries are related to another issue I just found (http://tracker.ceph.com/issues/9470), so I can't find the original problem in the log files. Is there any other way I can find out what started the crashing of 30 OSDs?

Thanks!!

Kenneth
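P.S. In the meantime, this is roughly what I plan to check next on one of the hosts with crashed OSDs, to see whether the kernel OOM killer took the daemons down or whether they died on an assert. The OSD id (19) and the default log locations are just examples for our setup:

[root@ceph003 ~]# ceph osd tree | grep down                                       # which OSDs went down, on which hosts
[root@ceph003 ~]# dmesg | grep -i -e 'out of memory' -e 'killed process'          # did the OOM killer hit the OSDs?
[root@ceph003 ~]# grep -i oom-killer /var/log/messages                            # same check, but survives a reboot
[root@ceph003 ~]# tail -n 100 /var/log/ceph/ceph-osd.19.log                       # how does the OSD log end?
[root@ceph003 ~]# grep -n -A 10 -e 'FAILED assert' -e 'Caught signal' /var/log/ceph/ceph-osd.19.log   # any crash backtrace?

If nothing shows up there, I guess I can raise the debug levels (e.g. debug osd = 20 and debug ms = 1 in ceph.conf) and restart a single crashed OSD to try to catch the failure in its log.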