> The osd.8 log shows it doing some deep scrubbing here. Perhaps that is

When I first noticed the CPU usage, I checked iotop and iostat. Both said there was no disk activity on any OSD.
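(For reference, the checks were along these lines; the exact flags below are approximate, not copied from my shell history:)

iotop -o -d 5     # only show threads actually doing I/O, 5-second refresh (flags approximate)
iostat -x 5 3     # extended per-device stats, 5-second samples (flags approximate)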
Taking osd.8 down (regardless of the noout flag) was the only way to get things to respond. I have not set nodown, just noout.

When I got in this morning, I had 4 more flapping OSDs: osd.4, osd.12, osd.13, and osd.6. All 4 daemons were using 100% CPU and doing no disk I/O. osd.1 and osd.14 are the only ones currently using disk I/O.

There are 3 PGs being deep-scrubbed:

root@ceph1c:/var/log/radosgw-agent# ceph pg dump | grep deep
dumped all in format plain
pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
11.774 8682 0 0 0 7614655060 3001 3001 active+clean+scrubbing+deep 2014-03-27 10:20:30.598032 8381'5180514 8521:6520833 [13,4] [13,4] 7894'5176984 2014-03-20 04:41:48.762996 7894'5176984 2014-03-20 04:41:48.762996
11.698 8587 0 0 0 7723737171 3001 3001 active+clean+scrubbing+deep 2014-03-27 10:16:31.292487 8383'483312 8521:618864 [14,1] [14,1] 7894'479783 2014-03-20 03:53:18.024015 7894'479783 2014-03-20 03:53:18.024015
11.d8 8743 0 0 0 7570365909 3409 3409 active+clean+scrubbing+deep 2014-03-27 10:15:39.558121 8396'1753407 8521:2417672 [12,6] [12,6] 7894'1459230 2014-03-20 02:40:22.123236 7894'1459230 2014-03-20 02:40:22.123236

These PGs are on the 6 OSDs mentioned. osd.1 and osd.14 are not using 100% CPU, and they are using disk I/O. osd.12, osd.6, osd.4, and osd.13 are using 100% CPU and 0 kB/s of disk I/O.

Here's iostat on ceph0c, which contains osd.1 (/dev/sdd), osd.4 (/dev/sde), and osd.6 (/dev/sdh):

root@ceph0c:/var/log/ceph# iostat -p sdd,sde,sdh 1
Linux 3.5.0-46-generic (ceph0c)     03/27/2014     _x86_64_     (8 CPU)

<snip>

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          32.64    0.00    5.52    4.42    0.00   57.42

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd             113.00       900.00         0.00        900          0
sdd1            113.00       900.00         0.00        900          0
sde               0.00         0.00         0.00          0          0
sde1              0.00         0.00         0.00          0          0
sdh               0.00         0.00         0.00          0          0
sdh1              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          29.90    0.00    4.41    2.82    0.00   62.87

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd             181.00      1332.00         0.00       1332          0
sdd1            181.00      1332.00         0.00       1332          0
sde              22.00         8.00       328.00          8        328
sde1             18.00         8.00       328.00          8        328
sdh              18.00         4.00       228.00          4        228
sdh1             15.00         4.00       228.00          4        228

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          30.21    0.00    4.26    1.71    0.00   63.82

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd             180.00      1044.00       200.00       1044        200
sdd1            177.00      1044.00       200.00       1044        200
sde               0.00         0.00         0.00          0          0
sde1              0.00         0.00         0.00          0          0
sdh               0.00         0.00         0.00          0          0
sdh1              0.00         0.00         0.00          0          0

So it's not zero disk activity, but it's pretty close. The disks continue to show 0 kB_read and 0 kB_wrtn for the next 60 seconds. That's much lower than I would expect from OSDs executing a deep scrub.

I restarted the 4 flapping OSDs. They recovered, then started flapping again within 5 minutes. I shut all of the ceph daemons down and rebooted all nodes at the same time. The OSDs return to 100% CPU usage very soon after boot.

I was going to ask if I should zap osd.8 and re-add it to the cluster. I don't think that's possible now.
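(In case it's useful: what I had in mind for zapping and re-adding an OSD was roughly the standard removal/re-creation sequence; the device path below is just a placeholder, not my actual disk:)

ceph osd out 8
# stop the osd.8 daemon on its host first, then remove it from the cluster:
ceph osd crush remove osd.8
ceph auth del osd.8
ceph osd rm 8
# wipe and re-prepare the disk (example device path):
ceph-disk zap /dev/sdX
ceph-disk prepare /dev/sdX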