Thanks Greg.

After seeing some recommendations I found in another thread, my impatience got the better of me and I've started the process again, but there is some logic, I promise :-) I've copied the process from Michael Kidd, I believe, and it goes along the lines of (rough commands at the end of this mail):

- set noup, noin, noscrub, nodeep-scrub, norecover, nobackfill
- stop all OSDs
- set all OSDs down & out
- set various options in ceph.conf to limit backfill activity etc.
- start all OSDs
- wait until all CPU settles to 0%   <-- I am here
- unset the noup flag
- wait until all CPU settles to 0%
- unset the noin flag
- wait until all CPU settles to 0%
- unset the nobackfill flag
- wait until all CPU settles to 0%
- unset the norecover flag
- remove the options from ceph.conf
- unset the noscrub flag
- unset the nodeep-scrub flag

Currently, host CPU usage is approximately the following, so something has changed, and I'm tempted to leave things a little longer before my next step, just in case the CPU does eventually stop spinning. I've read reports of things taking "a while" even with modern Xeons, so I suppose it's not outside the realms of possibility that an AMD Neo might take days to work things out. We're up to 24.5 hours now:

192.168.12.25  20%
192.168.12.26   1%
192.168.12.27  15%
192.168.12.28   1%
192.168.12.29  12%

Interesting, as 192.168.12.26 and .28 are the two which stopped spinning before I restarted this process too.

The number of different states is slightly less confusing now, but not by much :-)

788386/2591752 objects degraded (30.419%)
  90 stale+active+clean
   2 stale+down+remapped+peering
   2 stale+incomplete
   1 stale+active+degraded+remapped+wait_backfill+backfill_toofull
   1 stale+degraded
1255 stale+active+degraded
  32 stale+remapped+peering
 773 stale+active+remapped
   4 stale+active+degraded+remapped+backfill_toofull
1254 stale+down+peering
 278 stale+peering
  33 stale+active+remapped+backfill_toofull
 563 stale+active+degraded+remapped

> Well, you below indicate that osd.14's log says it crashed on an internal
> heartbeat timeout (usually, it got stuck waiting for disk IO or the
> kernel/btrfs hung), so that's why. The osd.12 process exists but isn't
> "up"; osd.14 doesn't even have a socket to connect to.

Ah, that does make sense, thank you.

> That's not what I'd expect to see (it appears to have timed out and not
> be recognizing it?) but I don't look at these things too often so maybe
> that's the normal indication that heartbeats are failing.

I'm not sure what this means either. A Google search for "heartbeat_map is_healthy FileStore::op_tp thread had timed out after" doesn't return much, but I did see this quote from Sage on what looks like a similar matter:

> - the filestore op_queue is blocked on the throttler (too much io queued)
> - the commit thread is also waiting for ops to finish
> - i see no actual thread processing the op_queue
>
> Usually that's because it hit a kernel bug and got killed. Not sure what
> else would make that thread disappear...
>
> sage

Oh!

> Also, you want to find out why they're dying. That's probably the root
> cause of your issues.

I have a sneaking suspicion it's BTRFS, but I don't have the evidence or perhaps the knowledge to prove it. If XFS did compression, I'd go with that, but at the moment I need to rely on compression to solve the problem of reclaiming space *within* files which reside on Ceph. As far as I remember, overwriting with zeros didn't re-do the thin provisioning on XFS, if that makes sense.
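For completeness, the flag steps above are just the standard cluster flags, so the commands look roughly like the sketch below. The ceph.conf throttle options shown are only illustrative examples of the "limit backfill activity" settings, not necessarily the exact values I'm using:

    # set the cluster flags before stopping the OSDs
    ceph osd set noup
    ceph osd set noin
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    ceph osd set norecover
    ceph osd set nobackfill

    # illustrative backfill/recovery throttles under [osd] in ceph.conf
    #   osd max backfills = 1
    #   osd recovery max active = 1
    #   osd recovery op priority = 1

    # then, after each "CPU settles to 0%" step, unset one flag at a time
    ceph osd unset noup
    ceph osd unset noin
    ceph osd unset nobackfill
    ceph osd unset norecover
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub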
Thanks again,
Chris