I'm sure there are many more useful things to graph. One of the things I'm
interested in (but haven't found time to research yet) is the journal
usage, with maybe some alerts if the journal is more than 90% full.
This is not likely to be an issue with the default journal config since the wbthrottle code is pretty aggressive about flushing the journal to avoid spiky client IO. Having said that, I tend to agree that we need to do a better job of documenting everything from the perf counters to the states described in dump_historic_ops. Even internally it can get confusing trying to keep track of what's going on where.
Mark
I've always had issues during deep-scrubbing, particularly when there is a lot of deep-scrubbing going on for a long time. For example, I left nodeep-scrub set for a month. Things were pretty painful when I unset it. Everything was fine at first, but after ~8 hours I started getting slow requests, and then OSDs were marked down for being unresponsive.
So "full journals" is just my most recent theory. I haven't figured out how to test my theory. I've tested (and fixed) a lot of other issues, which have made things better.
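One way the theory might be tested is to poll each OSD's perf counters while a big scrub or backfill is running and log how close the journal queue gets to its limit. The following is only a rough sketch under assumptions: it assumes FileStore OSDs that expose `journal_queue_bytes` and `journal_queue_max_bytes` under the "filestore" section of `ceph daemon osd.N perf dump` (check the actual output on your version, since counter names vary between releases), and it treats queue fill as a proxy for journal fullness rather than a direct measure of on-disk journal usage. The function names and the 0.9 threshold are hypothetical, chosen to match the "90% full" alert idea above.

```python
import json
import subprocess

# Assumed counter names in the "filestore" section of `perf dump`.
# Verify them against your own OSDs; they are not guaranteed to exist.
QUEUE_BYTES = "journal_queue_bytes"
QUEUE_MAX_BYTES = "journal_queue_max_bytes"


def journal_queue_ratio(perf_dump):
    """Return queued journal bytes as a fraction of the limit.

    Returns None if either counter is missing or the limit is zero.
    """
    filestore = perf_dump.get("filestore", {})
    queued = filestore.get(QUEUE_BYTES)
    limit = filestore.get(QUEUE_MAX_BYTES)
    if queued is None or not limit:
        return None
    return queued / limit


def is_nearly_full(perf_dump, threshold=0.9):
    """True if the journal queue is above the threshold (default 90%)."""
    ratio = journal_queue_ratio(perf_dump)
    return ratio is not None and ratio > threshold


def read_perf_dump(osd_id):
    """Read perf counters over the admin socket; run on the OSD's own host."""
    out = subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump"]
    )
    return json.loads(out)
```

Calling `is_nearly_full(read_perf_dump(3))` every few seconds from cron or a collectd-style plugin, and graphing `journal_queue_ratio` alongside the slow request count, should show whether the two actually line up.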
It's less of a problem now with journals on SSD, but it's something I ran into several times when my journals were on the HDD. With the SSD journals, if I do something that affects ~20% of my OSDs, I start having issues. I only have 5 nodes, and I can trigger this by re-formatting all of the OSDs on one node. I haven't (yet) had problems with smaller operations that affect less than 5% of my OSDs. My disks are 4TB, ~70% full, and a fresh format takes 24-48 hours to backfill.