On 09/22/2016 03:02 PM, Nathan Cutler wrote:
I've been researching OSD journals lately and just realized something that isn't particularly nice. SSDs are getting bigger and bigger; it's typical for customers to use, e.g., a 500GB SSD for their journals. If the OSDs themselves are on spinners, there is no point in having journals bigger than about 10GB, because the 7200 RPM spindle speed imposes a hard ceiling on the throughput the spinners can achieve.

Now, SSDs need wear leveling to avoid premature failure. If only a small region of the SSD is partitioned/used, users may fear (regardless of the reality, whatever it may be) that this small region will be "pummeled to death" by Ceph and cause the expensive SSD to fail prematurely. I *thought* this could be addressed by creating journal partitions large enough to fill the entire disk and using the "osd journal size" parameter to limit how much of that capacity is actually used for journaling, but I just noticed that the "osd journal size" parameter "is ignored if the journal is a block device, and the entire block device is used."
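(For the ~10GB figure: the rule of thumb in the Ceph docs is that "osd journal size" should be at least twice the product of the expected drive throughput and "filestore max sync interval". A minimal back-of-the-envelope sketch in Python follows; the 150 MB/s spinner throughput and the 5-second sync interval are assumed values for illustration, not measurements from any cluster.)

    # Back-of-the-envelope journal sizing (sketch; numbers are assumptions).
    spinner_throughput_mb_s = 150        # assumed sustained throughput of a 7200 RPM spinner
    filestore_max_sync_interval_s = 5    # Ceph's default filestore max sync interval

    # Rule of thumb: the journal should hold at least twice the data that
    # can accumulate between filestore syncs.
    min_journal_mb = 2 * spinner_throughput_mb_s * filestore_max_sync_interval_s

    print("minimum journal size: %d MB" % min_journal_mb)  # -> 1500 MB, well under 10GB

Even with generous margins that comes out an order of magnitude below a 500GB SSD, hence the wear-leveling worry above.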
Have you seen evidence that journal writes are hitting the same cells on SSDs when the journals are sized in the typical range? Without any evidence that wear leveling isn't functioning, it seems a little premature to me to create huge journals based on that fear.
And while working on http://tracker.ceph.com/issues/16878, it occurred to me that large journals are not getting tested much. Is that a valid assumption? Next week I plan to attempt to write a reproducer for the bug and then try to come up with a patch to fix it. Any ideas/pointers, either here or in the tracker, would be appreciated.