On 09/22/2016 03:02 PM, Nathan Cutler wrote:
I've been researching OSD journals lately and just realized something that isn't particularly nice. SSDs are getting bigger and bigger; it's typical for customers to use, e.g., a 500GB SSD for their journals. If the OSDs themselves are on spinners, there is no point in having journals bigger than about 10GB, because the 7200 RPM spindle speed imposes a hard ceiling on the throughput the spinners can achieve.

Now, SSDs need wear leveling to avoid premature failure. If only a small region of the SSD is partitioned/used, users may fear (regardless of the reality, whatever it may be) that this small region will be "pummeled to death" by Ceph and cause the expensive SSD to fail prematurely. I *thought* this could be addressed by creating journal partitions large enough to fill the entire disk and using the "osd journal size" parameter to limit how much of that capacity is actually used for journaling, but I just noticed that the "osd journal size" parameter "is ignored if the journal is a block device, and the entire block device is used."
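(For the ~10GB figure: the rule of thumb in the Ceph docs is that "osd journal size" should be at least twice the product of the expected drive throughput and "filestore max sync interval". A minimal back-of-the-envelope sketch in Python follows; the 150 MB/s spinner throughput and the 5-second sync interval are assumed values for illustration, not measurements from any cluster.)

    # Back-of-the-envelope journal sizing (sketch; numbers are assumptions).
    spinner_throughput_mb_s = 150        # assumed sustained throughput of a 7200 RPM spinner
    filestore_max_sync_interval_s = 5    # Ceph's default filestore max sync interval

    # Rule of thumb: the journal should hold at least twice the data that
    # can accumulate between filestore syncs.
    min_journal_mb = 2 * spinner_throughput_mb_s * filestore_max_sync_interval_s

    print("minimum journal size: %d MB" % min_journal_mb)  # -> 1500 MB, well under 10GB

Even with generous margins that comes out an order of magnitude below a 500GB SSD, hence the wear-leveling worry above.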
Have you seen evidence that journal writes are hitting the same cells on SSDs when the journals are sized in the typical range? Without any evidence that wear leveling isn't functioning, it seems a little premature to me to create huge journals based on that fear.
And while working on http://tracker.ceph.com/issues/16878, it occurred to me that large journals are not getting tested much. Is that a valid assumption? Next week I plan to attempt to write a reproducer for the bug and then try to come up with a patch to fix it. Any ideas/pointers, either here or in the tracker, would be appreciated.