Re: OSD journal sizing

> On 22 September 2016 at 22:02, Nathan Cutler <ncutler@xxxxxxx> wrote:
> 
> 
> I've been researching OSD journals lately, and just realized something 
> that's not particularly nice.
> 
> SSDs are getting bigger and bigger. It's typical for customers to use 
> e.g. a 500GB SSD for their journals. If the OSDs themselves are on 
> spinners, there is no use in having journals bigger than about 10GB, because 
> the 7200 RPM drives impose a hard ceiling on the throughput spinners can achieve.
> 
> Now, SSDs need wear-leveling to avoid premature failure. If only a small 
> region of the SSD is partitioned/used, users may fear (regardless of the 
> reality, whatever it may be) that this small region will be "pummeled to 
> death" by Ceph and cause the expensive SSD to fail prematurely.
> 

You are making the wrong assumption here. If you take a brand-new SSD that has never been written to and create just a few small 10GB partitions, the SSD's controller knows that all the other cells are still unused.

Thanks to wear-leveling it will remap cells internally, so you will not be hammering the same physical cells over and over.

Bigger SSDs simply have more spare cells to rotate through, and therefore a longer lifespan.

You can use just a fraction of the disk by creating partitions, or use hdparm to set an HPA (Host Protected Area), which "shrinks" the SSD.

The SSD will then present itself as, for example, a 50GB SSD while it is physically a 500GB SSD.
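As a sketch of the HPA route (the device path and sizes here are illustrative, and changing the HPA is destructive, so the hdparm command is only printed rather than executed):

```shell
# Hide most of a 500GB SSD behind a Host Protected Area so that only
# ~50GB remains visible for journal partitions.

# Sector count for a 50GB visible area (512-byte logical sectors)
sectors=$((50 * 1000 * 1000 * 1000 / 512))
echo "$sectors"   # 97656250

# hdparm -N sets the max visible sector; the "p" prefix makes the new
# max address persist across power cycles. /dev/sdX is a placeholder --
# double-check the target device before running anything like this.
echo "hdparm -N p${sectors} /dev/sdX"
```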

Using hdparm you can also reset an SSD by telling it to return ALL its cells to the erased state. That way the wear-leveling is also reset.
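For reference, the full reset described above corresponds to an ATA Secure Erase, which hdparm can issue. This is a sketch only: it wipes the entire drive, /dev/sdX is a placeholder, and the drive must not be security-frozen (many BIOSes freeze it at boot).

```shell
# Check the security state first; the output should say "not frozen"
hdparm -I /dev/sdX | grep -i frozen

# Set a temporary security password, then issue the erase.
# This returns every cell to the erased state, so wear-leveling
# starts from a clean slate again.
hdparm --user-master u --security-set-pass p /dev/sdX
hdparm --user-master u --security-erase p /dev/sdX
```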

FYI, this information comes mainly from working with Intel DC-series SSDs.

Wido

> I *thought* this could be addressed by creating the journal partitions 
> large enough to fill the entire disk and using the "osd journal size" 
> parameter to limit how much disk capacity is actually used for 
> journaling, but now I just noticed that the "osd journal size" parameter 
> "is ignored if the journal is a block device, and the entire block 
> device is used."
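To make the quoted limitation concrete: "osd journal size" (in MB) only applies to file-backed journals; with a raw partition the whole partition is used, so the partition itself has to be sized. A sketch, assuming a GPT-labelled /dev/sdX (placeholder device, do not run as-is):

```shell
# ceph.conf -- honoured only when the journal is a file, not a block device:
#   [osd]
#   osd journal size = 10240    # MB

# For a block-device journal, size the partition itself instead.
# Partition number 1 is a placeholder; the GUID is the Ceph journal
# partition type used by ceph-disk.
sgdisk --new=1:0:+10G \
       --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdX
```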
> 
> And while working on http://tracker.ceph.com/issues/16878 it occurred to 
> me that large journals are not getting tested much. Is that a valid 
> assumption?
> 
> Next week I plan to attempt to make a reproducer for the bug, and then 
> try to come up with a patch to fix it. Any ideas/pointers, either here 
> or in the tracker, would be appreciated.
> 
> -- 
> Nathan Cutler
> Software Engineer Distributed Storage
> SUSE LINUX, s.r.o.
> Tel.: +420 284 084 037
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html