Re: SSD Hardware recommendation

Hello,

On Wed, 18 Mar 2015 11:41:17 +0100 Francois Lafont wrote:

> Hi,
> 
> Christian Balzer wrote:
> 
> > Consider what you think your IO load (writes) generated by your
> > client(s) will be, multiply that by your replication factor, divide by
> > the number of OSDs, that will give you the base load per OSD. 
> > Then multiply by 2 (journal on OSD) per OSD.
> > Finally based on my experience and measurements (link below) multiply
> > that by at least 6, probably 10 to be on safe side. Use that number to
> > find the SSD that can handle this write load for the time period
> > you're budgeting that cluster for.
> > http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html
> 
> Thanks Christian for these interesting explanations. I have read your link
> and I'd like to understand why the write amplification is greater than
> the replication factor. For me, in theory, the write amplification
> should be approximately equal to the replication factor. What are the
> reasons for this difference?
> 
> Er... in fact, after thinking about it a little, I imagine that 1 write
> IO on the client side becomes 2*R IOs on the cluster side (where R is the
> replication factor), because there are R IOs for the OSDs and R IOs for
> the journals. So, with R = 2, I can imagine a write amplification equal
> to 4, but I don't understand why it's 5 or 6. Is it possible to have an
> explanation for this?
> 
You're asking the wrong person, as I'm neither a Ceph nor a kernel
developer. ^o^
Back then Mark Nelson from the Ceph team didn't expect to see those
numbers either, but both Mark Wu and I saw them.

Anyway, let's start with the basics and the things that can be understood
without any detailed knowledge.

Assume a cluster with 2 nodes, 10 OSDs each, and a replication factor of 2
(since we're talking about an SSD cluster here, and to keep things related
to the OP's question).

Now a client writes 40MB of data to the cluster.
Assuming an ideal scenario where all PGs are evenly distributed (they won't
be) and this is entirely fresh data (resulting in 10 4MB Ceph objects), each
of the 20 OSDs will receive one 4MB write (10 primary copies plus 10
replica copies, spread across the cluster).
With the journal on the same SSD as the OSD data (currently the best
approach based on tests), we get a write amplification of 2, as the data is
written both to the journal and to the actual storage space.
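
To put that arithmetic into a form you can play with, here is a minimal
back-of-the-envelope sketch in Python (the numbers are just the example
above, not measurements):

    # Per-OSD write load for the example cluster above.
    # All values are illustrative assumptions, not measured data.
    client_write_mb = 40        # data written by the client
    replication     = 2         # pool size
    osds            = 20        # 2 nodes x 10 OSDs
    journal_factor  = 2         # journal on the same SSD as the OSD data

    cluster_write_mb = client_write_mb * replication      # 80 MB hit the OSDs
    per_osd_data_mb  = cluster_write_mb / osds             # 4 MB per OSD
    per_osd_total_mb = per_osd_data_mb * journal_factor    # 8 MB written per SSD

    print(per_osd_data_mb, per_osd_total_mb)               # 4.0 8.0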

But as my results in the link above show, that very much depends on the
write size. With a 4MB block size (the ideal size for default RBD pools and
objects) I saw even slightly less than the expected 2x amplification; I
assume that was due to caching and PG imbalances.

Now, my guess as to what happens with small (4KB) writes is that they do
not coalesce sufficiently before being written to the object on the OSD.
So up to 1000 4KB writes could land in that one 4MB object (clearly it is
much less than that in practice, but how much less I can't tell), resulting
in the same "blocks" being rewritten several times.

There's also the journaling done by the respective file system (I used ext4
during that test); while there are bound to be differences between file
systems, in a worst-case scenario that could add another 2x write
amplification (FS journal plus the actual file).

In addition, Ceph updates various other data such as the omap leveldb and
metadata; quantifying that, however, would require a much more detailed
analysis or familiarity with the Ceph code.
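
Stacking those per-SSD factors with some assumed values shows how quickly
you leave the naive 2x behind (again, rough arithmetic, which is why the
rule of thumb above says "at least 6, probably 10" rather than an exact
figure):

    # Rough stacking of per-SSD amplification sources (assumed values).
    journal          = 2.0   # OSD journal on the same SSD
    fs_journal       = 1.5   # ext4 journaling, between 1x and the 2x worst case
    small_io_rewrite = 2.0   # non-coalesced small writes rewriting object regions (a guess)
    misc_overhead    = 1.1   # omap leveldb, metadata updates, etc. (a guess)

    per_ssd = journal * fs_journal * small_io_rewrite * misc_overhead
    print(round(per_ssd, 1))  # 6.6 -- already in the "at least 6" territory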

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



