Re: SSD Hardware recommendation

Francois Lafont <flafdivers@xxxxxxx> · Mon, 23 Mar 2015 02:33:20 +0100

Hi,

Sorry Christian for my late answer. I was a little busy.

Christian Balzer wrote:

> You're asking the wrong person, as I'm neither a Ceph nor a kernel
> developer. ^o^

No, no, the rest of the message proves to me that I am talking to the
right person. ;)

> Back then Mark Nelson from the Ceph team didn't expect to see those
> numbers as well, but both Mark Wu and I saw them.
> 
> Anyways, let's start with the basics and things that are understandable
> without any detailed knowledge.
> 
> Assume a cluster with 2 nodes, 10 OSDs each and a replication of 2 (Since
> we're talking about SSD cluster here and keep things related to the
> question of the OP).
> 
> Now a client writes 40MB of data to the cluster.
> Assuming an ideal scenario where all PGs are evenly distributed (they won't
> be) and this is totally fresh data (resulting in 10 4MB Ceph objects), this
> would mean that each OSD will receive 4MB (10 primary PGs, 10 secondary
> ones).
> With journals on the same SSD (currently the best way based on tests), we
> get a write amplification of 2, as that data is written both to the
> journal and the actual storage space.
> 
> But as my results in the link above showed, that is very much dependent on
> the write size. With a 4MB block size (the ideal size for default RBD
> pools and objects) I saw even slightly less than the 2x amplifications
> expected, I assume that was due to caching and PG imbalances.
> 
> Now my guess what happens with small (4KB) writes is that all these small
> writes do not coalesce sufficiently before being written to the object on
> the OSD. 
> So up to 1000 4KB writes could happen to that 4MB object (clearly it is
> much less than that, but how much I can't tell), resulting in the same
> "blocks" being rewritten several times.

OK, if I understand correctly, with replication == 2 and journals on the same
disks as the OSDs (I assume that we are talking about storage via block device):

1. in theory there is a write amplification (between the client side and the
OSD backend side) equal to 2 x #replication = 4, because each replica is
written first to the journal and then to the OSD storage.

2. but in practice you notice that this amplification factor depends on
the write size and, for instance, with lots of small I/Os on the client
side, you sometimes see a factor of 5 or 6 (instead of 4) because in
the OSDs, blocks are rewritten several times (a small write on the client
side can trigger a bigger write in the OSD backend storage).

Did I summarize the phenomenon correctly?
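
To make the arithmetic concrete, here is a small Python sketch of the
theoretical numbers as I understand them. It is only a back-of-the-envelope
model: the function names are mine (nothing here is a Ceph API), and it
assumes the ideal case you describe (perfectly even distribution, journal on
the same SSD as the OSD data).

```python
# Back-of-the-envelope model of the theoretical numbers in this thread.
# All names are illustrative; none of this is a Ceph API.

MB = 1024 * 1024
KB = 1024


def theoretical_amplification(replication, journal_factor=2):
    """Bytes hitting the OSD SSDs per byte written by the client.

    journal_factor=2 assumes each replica is written twice on the same
    SSD: once to the journal, once to the actual storage space.
    """
    return replication * journal_factor


def per_osd_payload(client_bytes, replication, num_osds):
    """Payload received by each OSD, assuming a perfectly even spread."""
    return client_bytes * replication / num_osds


# 2 nodes x 10 OSDs, replication 2, client writes 40MB of fresh data.
amp = theoretical_amplification(replication=2)
payload = per_osd_payload(40 * MB, replication=2, num_osds=20)

# Number of distinct 4KB writes that fit in one 4MB object.
writes_per_object = (4 * MB) // (4 * KB)

print(amp)                # 4
print(payload / MB)       # 4.0
print(writes_per_object)  # 1024
```

So each OSD receives 4MB of payload and writes 8MB to its SSD, which gives
the factor of 4 seen from the client side; anything above that would come
from the effects you describe with small writes.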

> There's also the journaling done by the respective file system (I used
> ext4 during that test) and while there are bound to be some differences in
> a worst case scenario that could result in another 2x write amplification
> (FS journal and actual file).
> 
> In addition Ceph updates various files like the omap leveldb and
> meta-data, quantifying that however would require much more detailed
> analysis or familiarity with the Ceph code.
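
If I take that worst case literally, I can stack it on top of the factor of 4
from above. The sketch below is only my own naive upper bound, not a
measurement: it assumes the file system journal doubles every write that goes
through the file system, and I distinguish whether the Ceph journal itself
sits on the file system or not (my assumption being that a journal on a raw
partition is not affected by the FS journal). The function name is made up.

```python
def stacked_amplification(replication, ceph_journal_on_fs, fs_factor=2):
    """Naive worst-case amplification per byte written by the client.

    fs_factor=2 is the worst case quoted above (FS journal + actual file).
    This simply multiplies/adds factors; real behaviour is surely lower.
    """
    # Write to the actual object file always goes through the file system.
    data_write = fs_factor
    # The Ceph journal write is doubled only if the journal is a file on
    # the file system (assumption: a raw partition bypasses the FS journal).
    journal_write = fs_factor if ceph_journal_on_fs else 1
    return replication * (data_write + journal_write)


print(stacked_amplification(2, ceph_journal_on_fs=False))  # 6
print(stacked_amplification(2, ceph_journal_on_fs=True))   # 8
```

That gives a worst case between 6 and 8 with replication == 2, before
counting the omap leveldb and meta-data updates you mention.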

OK. Thank you very much, Christian, for taking the time to explain
your experience to me. :)

Regards.
François Lafont

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com