Re: SSD Hardware recommendation

On Mon, 23 Mar 2015 02:33:20 +0100 Francois Lafont wrote:

> Hi,
> 
> Sorry Christian for my late answer. I was a little busy.
> 
> Christian Balzer wrote:
> 
> > You're asking the wrong person, as I'm neither a Ceph nor a kernel
> > developer. ^o^
> 
> No, no, the rest of the message proves to me that I talk to the
> right person. ;)
> 
> > Back then Mark Nelson from the Ceph team didn't expect to see those
> > numbers as well, but both Mark Wu and I saw them.
> > 
> > Anyway, let's start with the basics and things that are understandable
> > without any detailed knowledge.
> > 
> > Assume a cluster with 2 nodes, 10 OSDs each, and a replication of 2
> > (since we're talking about an SSD cluster here and keeping things
> > related to the question of the OP).
> > 
> > Now a client writes 40MB of data to the cluster.
> > Assuming an ideal scenario where all PGs are evenly distributed (they
> > won't be) and this is totally fresh data (resulting in 10 4MB Ceph
> > objects), this would mean that each OSD will receive 4MB (10 primary
> > PGs, 10 secondary ones).
> > With journals on the same SSD (currently the best way based on tests),
> > we get a write amplification of 2, as that data is written both to the
> > journal and the actual storage space.
> > 
> > But as my results in the link above showed, that is very much
> > dependent on the write size. With a 4MB block size (the ideal size for
> > default RBD pools and objects) I saw even slightly less than the
> > expected 2x amplification; I assume that was due to caching and PG
> > imbalances.
> > 
> > Now my guess as to what happens with small (4KB) writes is that all
> > these small writes do not coalesce sufficiently before being written
> > to the object on the OSD.
> > So up to 1000 4KB writes could happen to that 4MB object (clearly it
> > is much less than that, but how much I can't tell), resulting in the
> > same "blocks" being rewritten several times.
> 
> OK, if I understand correctly, with replication == 2 and journals on the
> same disks as the OSDs (I assume we are talking about storage via block
> device):
> 
> 1. in theory there is a "write" amplification (between the client side
> and the OSD backend side) equal to 2 x #replication = 4, because data is
> written first to the journal and then to the OSD storage.
> 

The write amplification on a cluster-wide basis is like that, but we're
only interested in the amplification at the OSD level (SSD wear-out), and
there it should be 2 (journal and data).

Also keep in mind that the OSD distribution isn't perfect and that you
might have some very hot data (frequently written) that resides in only
one Ceph object (4MB), thus on just one PG, hitting only 3 OSDs
(replication of 3) all the time while other OSDs see much less usage.

In theory that should average out, in practice you might see quite some
variations in how busy (data written) OSDs are.
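
To spell out the bookkeeping, here is a trivial sketch (Python; the
variable names are mine, and it assumes the perfectly even PG
distribution of the example above):

---
# Back-of-the-envelope numbers for the example: 2 nodes, 10 OSDs each,
# replication 2, journal on the same SSD as the data.
client_mb   = 40          # the client writes 40MB
replication = 2
journal     = 2           # each OSD writes data twice: journal, then store

# Cluster-wide, every client byte is written replication * journal times.
print(replication * journal)              # 4

# Per OSD (what matters for SSD wear-out): each of the 20 OSDs receives
# client_mb * replication / 20 and doubles it locally.
received_mb = client_mb * replication / 20
print(received_mb, received_mb * journal) # 4.0 received, 8.0 written: 2x
---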

> 2. but in practice you notice that this amplification factor depends on
> the write size and, for instance, with lots of small I/Os on the client
> side, you sometimes notice a factor of 5 or 6 (instead of 4) because, in
> the OSDs, blocks are rewritten several times (a small write on the
> client side can trigger bigger writes in the OSD backend storage).
> 
Well, the example setup had only one node and thus a replication of 1, so
the expected total write amplification was 2. To quote:
---
Now if we run that same test with a block size of 4KB one gets:
Total time run:         30.033196
Total writes made:      126944
Write size:             4096
Bandwidth (MB/sec):     16.511 

This makes about 508MB written or roughly 64MB per OSD.
According to the SMART values of the SSDs they wrote 768MB each or in
other words 6 times more than one would have expected with a write
amplification of 2. 
---

So what I expected was around 128MB per OSD (a write amplification of 2),
but what was actually written was 768MB, an amplification of 12, or 6
times more than expected.

> Did I summarize the phenomenon well?
> 
> > There's also the journaling done by the respective file system (I used
> > ext4 during that test), and while there are bound to be some
> > differences between file systems, in a worst-case scenario that could
> > result in another 2x write amplification (FS journal and actual file).
> > 
> > In addition, Ceph updates various files like the omap leveldb and
> > metadata; quantifying that, however, would require much more detailed
> > analysis or familiarity with the Ceph code.
> 
> OK. Thank you very much, Christian, for taking the time to explain your
> experience to me. :)
> 
If you look at the current "Cache Tier Flush = immediate base tier journal
sync?" thread, there Greg kinda confirms that all these updates might be
responsible. 
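
For anyone who wants to reproduce the SMART-based measurement: below is a
minimal sketch of reading the drive's write counter before and after a
benchmark. It assumes smartmontools is installed and that the drive
reports attribute 241 as Total_LBAs_Written in 512-byte units; vendors
differ (some count in 32MiB chunks or report host writes in GiB), so
check your drive's documentation first.

---
import re
import subprocess

def lbas_written(dev):
    # Parse the raw value of Total_LBAs_Written from `smartctl -A`.
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    m = re.search(r"Total_LBAs_Written.*?(\d+)\s*$", out, re.MULTILINE)
    return int(m.group(1)) if m else None

before = lbas_written("/dev/sda")
# ... run the benchmark here ...
after = lbas_written("/dev/sda")
if before is not None and after is not None:
    print((after - before) * 512 / 2**20, "MiB written to the device")
---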

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/