Re: SSD Hardware recommendation

Alexandre DERUMIER <aderumier@xxxxxxxxx> · Mon, 23 Mar 2015 07:36:48 +0100 (CET)

Hi,

Isn't it in the nature of SSDs to have write amplification?

Generally, they have an erase block size of 128k,

so the worst case could be 128/4 = 32x write amplification.

(of course ssd algorithms and optimisations reduce this write amplification).
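
The worst-case figure above is simply the ratio of erase block size to write size. A minimal sketch of that arithmetic, assuming a 128 KiB erase block and 4 KiB writes as in this message; the function name is made up for illustration, and a real SSD controller should do much better than this bound:

```python
def worst_case_write_amplification(erase_block_kib: float, write_kib: float) -> float:
    """Upper bound if every small write forced a rewrite of a whole erase block."""
    if erase_block_kib <= 0 or write_kib <= 0:
        raise ValueError("sizes must be positive")
    # A write at least as large as the erase block is not amplified by this effect.
    return max(1.0, erase_block_kib / write_kib)


print(worst_case_write_amplification(128, 4))  # 32.0
```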

Now, it would be great to see whether it's coming from the OSD journal or the OSD data.

(not tested, but I think that with the journal and O_DSYNC writes, it can give us SSD write amplification)

----- Original Message -----
From: "Christian Balzer" <chibi@xxxxxxx>
To: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Monday, 23 March 2015 03:11:39
Subject: Re: SSD Hardware recommendation

On Mon, 23 Mar 2015 02:33:20 +0100 Francois Lafont wrote: 

> Hi, 
> 
> Sorry Christian for my late answer. I was a little busy. 
> 
> Christian Balzer wrote:
> 
> > You're asking the wrong person, as I'm neither a Ceph nor a kernel
> > developer. ^o^
> 
> No, no, the rest of the message proves to me that I talk to the 
> right person. ;) 
> 
> > Back then Mark Nelson from the Ceph team didn't expect to see those 
> > numbers as well, but both Mark Wu and I saw them. 
> > 
> > Anyways, let's start with the basics and things that are understandable
> > without any detailed knowledge.
> > 
> > Assume a cluster with 2 nodes, 10 OSDs each and a replication of 2 
> > (Since we're talking about SSD cluster here and keep things related to 
> > the question of the OP). 
> > 
> > Now a client writes 40MB of data to the cluster. 
> > Assuming an ideal scenario where all PGs are evenly distributed (they 
> > won't be) and this is totally fresh data (resulting in 10 4MB Ceph 
> > objects), this would mean that each OSD will receive 4MB (10 primary 
> > PGs, 10 secondary ones). 
> > With journals on the same SSD (currently the best way based on tests), 
> > we get a write amplification of 2, as that data is written both to the 
> > journal and the actual storage space. 
> > 
> > But as my results in the link above showed, that is very much 
> > dependent on the write size. With a 4MB block size (the ideal size for 
> > default RBD pools and objects) I saw even slightly less than the 2x 
> > amplifications expected, I assume that was due to caching and PG 
> > imbalances. 
> > 
> > Now my guess as to what happens with small (4KB) writes is that all these
> > small writes do not coalesce sufficiently before being written to the
> > object on the OSD.
> > So up to 1000 4KB writes could happen to that 4MB object (clearly it is
> > much less than that, but how much I can't tell), resulting in the same
> > "blocks" being rewritten several times.
> 
> Ok, if I understand well, with replication == 2 and journals on the same
> disks as the OSDs (I assume that we are talking about storage via block
> device):
> 
> 1. in theory there is a "write" amplification (between the client side 
> and the OSDs backend side) equal to 2 x #replication = 4, because data 
> is written in the journal and after in the OSD storage. 
> 

The write amplification on a cluster wide basis is like that, but we're 
only interested in the amplification on the OSD level (SSD wearout) and 
there it should be 2 (journal and data). 
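
To make the two levels concrete, here is a rough sketch of the idealized example quoted earlier (40MB from the client, 20 OSDs, replication of 2, journal on the same SSD). The function and its parameter names are hypothetical, and it assumes a perfectly even spread, which a real cluster won't have:

```python
def write_totals(client_mb, osds, replication, journal_on_same_ssd=True):
    """Return (MB received per OSD, MB written per OSD, cluster-wide amplification)."""
    journal_factor = 2 if journal_on_same_ssd else 1
    received_per_osd = client_mb * replication / osds    # replicated data reaching each OSD
    written_per_osd = received_per_osd * journal_factor  # journal write plus data write
    cluster_amplification = written_per_osd * osds / client_mb
    return received_per_osd, written_per_osd, cluster_amplification


received, written, cluster_amp = write_totals(client_mb=40, osds=20, replication=2)
print(received, written, cluster_amp)  # 4.0 8.0 4.0
print(written / received)              # 2.0, the per-OSD factor that matters for SSD wear
```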

Also keep in mind that the OSD distribution isn't perfect and 
that you might have some very hot data (frequently written) that resides 
only in one Ceph object (4MB), so just on one PG and thus hitting only 3 
OSDs (replication of 3) all the time, while other OSDs see much less usage. 

In theory that should average out, in practice you might see quite some 
variations in how busy (data written) OSDs are. 

> 2. but in practice you notice that this amplification factor depends
> on the write size and, for instance, with a lot of small I/O on the client
> side, you sometimes notice a factor of 5 or 6 (instead of 4) because in
> the OSDs, blocks are rewritten several times (a small write on the
> client side can trigger a bigger write in the OSD backend storage).
> 
Well, the example setup had only one node and thus replication of 1, so the 
expected total write amplification was 2. To quote: 
--- 
Now if we run that same test with a block size of 4KB one gets: 
Total time run: 30.033196 
Total writes made: 126944 
Write size: 4096 
Bandwidth (MB/sec): 16.511 

This makes about 508MB written or roughly 64MB per OSD. 
According to the SMART values of the SSDs they wrote 768MB each or in 
other words 6 times more than one would have expected with a write 
amplification of 2. 
--- 

So what I expected was around 128MB per OSD (write amplification of 2), 
but what was actually written was 768MB, an amplification of 12, 6 times 
more than expected. 
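
The figures above can be rechecked with a few lines of arithmetic. This is only a sketch using the numbers quoted in this message; the variable names are made up, and the per-OSD share of 64MB is taken as rounded in the quote rather than derived:

```python
writes_made = 126944      # "Total writes made" in the quoted output
write_size_kb = 4         # "Write size: 4096" bytes
total_mb = writes_made * write_size_kb / 1000
print(round(total_mb))    # 508

client_mb_per_osd = 64    # per-OSD share, as rounded in the message
smart_mb_per_ssd = 768    # written per SSD according to SMART

expected_mb = 2 * client_mb_per_osd                  # journal plus data
observed_amp = smart_mb_per_ssd / client_mb_per_osd  # versus what the client sent
print(expected_mb, observed_amp, smart_mb_per_ssd / expected_mb)  # 128 12.0 6.0
```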

> Did I summarize the phenomenon well?
> 
> > There's also the journaling done by the respective file system (I used
> > ext4 during that test), and while there are bound to be some
> > differences, in a worst case scenario that could result in another 2x
> > write amplification (FS journal and actual file).
> > 
> > In addition Ceph updates various files like the omap leveldb and 
> > meta-data, quantifying that however would require much more detailed 
> > analysis or familiarity with the Ceph code. 
> 
> Ok. Thank you very much Christian for taking the time to explain
> your experience to me. :)
> 
If you look at the current "Cache Tier Flush = immediate base tier journal 
sync?" thread, there Greg kinda confirms that all these updates might be 
responsible. 

Regards, 

Christian 
-- 
Christian Balzer Network/Systems Engineer 
chibi@xxxxxxx Global OnLine Japan/Fusion Communications 
http://www.gol.com/ 
_______________________________________________ 
ceph-users mailing list 
ceph-users@xxxxxxxxxxxxxx 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
