Re: SSD Hardware recommendation

>> the combination of all the
>> things mentioned before in the Ceph/FS stack caused a 12x amplification
>> (instead of 2x) _before_ hitting the SSD.

Oh, OK, pretty strange.

BTW, is it through CephFS, or RBD/RADOS?

----- Original Message -----
From: "Christian Balzer" <chibi@xxxxxxx>
To: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Cc: "aderumier" <aderumier@xxxxxxxxx>
Sent: Monday, 23 March 2015 08:29:03
Subject: Re: SSD Hardware recommendation

Hello, 

Again refer to my original, old mail: 

http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html 

I was strictly looking at the SMART values, in the case of these
Intel DC S3700 SSDs the "Host_Writes_32MiB" attribute.
Which, according to what the name implies and all the references I could
find, means exactly that: the writes from the host (the SATA controller) to
the actual SSD.
So no matter what optimizations the SSD does itself and whatever else might
be possible with things like O_DSYNC, the combination of all the
things mentioned before in the Ceph/FS stack caused a 12x amplification 
(instead of 2x) _before_ hitting the SSD. 
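
(For illustration, a rough and untested sketch of how one could watch that
SMART counter around a benchmark run. The device paths and the per-SSD client
write size are made-up placeholders; only the "Host_Writes_32MiB" attribute
name comes from the drives mentioned above.)

import subprocess

def host_writes_bytes(device):
    # Read smartctl's attribute table and convert the raw value, which on
    # these drives counts units of 32 MiB, into bytes.
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if "Host_Writes_32MiB" in line:
            return int(line.split()[-1]) * 32 * 1024 * 1024  # RAW_VALUE is last
    raise RuntimeError("Host_Writes_32MiB not reported by " + device)

devices = ["/dev/sda", "/dev/sdb"]          # placeholder OSD SSDs
before = {d: host_writes_bytes(d) for d in devices}
# ... run the benchmark (e.g. rados bench) here ...
after = {d: host_writes_bytes(d) for d in devices}

client_bytes_per_ssd = 64 * 1024 * 1024     # assumed data sent to each SSD
for d in devices:
    delta = after[d] - before[d]
    print("%s: %d MiB host writes, %.1fx amplification"
          % (d, delta // 2**20, delta / client_bytes_per_ssd))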

And that's where optimizations in Ceph and other components, maybe 
avoiding a FS altogether, will be very helpful and welcome. 

Regards, 

Christian 

On Mon, 23 Mar 2015 07:49:41 +0100 (CET) Alexandre DERUMIER wrote: 

> >> (not tested, but I think that with the journal and its O_DSYNC writes,
> >> it can give us SSD write amplification)
> 
> Also, I think that enterprise SSDs with a supercapacitor should be able to
> cache these O_DSYNC writes in the SSD buffer and do bigger writes to
> reduce amplification.
>
> I don't know how the SSDs' internal algorithms handle this.
> 
> 
> ----- Original Message -----
> From: "aderumier" <aderumier@xxxxxxxxx>
> To: "Christian Balzer" <chibi@xxxxxxx>
> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> Sent: Monday, 23 March 2015 07:36:48
> Subject: Re: SSD Hardware recommendation
> 
> Hi, 
> 
> Isn't it in the nature of SSDs to have write amplification?
> 
> Generally, they have an erase block size of 128k,
>
> so the worst case could be 128/4 = 32x write amplification.
>
> (Of course, SSD algorithms and optimisations reduce this write
> amplification.)
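>
> (A minimal worked example of that worst case, using just the numbers above;
> nothing here is measured, it is only the arithmetic.)
>
> erase_block = 128 * 1024           # bytes per erase block
> write_size = 4 * 1024              # bytes per small random write
> # Worst case: every small write forces a whole erase block to be rewritten.
> print(erase_block // write_size)   # -> 32, i.e. 32x write amplification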
> 
> Now, it would be great to see whether it's coming from the OSD journal or
> the OSD data.
> 
> (not tested, but I think that with the journal and its O_DSYNC writes, it
> can give us SSD write amplification)
> 
> 
> ----- Original Message -----
> From: "Christian Balzer" <chibi@xxxxxxx>
> To: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> Sent: Monday, 23 March 2015 03:11:39
> Subject: Re: SSD Hardware recommendation
> 
> On Mon, 23 Mar 2015 02:33:20 +0100 Francois Lafont wrote: 
> 
> > Hi, 
> > 
> > Sorry Christian for my late answer. I was a little busy. 
> > 
> > Christian Balzer wrote:
> > 
> > > You're asking the wrong person, as I'm neither a Ceph nor a kernel
> > > developer. ^o^
> > 
> > No, no, the rest of the message proves to me that I'm talking to the
> > right person. ;)
> > 
> > > Back then Mark Nelson from the Ceph team didn't expect to see those
> > > numbers either, but both Mark Wu and I saw them.
> > > 
> > > Anyway, let's start with the basics and things that are
> > > understandable without any detailed knowledge.
> > > 
> > > Assume a cluster with 2 nodes, 10 OSDs each, and a replication of 2
> > > (since we're talking about an SSD cluster here, and to keep things
> > > related to the question of the OP).
> > > 
> > > Now a client writes 40MB of data to the cluster. 
> > > Assuming an ideal scenario where all PGs are evenly distributed 
> > > (they won't be) and this is totally fresh data (resulting in 10 4MB 
> > > Ceph objects), this would mean that each OSD will receive 4MB (10 
> > > primary PGs, 10 secondary ones). 
> > > With journals on the same SSD (currently the best way based on 
> > > tests), we get a write amplification of 2, as that data is written 
> > > both to the journal and the actual storage space. 
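> > > 
> > > (A small sketch of the arithmetic in that example, using only the
> > > numbers given above.)
> > > 
> > > client_write = 40 * 2**20        # 40MB written by the client
> > > replication = 2
> > > osds = 2 * 10                    # 2 nodes x 10 OSDs
> > > journal_factor = 2               # journal + data on the same SSD
> > > 
> > > per_osd = client_write * replication / osds
> > > print(per_osd / 2**20)                    # -> 4.0 (MB received per OSD)
> > > print(per_osd * journal_factor / 2**20)   # -> 8.0 (MB written per OSD)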
> > > 
> > > But as my results in the link above showed, that is very much 
> > > dependent on the write size. With a 4MB block size (the ideal size 
> > > for default RBD pools and objects) I saw even slightly less than the
> > > 2x amplification expected; I assume that was due to caching and PG
> > > imbalances.
> > > 
> > > Now my guess as to what happens with small (4KB) writes is that all
> > > these small writes do not coalesce sufficiently before being written to
> > > the object on the OSD.
> > > So up to 1000 4KB writes could happen to that 4MB object (clearly it is
> > > much less than that, but how much less I can't tell), resulting in the
> > > same "blocks" being rewritten several times.
> > 
> > OK, if I understand well, with replication == 2 and journals on the same
> > disks as the OSDs (I assume that we are talking about storage via a
> > block device):
> >
> > 1. In theory there is a "write" amplification (between the client side
> > and the OSD backend side) equal to 2 x #replication = 4, because data
> > is written to the journal and afterwards to the OSD storage.
> > 
> 
> The write amplification on a cluster-wide basis is like that, but we're
> only interested in the amplification at the OSD level (SSD wearout), and
> there it should be 2 (journal and data).
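> 
> (A one-line illustration of that distinction, assuming replication 2 and
> the journal co-located with the data, as above.)
> 
> replication, journal_factor = 2, 2
> print(replication * journal_factor)  # -> 4: cluster-wide, each client byte lands as 4 SSD writes
> print(journal_factor)                # -> 2: per OSD, each OSD writes its own share twice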
> 
> Also keep in mind both that the OSD distribution isn't perfect and that
> you might have some very hot data (frequently written) that resides in
> only one Ceph object (4MB), so just on one PG, and thus hits only 3
> OSDs (replication of 3) all the time, while other OSDs see much less
> usage.
> 
> In theory that should average out; in practice you might see quite some
> variation in how busy (in terms of data written) the OSDs are.
> 
> > 2. But in practice you notice that this factor of amplification
> > depends on the write size and, for instance, with lots of little I/O on
> > the client side, you sometimes notice a factor of 5 or 6 (instead of
> > 4) because, in the OSDs, blocks are rewritten several times (a little
> > write on the client side can trigger a bigger write in the OSD backend
> > storage).
> > 
> Well, the example setup had only one node and thus replication of 1, so 
> the expected total write amplification was 2. To quote: 
> --- 
> Now if we run that same test with a block size of 4KB one gets: 
> Total time run: 30.033196 
> Total writes made: 126944 
> Write size: 4096 
> Bandwidth (MB/sec): 16.511 
> 
> This makes about 508MB written or roughly 64MB per OSD. 
> According to the SMART values of the SSDs they wrote 768MB each or in 
> other words 6 times more than one would have expected with a write 
> amplification of 2. 
> --- 
> 
> So what I expected was around 128MB per OSD (write amplification of 2),
> but what was actually written was 768MB, an amplification of 12, 6 times
> more than expected.
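> 
> (A quick sanity check of those figures, using only the numbers quoted
> above; nothing new is measured here.)
> 
> per_osd = 64            # MB pushed to each OSD, from "roughly 64MB per OSD"
> expected = per_osd * 2  # journal + data -> ~128MB expected per SSD
> actual = 768            # MB per SSD, according to the SMART counters
> print(actual / per_osd)   # -> 12.0, the observed amplification
> print(actual / expected)  # -> 6.0, i.e. 6 times more than expected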
> 
> > Did I summarize the phenomenon well?
> > 
> > > There's also the journaling done by the respective file system (I
> > > used ext4 during that test) and, while there are bound to be some
> > > differences, in a worst-case scenario that could result in another 2x
> > > write amplification (FS journal and actual file).
> > > 
> > > In addition, Ceph updates various files like the omap leveldb and
> > > meta-data; quantifying that, however, would require much more detailed
> > > analysis or familiarity with the Ceph code.
> > 
> > OK. Thank you very much, Christian, for taking the time to explain
> > your experience to me. :)
> > 
> If you look at the current "Cache Tier Flush = immediate base tier 
> journal sync?" thread, there Greg kinda confirms that all these updates 
> might be responsible. 
> 
> Regards, 
> 
> Christian 


-- 
Christian Balzer Network/Systems Engineer 
chibi@xxxxxxx Global OnLine Japan/Fusion Communications 
http://www.gol.com/ 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




