On Mon, 23 Mar 2015 11:51:56 +0100 (CET) Alexandre DERUMIER wrote:

> >> the combination of all the
> >> things mentioned before in the Ceph/FS stack caused a 12x amplification
> >> (instead of 2x) _before_ hitting the SSD.
>
> Oh, ok, pretty strange.
>
> BTW, is it through ceph-fs? Or rbd/rados?
>
See the link below, it was rados bench. But anything that would generate
small writes would cause this, I bet.

> ----- Original Message -----
> From: "Christian Balzer" <chibi@xxxxxxx>
> To: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> Cc: "aderumier" <aderumier@xxxxxxxxx>
> Sent: Monday, 23 March 2015 08:29:03
> Subject: Re: SSD Hardware recommendation
>
> Hello,
>
> Again, refer to my original, old mail:
>
> http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html
>
> I was strictly looking at the SMART values, in the case of these
> Intel DC S3700 SSDs the "Host_Writes_32MiB" values.
> Which, according to what the name implies and all the references I
> could find, means exactly that: the writes from the host (the SATA
> controller) to the actual SSD.
> So no matter what optimizations the SSD does itself and what other
> things might be possible with things like O_DSYNC, the combination of
> all the things mentioned before in the Ceph/FS stack caused a 12x
> amplification (instead of 2x) _before_ hitting the SSD.
>
> And that's where optimizations in Ceph and other components, maybe
> avoiding a FS altogether, will be very helpful and welcome.
>
> Regards,
>
> Christian
>
> On Mon, 23 Mar 2015 07:49:41 +0100 (CET) Alexandre DERUMIER wrote:
>
> > >> (not tested, but I think that with the journal and O_DSYNC writes,
> > >> it can give us SSD write amplification)
> >
> > Also, I think that enterprise SSDs with supercapacitors should be able
> > to cache these O_DSYNC writes in the SSD buffer and do bigger writes
> > to reduce amplification.
> >
> > I don't know how the SSD-internal algorithms work for this.
> >
> >
> > ----- Original Message -----
> > From: "aderumier" <aderumier@xxxxxxxxx>
> > To: "Christian Balzer" <chibi@xxxxxxx>
> > Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> > Sent: Monday, 23 March 2015 07:36:48
> > Subject: Re: SSD Hardware recommendation
> >
> > Hi,
> >
> > Isn't it in the nature of SSDs to have write amplification?
> >
> > Generally, they have an erase block size of 128k,
> >
> > so the worst case could be 128/4 = 32x write amplification.
> >
> > (Of course SSD algorithms and optimisations reduce this write
> > amplification.)
> >
> > Now, it would be great to see whether it's coming from the OSD journal
> > or the OSD data.
> >
> > (Not tested, but I think that with the journal and O_DSYNC writes, it
> > can give us SSD write amplification.)
> >
> >
> > ----- Original Message -----
> > From: "Christian Balzer" <chibi@xxxxxxx>
> > To: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> > Sent: Monday, 23 March 2015 03:11:39
> > Subject: Re: SSD Hardware recommendation
> >
> > On Mon, 23 Mar 2015 02:33:20 +0100 Francois Lafont wrote:
> >
> > > Hi,
> > >
> > > Sorry, Christian, for my late answer. I was a little busy.
> > >
> > > Christian Balzer wrote:
> > >
> > > > You're asking the wrong person, as I'm neither a Ceph nor a kernel
> > > > developer. ^o^
> > >
> > > No, no, the rest of the message proves to me that I'm talking to the
> > > right person. ;)
> > >
> > > > Back then Mark Nelson from the Ceph team didn't expect to see
> > > > those numbers either, but both Mark Wu and I saw them.
> > > >
> > > > Anyway, let's start with the basics and things that are
> > > > understandable without any detailed knowledge.
> > > >
> > > > Assume a cluster with 2 nodes, 10 OSDs each and a replication of 2
> > > > (since we're talking about an SSD cluster here and keeping things
> > > > related to the question of the OP).
> > > >
> > > > Now a client writes 40MB of data to the cluster.
> > > > Assuming an ideal scenario where all PGs are evenly distributed
> > > > (they won't be) and this is totally fresh data (resulting in 10
> > > > 4MB Ceph objects), this would mean that each OSD will receive 4MB
> > > > (10 primary PGs, 10 secondary ones).
> > > > With journals on the same SSD (currently the best way based on
> > > > tests), we get a write amplification of 2, as that data is written
> > > > both to the journal and the actual storage space.
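
A minimal sketch of that arithmetic, in Python, using the numbers from the
example above (2 nodes with 10 OSDs each, replication 2, a 40MB client
write) and assuming the idealised, perfectly even PG distribution it
describes:

    # Idealised per-OSD write load for the example above:
    # 2 nodes x 10 OSDs, replication 2, journal on the same SSD as the data.
    client_write_mb = 40             # data written by the client
    num_osds = 2 * 10                # 2 nodes with 10 OSDs each
    replication = 2                  # every object is stored on 2 OSDs
    journal_factor = 2               # journal + data live on the same SSD

    cluster_data_mb = client_write_mb * replication             # 80 MB hit the OSDs
    data_per_osd_mb = cluster_data_mb / num_osds                # 4 MB per OSD
    ssd_writes_per_osd_mb = data_per_osd_mb * journal_factor    # 8 MB per SSD

    print(f"{data_per_osd_mb:.0f} MB of data per OSD, "
          f"{ssd_writes_per_osd_mb:.0f} MB written to each SSD "
          f"(amplification {ssd_writes_per_osd_mb / data_per_osd_mb:.0f}x)")

In a real cluster the PG distribution is not that even, so individual SSDs
will see somewhat more or less than this.
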
> > > >
> > > > But as my results in the link above showed, that is very much
> > > > dependent on the write size. With a 4MB block size (the ideal size
> > > > for default RBD pools and objects) I saw even slightly less than
> > > > the 2x amplification expected; I assume that was due to caching
> > > > and PG imbalances.
> > > >
> > > > Now my guess as to what happens with small (4KB) writes is that all
> > > > these small writes do not coalesce sufficiently before being
> > > > written to the object on the OSD.
> > > > So up to 1000 4KB writes could happen to that 4MB object (clearly
> > > > it is much less than that, but how much I can't tell), resulting
> > > > in the same "blocks" being rewritten several times.
> > >
> > > OK. If I understand correctly, with replication == 2 and journals on
> > > the same disks as the OSDs (I assume that we are talking about
> > > storage via a block device):
> > >
> > > 1. In theory there is a "write" amplification (between the client
> > > side and the OSD backend side) equal to 2 x #replicas = 4,
> > > because data is written to the journal and afterwards to the OSD
> > > storage.
> > >
> >
> > The write amplification on a cluster-wide basis is like that, but
> > we're only interested in the amplification at the OSD level (SSD
> > wearout) and there it should be 2 (journal and data).
> >
> > Also keep in mind that the OSD distribution isn't perfect and that
> > you might have some very hot data (frequently written) that resides
> > in only one Ceph object (4MB), so just on one PG and thus hitting
> > only 3 OSDs (replication of 3) all the time, while other OSDs see
> > much less usage.
> >
> > In theory that should average out; in practice you might see quite
> > some variation in how busy (data written) the OSDs are.
> >
> > > 2. But in practice you notice that this amplification factor depends
> > > on the write size and, for instance, with lots of small I/O on the
> > > client side, you sometimes see a factor of 5 or 6 (instead of 4)
> > > because in the OSDs, blocks are rewritten several times (a small
> > > write on the client side can trigger bigger writes in the OSD
> > > backend storage).
> > >
> > Well, the example setup had only one node and thus a replication of 1,
> > so the expected total write amplification was 2. To quote:
> > ---
> > Now if we run that same test with a block size of 4KB one gets:
> > Total time run:        30.033196
> > Total writes made:     126944
> > Write size:            4096
> > Bandwidth (MB/sec):    16.511
> >
> > This makes about 508MB written, or roughly 64MB per OSD.
> > According to the SMART values of the SSDs they wrote 768MB each, or
> > in other words 6 times more than one would have expected with a
> > write amplification of 2.
> > ---
> >
> > So what I expected was around 128MB per OSD (a write amplification
> > of 2), but what was actually written was 768MB: an amplification of
> > 12, and thus 6 times more than expected.
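
A quick re-derivation of those figures as a Python sketch; the 64MB-per-OSD,
768MB and 2x numbers are the ones quoted above, while the OSD count of 8 is
only inferred from 508MB divided by roughly 64MB per OSD:

    # Amplification in the quoted 4KB rados bench run.
    writes_made = 126944
    write_size_kb = 4
    total_written_mb = writes_made * write_size_kb / 1000   # ~508 MB, as quoted
    # ~64MB per OSD implies 8 OSDs in that one-node test setup (inferred).
    data_per_osd_mb = 64              # bench data that landed on each OSD
    smart_writes_per_osd_mb = 768     # host writes reported by each SSD (SMART)
    expected_factor = 2               # journal + data on the same SSD

    observed_factor = smart_writes_per_osd_mb / data_per_osd_mb   # 12x
    expected_mb = data_per_osd_mb * expected_factor               # 128 MB

    print(f"expected ~{expected_mb} MB per SSD, SMART reports "
          f"{smart_writes_per_osd_mb} MB: {observed_factor:.0f}x amplification, "
          f"or {smart_writes_per_osd_mb / expected_mb:.0f} times more than expected")
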
> >
> > > Did I summarize the phenomenon correctly?
> > >
> > > > There's also the journaling done by the respective file system (I
> > > > used ext4 during that test) and while there are bound to be some
> > > > differences, in a worst-case scenario that could result in another
> > > > 2x write amplification (FS journal and actual file).
> > > >
> > > > In addition, Ceph updates various files like the omap leveldb and
> > > > meta-data; quantifying that, however, would require much more
> > > > detailed analysis or familiarity with the Ceph code.
> > >
> > > OK. Thank you very much, Christian, for taking the time to explain
> > > your experience to me. :)
> > >
> > If you look at the current "Cache Tier Flush = immediate base tier
> > journal sync?" thread, Greg kinda confirms there that all these
> > updates might be responsible.
> >
> > Regards,
> >
> > Christian
>

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
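
As a closing note on the measurement method used throughout this thread:
the sketch below shows one way the Host_Writes_32MiB SMART counter could be
sampled before and after a benchmark to arrive at per-SSD host-write figures
like the 768MB above. It is an illustration only, not from the original
mails; it assumes smartmontools is installed, that the drive exposes this
Intel attribute (the DC S3700 does) with a raw value counting units of
32MiB, and the device path is just a placeholder.

    # Sample the Host_Writes_32MiB SMART attribute around a test run and
    # report how much the host actually wrote to the SSD in between.
    import subprocess
    import sys

    def host_writes_mib(device):
        """Host writes in MiB according to the Host_Writes_32MiB raw value."""
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            if "Host_Writes_32MiB" in line:
                return int(line.split()[-1]) * 32   # RAW_VALUE is the last column
        raise RuntimeError(f"Host_Writes_32MiB not reported by {device}")

    if __name__ == "__main__":
        dev = sys.argv[1] if len(sys.argv) > 1 else "/dev/sda"   # placeholder path
        before = host_writes_mib(dev)
        input("Run the benchmark (e.g. rados bench), then press Enter... ")
        after = host_writes_mib(dev)
        print(f"{dev}: {after - before} MiB written by the host during the run")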