On Mon, 23 Mar 2015 11:51:56 +0100 (CET) Alexandre DERUMIER wrote:

> >> the combination of all the
> >> things mentioned before in the Ceph/FS stack caused a 12x amplification
> >> (instead of 2x) _before_ hitting the SSD.
>
> Oh, ok, pretty strange.
>
> BTW, is it through ceph-fs? Or rbd/rados?
>
See the link below, it was rados bench. But anything that would generate
small writes would cause this, I bet.

> ----- Original Message -----
> From: "Christian Balzer" <chibi@xxxxxxx>
> To: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> Cc: "aderumier" <aderumier@xxxxxxxxx>
> Sent: Monday, 23 March 2015 08:29:03
> Subject: Re: SSD Hardware recommendation
>
> Hello,
>
> Again, refer to my original, old mail:
>
> http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html
>
> I was strictly looking at the SMART values, in the case of these
> Intel DC S3700 SSDs the "Host_Writes_32MiB" values.
> Which, according to what the name implies and all the references I
> could find, means exactly that: the writes from the host (the SATA
> controller) to the actual SSD.
> So no matter what optimizations the SSD does itself and what other
> things might be possible with things like O_DSYNC, the combination of
> all the things mentioned before in the Ceph/FS stack caused a 12x
> amplification (instead of 2x) _before_ hitting the SSD.
>
> And that's where optimizations in Ceph and other components, maybe
> avoiding a FS altogether, will be very helpful and welcome.
>
> Regards,
>
> Christian
>
> On Mon, 23 Mar 2015 07:49:41 +0100 (CET) Alexandre DERUMIER wrote:
>
> > >> (not tested, but I think that with the journal and O_DSYNC writes,
> > >> it can give us SSD write amplification)
> >
> > Also, I think that enterprise SSDs with supercapacitors should be able
> > to cache these O_DSYNC writes in the SSD buffer and do bigger writes
> > to reduce amplification.
> >
> > I don't know how the SSD-internal algorithms work for this.
> >
> >
> > ----- Original Message -----
> > From: "aderumier" <aderumier@xxxxxxxxx>
> > To: "Christian Balzer" <chibi@xxxxxxx>
> > Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> > Sent: Monday, 23 March 2015 07:36:48
> > Subject: Re: SSD Hardware recommendation
> >
> > Hi,
> >
> > Isn't it in the nature of SSDs to have write amplification?
> >
> > Generally, they have an erase block size of 128k,
> >
> > so the worst case could be 128/4 = 32x write amplification.
> >
> > (Of course SSD algorithms and optimisations reduce this write
> > amplification.)
> >
> > Now, it would be great to see whether it's coming from the OSD journal
> > or the OSD data.
> >
> > (Not tested, but I think that with the journal and O_DSYNC writes, it
> > can give us SSD write amplification.)
> >
> >
> > ----- Original Message -----
> > From: "Christian Balzer" <chibi@xxxxxxx>
> > To: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> > Sent: Monday, 23 March 2015 03:11:39
> > Subject: Re: SSD Hardware recommendation
> >
> > On Mon, 23 Mar 2015 02:33:20 +0100 Francois Lafont wrote:
> >
> > > Hi,
> > >
> > > Sorry, Christian, for my late answer. I was a little busy.
> > >
> > > Christian Balzer wrote:
> > >
> > > > You're asking the wrong person, as I'm neither a Ceph nor a kernel
> > > > developer. ^o^
> > >
> > > No, no, the rest of the message proves to me that I'm talking to the
> > > right person. ;)
> > >
> > > > Back then Mark Nelson from the Ceph team didn't expect to see
> > > > those numbers either, but both Mark Wu and I saw them.
> > > >
> > > > Anyway, let's start with the basics and things that are
> > > > understandable without any detailed knowledge.
> > > >
> > > > Assume a cluster with 2 nodes, 10 OSDs each and a replication of 2
> > > > (since we're talking about an SSD cluster here and keeping things
> > > > related to the question of the OP).
> > > >
> > > > Now a client writes 40MB of data to the cluster.
> > > > Assuming an ideal scenario where all PGs are evenly distributed
> > > > (they won't be) and this is totally fresh data (resulting in 10
> > > > 4MB Ceph objects), this would mean that each OSD will receive 4MB
> > > > (10 primary PGs, 10 secondary ones).
> > > > With journals on the same SSD (currently the best way based on
> > > > tests), we get a write amplification of 2, as that data is written
> > > > both to the journal and the actual storage space.
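
A minimal sketch of that arithmetic, in Python, using the numbers from the
example above (2 nodes with 10 OSDs each, replication 2, a 40MB client
write) and assuming the idealised, perfectly even PG distribution it
describes:

    # Idealised per-OSD write load for the example above:
    # 2 nodes x 10 OSDs, replication 2, journal on the same SSD as the data.
    client_write_mb = 40             # data written by the client
    num_osds = 2 * 10                # 2 nodes with 10 OSDs each
    replication = 2                  # every object is stored on 2 OSDs
    journal_factor = 2               # journal + data live on the same SSD

    cluster_data_mb = client_write_mb * replication             # 80 MB hit the OSDs
    data_per_osd_mb = cluster_data_mb / num_osds                # 4 MB per OSD
    ssd_writes_per_osd_mb = data_per_osd_mb * journal_factor    # 8 MB per SSD

    print(f"{data_per_osd_mb:.0f} MB of data per OSD, "
          f"{ssd_writes_per_osd_mb:.0f} MB written to each SSD "
          f"(amplification {ssd_writes_per_osd_mb / data_per_osd_mb:.0f}x)")

In a real cluster the PG distribution is not that even, so individual SSDs
will see somewhat more or less than this.
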
> > > >
> > > > But as my results in the link above showed, that is very much
> > > > dependent on the write size. With a 4MB block size (the ideal size
> > > > for default RBD pools and objects) I saw even slightly less than
> > > > the 2x amplification expected; I assume that was due to caching
> > > > and PG imbalances.
> > > >
> > > > Now my guess as to what happens with small (4KB) writes is that all
> > > > these small writes do not coalesce sufficiently before being
> > > > written to the object on the OSD.
> > > > So up to 1000 4KB writes could happen to that 4MB object (clearly
> > > > it is much less than that, but how much I can't tell), resulting
> > > > in the same "blocks" being rewritten several times.
> > >
> > > OK. If I understand correctly, with replication == 2 and journals on
> > > the same disks as the OSDs (I assume that we are talking about
> > > storage via a block device):
> > >
> > > 1. In theory there is a "write" amplification (between the client
> > > side and the OSD backend side) equal to 2 x #replicas = 4,
> > > because data is written to the journal and afterwards to the OSD
> > > storage.
> > >
> >
> > The write amplification on a cluster-wide basis is like that, but
> > we're only interested in the amplification at the OSD level (SSD
> > wearout) and there it should be 2 (journal and data).
> >
> > Also keep in mind that the OSD distribution isn't perfect and that
> > you might have some very hot data (frequently written) that resides
> > in only one Ceph object (4MB), so just on one PG and thus hitting
> > only 3 OSDs (replication of 3) all the time, while other OSDs see
> > much less usage.
> >
> > In theory that should average out; in practice you might see quite
> > some variation in how busy (data written) the OSDs are.
> >
> > > 2. But in practice you notice that this amplification factor depends
> > > on the write size and, for instance, with lots of small I/O on the
> > > client side, you sometimes see a factor of 5 or 6 (instead of 4)
> > > because in the OSDs, blocks are rewritten several times (a small
> > > write on the client side can trigger bigger writes in the OSD
> > > backend storage).
> > >
> > Well, the example setup had only one node and thus a replication of 1,
> > so the expected total write amplification was 2. To quote:
> > ---
> > Now if we run that same test with a block size of 4KB one gets:
> > Total time run:        30.033196
> > Total writes made:     126944
> > Write size:            4096
> > Bandwidth (MB/sec):    16.511
> >
> > This makes about 508MB written, or roughly 64MB per OSD.
> > According to the SMART values of the SSDs they wrote 768MB each, or
> > in other words 6 times more than one would have expected with a
> > write amplification of 2.
> > ---
> >
> > So what I expected was around 128MB per OSD (a write amplification
> > of 2), but what was actually written was 768MB: an amplification of
> > 12, and thus 6 times more than expected.
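
A quick re-derivation of those figures as a Python sketch; the 64MB-per-OSD,
768MB and 2x numbers are the ones quoted above, while the OSD count of 8 is
only inferred from 508MB divided by roughly 64MB per OSD:

    # Amplification in the quoted 4KB rados bench run.
    writes_made = 126944
    write_size_kb = 4
    total_written_mb = writes_made * write_size_kb / 1000   # ~508 MB, as quoted
    # ~64MB per OSD implies 8 OSDs in that one-node test setup (inferred).
    data_per_osd_mb = 64              # bench data that landed on each OSD
    smart_writes_per_osd_mb = 768     # host writes reported by each SSD (SMART)
    expected_factor = 2               # journal + data on the same SSD

    observed_factor = smart_writes_per_osd_mb / data_per_osd_mb   # 12x
    expected_mb = data_per_osd_mb * expected_factor               # 128 MB

    print(f"expected ~{expected_mb} MB per SSD, SMART reports "
          f"{smart_writes_per_osd_mb} MB: {observed_factor:.0f}x amplification, "
          f"or {smart_writes_per_osd_mb / expected_mb:.0f} times more than expected")
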
> >
> > > Did I summarize the phenomenon correctly?
> > >
> > > > There's also the journaling done by the respective file system (I
> > > > used ext4 during that test) and while there are bound to be some
> > > > differences, in a worst-case scenario that could result in another
> > > > 2x write amplification (FS journal and actual file).
> > > >
> > > > In addition, Ceph updates various files like the omap leveldb and
> > > > meta-data; quantifying that, however, would require much more
> > > > detailed analysis or familiarity with the Ceph code.
> > >
> > > OK. Thank you very much, Christian, for taking the time to explain
> > > your experience to me. :)
> > >
> > If you look at the current "Cache Tier Flush = immediate base tier
> > journal sync?" thread, Greg kinda confirms there that all these
> > updates might be responsible.
> >
> > Regards,
> >
> > Christian
>

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
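
As a closing note on the measurement method used throughout this thread:
the sketch below shows one way the Host_Writes_32MiB SMART counter could be
sampled before and after a benchmark to arrive at per-SSD host-write figures
like the 768MB above. It is an illustration only, not from the original
mails; it assumes smartmontools is installed, that the drive exposes this
Intel attribute (the DC S3700 does) with a raw value counting units of
32MiB, and the device path is just a placeholder.

    # Sample the Host_Writes_32MiB SMART attribute around a test run and
    # report how much the host actually wrote to the SSD in between.
    import subprocess
    import sys

    def host_writes_mib(device):
        """Host writes in MiB according to the Host_Writes_32MiB raw value."""
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            if "Host_Writes_32MiB" in line:
                return int(line.split()[-1]) * 32   # RAW_VALUE is the last column
        raise RuntimeError(f"Host_Writes_32MiB not reported by {device}")

    if __name__ == "__main__":
        dev = sys.argv[1] if len(sys.argv) > 1 else "/dev/sda"   # placeholder path
        before = host_writes_mib(dev)
        input("Run the benchmark (e.g. rados bench), then press Enter... ")
        after = host_writes_mib(dev)
        print(f"{dev}: {after - before} MiB written by the host during the run")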