SSD journal deployment experiences

On Tue, 9 Sep 2014 01:40:42 +0000 Quenten Grasso wrote:

> This reminds me of something I was trying to find out awhile back.
> 
> If we have 2000 "random" IOPS of 4K blocks, our cluster
> (assuming 3x replicas) will generate 6000 IOPS @ 4K onto the journals.
> 
> Does this mean our journals will absorb 6000 IOPS and turn these into X
> IOPS on our spindles? 
> 
In theory, yes.
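
As a back-of-the-envelope sketch (the 3:1 journal-to-spindle coalescing
ratio is just an assumption here; see the measurement below):

  # replication multiplies client writes onto the journals; the journal
  # then coalesces some of them before they reach the spindles
  client_iops=2000; replicas=3; coalesce=3
  journal_iops=$((client_iops * replicas))
  echo "journal: ${journal_iops} IOPS, spindles: ~$((journal_iops / coalesce)) IOPS"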

> If this is the case, is it possible to calculate how many IOPS a journal
> would "absorb" and how this would translate to X IOPS on a spindle disk?
> 
It very much depends; a number of configuration parameters will
influence this, as will what those IOPS actually are.

As an example, with "rados -p rbd bench 30 write -t 32 -b 4096" I see a
3:1 ratio of journal to spindle IOPS on a cluster here, as measured with
the ol' mark 1 eyeball and atop or iostat.

Christian
> Regards,
> Quenten Grasso
> 
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces at lists.ceph.com] On Behalf
> Of Christian Balzer
> Sent: Sunday, 7 September 2014 1:38 AM
> To: ceph-users
> Subject: Re: SSD journal deployment experiences
> 
> On Sat, 6 Sep 2014 14:50:20 +0000 Dan van der Ster wrote:
> 
> > September 6 2014 4:01 PM, "Christian Balzer" <chibi at gol.com> wrote: 
> > > On Sat, 6 Sep 2014 13:07:27 +0000 Dan van der Ster wrote:
> > > 
> > >> Hi Christian,
> > >> 
> > >> Let's keep debating until a dev corrects us ;)
> > > 
> > > For the time being, I give the recent:
> > > 
> > > https://www.mail-archive.com/ceph-users at lists.ceph.com/msg12203.html
> > > 
> > > And not so recent:
> > > http://www.spinics.net/lists/ceph-users/msg04152.html
> > > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021
> > > 
> > > And I'm not going to use BTRFS for mainly RBD backed VM images 
> > > (fragmentation city), never mind the other stability issues that 
> > > crop up here ever so often.
> > 
> > 
> > Thanks for the links... So until I learn otherwise, I'd better assume 
> > the OSD is lost when the journal fails, even though I haven't 
> > understood exactly why :( I'm going to UTSL to understand the
> > consistency better. An op state diagram would help, but I didn't find
> > one yet.
> > 
Using the source as an option of last resort is always nice, but having
to actually do so for something like this feels a bit lacking in the
documentation department (that, or my google-fu is weak). ^o^
> 
> > BTW, do you happen to know, _if_ we re-use an OSD after the journal 
> > has failed, are any object inconsistencies going to be found by a 
> > scrub/deep-scrub?
> > 
> No idea. 
> And really a scenario I hope to never encounter. ^^;;
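> 
> If anyone ever does re-use such an OSD, forcing a deep scrub on the
> affected PGs would be the obvious sanity check (the IDs below are
> placeholders):
> 
>   ceph osd deep-scrub 12      # deep-scrub all PGs primary on osd.12
>   ceph pg deep-scrub 2.1f     # or just one suspect PG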
> 
> > >> 
> > >> We have 4 servers in a 3U rack, then each of those servers is 
> > >> connected to one of these enclosures with a single SAS cable.
> > >> 
> > >>>> With the current config, when I dd to all drives in parallel I 
> > >>>> can write at 24*74MB/s = 1776MB/s.
> > >>> 
> > >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 
> > >>> 2.0 lanes, so as far as that bus goes, it can do 4GB/s.
> > >>> And given your storage pod I assume it is connected with 2 
> > >>> mini-SAS cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s 
> > >>> SATA bandwidth.
> > >> 
> > >> From above, we are only using 4 lanes -- so around 2GB/s is
> > >> expected.
> > > 
> > > Alright, that explains that then. Any reason for not using both
> > > ports?
> > > 
> > 
> > Probably to minimize costs, and since the single 10Gig-E is a 
> > bottleneck anyway. The whole thing is suboptimal in any case, since 
> > this hardware was not purchased for Ceph to begin with. Hence the
> > retrofitting of SSDs, etc...
> >
> The single 10Gb/s link is the bottleneck for sustained stuff, but when
> looking at spikes... Oh well, I guess if you ever connect that 2nd 10GbE
> card that 2nd port might also get some loving. ^o^
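> 
> For the curious, the single-cable math works out roughly like this
> (counting only 8b/10b encoding overhead):
> 
>   # 4 lanes x 6 Gb/s = 24 Gb/s raw; 8b/10b leaves ~80% usable
>   awk 'BEGIN { raw = 4 * 6; print raw " Gb/s raw, ~" raw * 0.8 / 8 " GB/s usable" }'
> 
> Which puts the observed 24 x 74MB/s = 1776MB/s comfortably under the
> ~2.4GB/s single-port ceiling, and both ports would double that.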
> 
> The cluster I'm currently building is based on storage nodes with 4 SSDs
> (100GB DC S3700s, so 800MB/s would be the absolute write speed limit) and
> 8 HDDs, connected with 40Gb/s Infiniband. Dual port, dual switch for
> redundancy, not speed. ^^ 
> > >>> Impressive, even given your huge cluster with 1128 OSDs.
> > >>> However that's not really answering my question, how much data is 
> > >>> on an average OSD and thus gets backfilled in that hour?
> > >> 
> > That's true -- our drives have around 300GB on them. So I guess it 
> > will take longer - 3x longer - when the drives are 1TB full.
> > > 
> > > On your slides, when the crazy user filled the cluster with 250 
> > > million objects and thus 1PB of data, I recall seeing a 7 hour
> > > backfill time?
> > > 
> > 
> > Yeah that was fun :) It was 250 million (mostly) 4k objects, so not 
> > close to 1PB. The point was that to fill the cluster with RBD, we'd 
> > need 250 million (4MB) objects. So, object-count-wise this was a full 
> > cluster, but for the real volume it was more like 70TB IIRC (there 
> > were some other larger objects too).
> > 
> Ah, I see. ^^
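> 
> The arithmetic checks out, roughly (a quick sanity check, nothing more):
> 
>   # 250M 4k objects is ~1TB of payload; at the 4MB RBD default object
>   # size the same count is ~1PB
>   awk 'BEGIN { printf "%.1f TB at 4k, %.2f PB at 4MB\n", 250e6*4096/1024^4, 250e6*4/1024^3 }'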
> 
> > In that case, the backfilling was CPU-bound, or perhaps 
> > wbthrottle-bound, I don't remember... It was just that there were many 
> > tiny tiny objects to synchronize.
> > 
> Indeed. This is something I and others have seen as well: backfilling
> being much slower than the underlying HW would permit, while being CPU
> intensive.
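> 
> For reference, these are the knobs that gate backfill parallelism
> (runtime-settable; the values below are purely illustrative, and raising
> them only helps when the disks really are the bottleneck):
> 
>   ceph tell osd.\* injectargs '--osd-max-backfills 2'
>   ceph tell osd.\* injectargs '--osd-recovery-max-active 5'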
> 
> > > Anyway, I guess the lesson to take away from this is that size and 
> > > parallelism does indeed help, but even in a cluster like yours 
> > > recovering from a 2TB loss would likely be in the 10 hour range...
> > 
> > Bigger clusters probably backfill faster simply because there are more 
> > OSDs involved in the backfilling. In our cluster we initially get 
> > 30-40 backfills in parallel after 1 OSD fails. That's even with max 
> > backfills = 1. The backfilling sorta follows an 80/20 rule -- 80% of 
> > the time is spent backfilling the last 20% of the PGs, just because 
> > some OSDs randomly get more new PGs than the others.
> > 
> The fact that you're still on dumpling probably doesn't help that
> uneven distribution. Definitely another data point to go into a
> realistic recovery/reliability model, though.
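> 
> For anyone attempting such a model, a zeroth-order estimate scaled from
> your ~300GB-per-OSD-in-an-hour figure (placeholder numbers, and the
> 80/20 tail will push the real time well past this):
> 
>   # ~2TB on the failed OSD / ~300GB per hour of backfill throughput
>   awk 'BEGIN { printf "%.1f hours before the tail\n", 2000 / 300 }'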
> 
> Christian
> 
> > > Again, see the "Best practice K/M-parameters EC pool" thread. ^.^
> > 
> > Marked that one to read again.
> > 
> > Cheers, dan
> > 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

