On Sat, 06 Sep 2014 16:06:56 +0000 Scott Laird wrote:

> Backing up slightly, have you considered RAID 5 over your SSDs?
> Practically speaking, there's no performance downside to RAID 5 when
> your devices aren't IOPS-bound.
>
Well...
For starters, with RAID5 you would lose 25% throughput in both Dan's and
my case (4 SSDs) compared to JBOD SSD journals (rough numbers are sketched
at the end of this mail). In Dan's case that might not matter due to other
bottlenecks, in my case it certainly would.

And while you're quite correct when it comes to IOPS, doing RAID5 will
either consume significant CPU resources in a software RAID case or
require a decent HW RAID controller.

Christian

> On Sat Sep 06 2014 at 8:37:56 AM Christian Balzer <chibi at gol.com> wrote:
>
> > On Sat, 6 Sep 2014 14:50:20 +0000 Dan van der Ster wrote:
> >
> > > September 6 2014 4:01 PM, "Christian Balzer" <chibi at gol.com> wrote:
> > > > On Sat, 6 Sep 2014 13:07:27 +0000 Dan van der Ster wrote:
> > > >
> > > >> Hi Christian,
> > > >>
> > > >> Let's keep debating until a dev corrects us ;)
> > > >
> > > > For the time being, I give the recent:
> > > >
> > > > https://www.mail-archive.com/ceph-users at lists.ceph.com/msg12203.html
> > > >
> > > > And not so recent:
> > > > http://www.spinics.net/lists/ceph-users/msg04152.html
> > > > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021
> > > >
> > > > And I'm not going to use BTRFS for mainly RBD backed VM images
> > > > (fragmentation city), never mind the other stability issues that
> > > > crop up here ever so often.
> > > >
> > >
> > > Thanks for the links... So until I learn otherwise, I better assume
> > > the OSD is lost when the journal fails. Even though I haven't
> > > understood exactly why :( I'm going to UTSL to understand the
> > > consistency better. An op state diagram would help, but I didn't
> > > find one yet.
> > >
> > Using the source as an option of last resort is always nice, having to
> > actually do so for something like this feels a bit lacking in the
> > documentation department (that or my google foo being weak). ^o^
> >
> > > BTW, do you happen to know, _if_ we re-use an OSD after the journal
> > > has failed, are any object inconsistencies going to be found by a
> > > scrub/deep-scrub?
> > >
> > No idea.
> > And really a scenario I hope to never encounter. ^^;;
> >
> > > >
> > > >> We have 4 servers in a 3U rack, then each of those servers is
> > > >> connected to one of these enclosures with a single SAS cable.
> > > >>
> > > >>>> With the current config, when I dd to all drives in parallel I
> > > >>>> can write at 24*74MB/s = 1776MB/s.
> > > >>>
> > > >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe
> > > >>> 2.0 lanes, so as far as that bus goes, it can do 4GB/s.
> > > >>> And given your storage pod I assume it is connected with 2
> > > >>> mini-SAS cables, 4 lanes each at 6Gb/s, making for 4x6x2 =
> > > >>> 48Gb/s SATA bandwidth.
> > > >>
> > > >> From above, we are only using 4 lanes -- so around 2GB/s is
> > > >> expected.
> > > >
> > > > Alright, that explains that then. Any reason for not using both
> > > > ports?
> > > >
> > >
> > > Probably to minimize costs, and since the single 10Gig-E is a
> > > bottleneck anyway. The whole thing is suboptimal anyway, since this
> > > hardware was not purchased for Ceph to begin with. Hence
> > > retrofitting SSDs, etc...
> > >
> > The single 10Gb/s link is the bottleneck for sustained stuff, but when
> > looking at spikes...
> > Oh well, I guess if you ever connect that 2nd 10GbE card that 2nd port
> > might also get some loving. ^o^
> >
> > The cluster I'm currently building is based on storage nodes with 4
> > SSDs (100GB DC 3700s, so 800MB/s would be the absolute write speed
> > limit) and 8 HDDs. Connected with 40Gb/s Infiniband. Dual port, dual
> > switch for redundancy, not speed. ^^
> >
> > > >>> Impressive, even given your huge cluster with 1128 OSDs.
> > > >>> However that's not really answering my question, how much data
> > > >>> is on an average OSD and thus gets backfilled in that hour?
> > > >>
> > > >> That's true -- our drives have around 300GB on them. So I guess it
> > > >> will take longer - 3x longer - when the drives are 1TB full.
> > > >
> > > > On your slides, when the crazy user filled the cluster with 250
> > > > million objects and thus 1PB of data, I recall seeing a 7 hour
> > > > backfill time?
> > > >
> > >
> > > Yeah that was fun :) It was 250 million (mostly) 4k objects, so not
> > > close to 1PB. The point was that to fill the cluster with RBD, we'd
> > > need 250 million (4MB) objects. So, object-count-wise this was a full
> > > cluster, but for the real volume it was more like 70TB IIRC (there
> > > were some other larger objects too).
> > >
> > Ah, I see. ^^
> >
> > > In that case, the backfilling was CPU-bound, or perhaps
> > > wbthrottle-bound, I don't remember... It was just that there were
> > > many tiny tiny objects to synchronize.
> > >
> > Indeed. This is something me and others have seen as well, as in
> > backfilling being much slower than the underlying HW would permit and
> > being CPU intensive.
> >
> > > > Anyway, I guess the lesson to take away from this is that size and
> > > > parallelism does indeed help, but even in a cluster like yours
> > > > recovering from a 2TB loss would likely be in the 10 hour range...
> > >
> > > Bigger clusters probably backfill faster simply because there are
> > > more OSDs involved in the backfilling. In our cluster we initially
> > > get 30-40 backfills in parallel after 1 OSD fails. That's even with
> > > max backfills = 1. The backfilling sorta follows an 80/20 rule --
> > > 80% of the time is spent backfilling the last 20% of the PGs, just
> > > because some OSDs randomly get more new PGs than the others.
> > >
> > You still being on dumpling probably doesn't help that uneven
> > distribution bit.
> > Definitely another data point to go into a realistic
> > recovery/reliability model, though.
> >
> > Christian
> >
> > > > Again, see the "Best practice K/M-parameters EC pool" thread. ^.^
> > >
> > > Marked that one to read, again.
> > >
> > > Cheers, dan
> > >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi at gol.com         Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users at lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >


--
Christian Balzer        Network/Systems Engineer
chibi at gol.com         Global OnLine Japan/Fusion Communications
http://www.gol.com/
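
A quick back-of-the-envelope sketch (Python) of the RAID5-vs-JBOD journal
throughput and the bus limits discussed above. This is an illustration added
for clarity, not part of the original thread; it assumes roughly 200MB/s
sequential writes per 100GB DC S3700 and full-stripe RAID5 writes with no
controller overhead.

    # Hypothetical back-of-the-envelope numbers; assumptions as noted above.
    SSD_WRITE_MBPS = 200            # assumed per-SSD sequential write (4 x 200 = 800MB/s)
    NUM_SSDS = 4

    # JBOD journals: all four SSDs take journal writes at full speed.
    jbod_mbps = NUM_SSDS * SSD_WRITE_MBPS                 # 800 MB/s

    # RAID5 over 4 SSDs: each full stripe is 3 data chunks + 1 parity chunk,
    # so only 3/4 of the raw write bandwidth carries journal data.
    raid5_mbps = (NUM_SSDS - 1) * SSD_WRITE_MBPS          # 600 MB/s
    loss_pct = 100 * (1 - raid5_mbps / jbod_mbps)         # 25%

    # Bus limits mentioned upthread (LSI SAS2008 HBA):
    pcie2_x8_gbs = 8 * 0.5                                # ~4 GB/s usable on PCIe 2.0 x8
    one_cable_gbit = 4 * 6                                # 24 Gb/s raw per mini-SAS cable (4 lanes x 6Gb/s)
    two_cables_gbit = 2 * one_cable_gbit                  # 48 Gb/s with both ports cabled
    one_cable_usable_gbs = one_cable_gbit * 0.8 / 8       # ~2.4 GB/s after 8b/10b, hence "around 2GB/s"

    print(f"JBOD journals : {jbod_mbps} MB/s")
    print(f"RAID5 journals: {raid5_mbps} MB/s ({loss_pct:.0f}% less)")
    print(f"PCIe 2.0 x8   : {pcie2_x8_gbs:.1f} GB/s")
    print(f"One SAS cable : {one_cable_gbit} Gb/s raw (~{one_cable_usable_gbs:.1f} GB/s), both ports: {two_cables_gbit} Gb/s")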
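
The same kind of rough arithmetic for the object-count and backfill figures
in the quoted part. The 250 million objects, ~300GB per OSD, ~1TB and ~1 hour
figures come from the thread; the rest is just illustration.

    # Object-count arithmetic behind the "full cluster" discussion above.
    objects = 250_000_000

    rbd_object_bytes = 4 * 2**20        # default 4MB RBD objects
    small_object_bytes = 4 * 2**10      # the ~4k objects that actually filled it

    rbd_full = objects * rbd_object_bytes       # ~0.9 PiB: why 250M objects counts as "full" for RBD
    small_total = objects * small_object_bytes  # ~0.9 TiB from the tiny objects alone
                                                # (actual volume was ~70TB incl. larger objects)

    print(f"250M x 4MB objects: {rbd_full / 2**50:.2f} PiB")
    print(f"250M x 4k objects : {small_total / 2**40:.2f} TiB")

    # Backfill scaling: ~1 hour at ~300GB per OSD suggests roughly 3x longer
    # when the drives hold ~1TB, all else (CPU, parallelism) being equal.
    print(f"1TB / 300GB = {1000/300:.1f}x the backfill time")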