SSD journal deployment experiences

daniel.vanderster@xxxxxxx (Dan Van Der Ster) · Sat, 6 Sep 2014 16:28:29 +0000

RAID5... Hadn't considered it due to the IOPS penalty (it would get 1/4th of the IOPS of separated journal devices, according to some online raid calc). Compared to RAID10, I guess we'd get 50% more capacity, but lower performance.
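
For reference, the rule-of-thumb arithmetic behind those two figures can be sketched as below. This is only an illustration, not a benchmark: the 4-drive array size and the normalised per-drive IOPS are assumptions (the thread does not state them), and the write penalties are the usual textbook values for small random writes.

```python
# Rule-of-thumb RAID comparison; not a measurement of any real device.
# A small random write costs 4 back-end I/Os on RAID 5 (read data, read
# parity, write data, write parity) and 2 on RAID 10 (one per mirror side).
WRITE_PENALTY = {"raid5": 4, "raid10": 2}


def usable_drives(level: str, n: int) -> float:
    """Usable capacity, expressed in whole drives' worth of space."""
    return n - 1 if level == "raid5" else n / 2


def random_write_iops(level: str, n: int, per_drive_iops: float) -> float:
    """Aggregate small-random-write IOPS under the rule-of-thumb penalty."""
    return n * per_drive_iops / WRITE_PENALTY[level]


n, iops = 4, 1.0  # assumption: 4 SSDs, per-drive IOPS normalised to 1
standalone = n * iops  # separate journal devices, no RAID penalty
r5 = random_write_iops("raid5", n, iops)
r10 = random_write_iops("raid10", n, iops)
capacity_gain = usable_drives("raid5", n) / usable_drives("raid10", n) - 1

print(f"RAID5  write IOPS vs separate devices: {r5 / standalone:.2f}")
print(f"RAID10 write IOPS vs separate devices: {r10 / standalone:.2f}")
print(f"RAID5 capacity gain over RAID10: {capacity_gain:.0%}")
```

With other array sizes the capacity gain changes (for example 6 drives give 5 vs 3 usable), so the 50% figure is specific to the 4-drive assumption.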

After the anecdotes that the DCS3700 is very rarely failing, and without a stable bcache to build upon, I'm leaning toward the usual 5 journal partitions per SSD. But that will leave at least 100GB free per drive, so I might try running an OSD there.

Cheers, Dan

On Sep 6, 2014 6:07 PM, Scott Laird <scott at sigkill.org> wrote:
Backing up slightly, have you considered RAID 5 over your SSDs?  Practically speaking, there's no performance downside to RAID 5 when your devices aren't IOPS-bound.

On Sat Sep 06 2014 at 8:37:56 AM Christian Balzer <chibi at gol.com> wrote:
On Sat, 6 Sep 2014 14:50:20 +0000 Dan van der Ster wrote:

> September 6 2014 4:01 PM, "Christian Balzer" <chibi at gol.com> wrote:
> > On Sat, 6 Sep 2014 13:07:27 +0000 Dan van der Ster wrote:
> >
> >> Hi Christian,
> >>
> >> Let's keep debating until a dev corrects us ;)
> >
> > For the time being, I give the recent:
> >
> > https://www.mail-archive.com/ceph-users at lists.ceph.com/msg12203.html
> >
> > And not so recent:
> > http://www.spinics.net/lists/ceph-users/msg04152.html
> > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021
> >
> > And I'm not going to use BTRFS for mainly RBD backed VM images
> > (fragmentation city), never mind the other stability issues that crop
> > up here every so often.
>
>
> Thanks for the links... So until I learn otherwise, I better assume the
> OSD is lost when the journal fails. Even though I haven't understood
> exactly why :( I'm going to UTSL to understand the consistency better.
> An op state diagram would help, but I didn't find one yet.
>
Using the source as an option of last resort is always nice, but having to
actually do so for something like this suggests the documentation is a bit
lacking (that or my google-fu is weak). ^o^

> BTW, do you happen to know, _if_ we re-use an OSD after the journal has
> failed, are any object inconsistencies going to be found by a
> scrub/deep-scrub?
>
No idea.
And really a scenario I hope to never encounter. ^^;;

> >>
> >> We have 4 servers in a 3U rack, then each of those servers is
> >> connected to one of these enclosures with a single SAS cable.
> >>
> >>>> With the current config, when I dd to all drives in parallel I can
> >>>> write at 24*74MB/s = 1776MB/s.
> >>>
> >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 2.0
> >>> lanes, so as far as that bus goes, it can do 4GB/s.
> >>> And given your storage pod I assume it is connected with 2 mini-SAS
> >>> cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s SATA
> >>> bandwidth.
> >>
> >> From above, we are only using 4 lanes -- so around 2GB/s is expected.
> >
> > Alright, that explains that then. Any reason for not using both ports?
> >
>
> Probably to minimize costs, and since the single 10Gig-E is a bottleneck
> anyway. The whole thing is suboptimal anyway, since this hardware was
> not purchased for Ceph to begin with. Hence retrofitting SSDs, etc...
>
The single 10Gb/s link is the bottleneck for sustained stuff, but when
looking at spikes...
Oh well, I guess if you ever connect that 2nd 10GbE card that 2nd port
might also get some loving. ^o^
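
The link budgets discussed above can be checked with a quick sketch. It uses only the figures quoted in the thread (6Gb/s SAS lanes, 4 lanes per cable, 8 PCIe 2.0 lanes, 24 drives at 74MB/s, one 10GbE link). The 8b/10b encoding factor and the 500MB/s-per-lane PCIe 2.0 figure are standard values brought in here, and further protocol overhead is ignored, so treat the results as upper bounds.

```python
# Back-of-the-envelope link budgets, all results in MB/s (decimal).
SAS_LANE_MBIT = 6000      # 6Gb/s line rate per SAS lane
PCIE2_LANE_MB = 500       # PCIe 2.0 payload per lane after 8b/10b


def sas_payload_mb_s(lanes: int) -> int:
    """Payload bandwidth of `lanes` SAS lanes, assuming 8b/10b encoding."""
    return lanes * SAS_LANE_MBIT * 8 // 10 // 8


one_cable = sas_payload_mb_s(4)    # single mini-SAS cable, as deployed
two_cables = sas_payload_mb_s(8)   # both ports connected
pcie_x8 = 8 * PCIE2_LANE_MB        # HBA host interface
dd_total = 24 * 74                 # measured parallel dd throughput
ten_gbe = 10_000 // 8              # one 10GbE link, line rate

for name, value in [
    ("1 SAS cable", one_cable),
    ("2 SAS cables", two_cables),
    ("PCIe 2.0 x8", pcie_x8),
    ("24 x 74MB/s dd", dd_total),
    ("10GbE", ten_gbe),
]:
    print(f"{name:>15}: {value} MB/s")
```

On these numbers the measured 1776MB/s sits below the single-cable ceiling, and the one 10GbE link is the narrowest stage for sustained traffic.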

The cluster I'm currently building is based on storage nodes with 4 SSDs
(100GB DC 3700s, so 800MB/s would be the absolute write speed limit) and 8
HDDs. Connected with 40Gb/s Infiniband. Dual port, dual switch for
redundancy, not speed. ^^

> >>> Impressive, even given your huge cluster with 1128 OSDs.
> >>> However that's not really answering my question, how much data is on
> >>> an average OSD and thus gets backfilled in that hour?
> >>
> >> That's true -- our drives have around 300GB on them. So I guess it
> >> will take longer - 3x longer - when the drives are 1TB full.
> >
> > On your slides, when the crazy user filled the cluster with 250 million
> > objects and thus 1PB of data, I recall seeing a 7 hour backfill time?
> >
>
> Yeah that was fun :) It was 250 million (mostly) 4k objects, so not
> close to 1PB. The point was that to fill the cluster with RBD, we'd need
> 250 million (4MB) objects. So, object-count-wise this was a full
> cluster, but for the real volume it was more like 70TB IIRC (there were
> some other larger objects too).
>
Ah, I see. ^^
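
The difference between the two readings of "250 million objects" is easy to see in numbers. A small sketch, assuming decimal units (1MB = 10^6 bytes) and uniform object sizes; the ~70TB real volume mentioned above included some larger objects and is not modelled here.

```python
# Object count vs. data volume for the cluster-filling test.
OBJECTS = 250_000_000
KB, MB, TB, PB = 10**3, 10**6, 10**12, 10**15  # decimal units (assumption)

rbd_bytes = OBJECTS * 4 * MB    # cluster filled with 4MB RBD objects
tiny_bytes = OBJECTS * 4 * KB   # same object count, 4k objects

print(f"250M x 4MB objects: {rbd_bytes / PB:.0f} PB")
print(f"250M x 4k objects:  {tiny_bytes / TB:.0f} TB")
```

Same object count, a factor of 1000 in volume, which is consistent with the backfill being bound by per-object work rather than by bytes moved.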

> In that case, the backfilling was CPU-bound, or perhaps
> wbthrottle-bound, I don't remember... It was just that there were many
> tiny tiny objects to synchronize.
>
Indeed. This is something me and others have seen as well, as in
backfilling being much slower than the underlying HW would permit and
being CPU intensive.

> > Anyway, I guess the lesson to take away from this is that size and
> > parallelism do indeed help, but even in a cluster like yours
> > recovering from a 2TB loss would likely be in the 10 hour range...
>
> Bigger clusters probably backfill faster simply because there are more
> OSDs involved in the backfilling. In our cluster we initially get 30-40
> backfills in parallel after 1 OSD fails. That's even with max backfills
> = 1. The backfilling sorta follows an 80/20 rule -- 80% of the time is
> spent backfilling the last 20% of the PGs, just because some OSDs
> randomly get more new PGs than the others.
>
You still being on dumpling probably doesn't help that uneven distribution
bit.
Definitely another data point to go into a realistic recovery/reliability
model, though.

Christian

> > Again, see the "Best practice K/M-parameters EC pool" thread. ^.^
>
> Marked that one to read, again.
>
> Cheers, dan
>

--
Christian Balzer        Network/Systems Engineer
chibi at gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com