SSD journal deployment experiences

On Sat, 6 Sep 2014 13:07:27 +0000 Dan van der Ster wrote:

> Hi Christian,
> 
> Let's keep debating until a dev corrects us ;)
> 
For the time being, I'll give you the recent:

https://www.mail-archive.com/ceph-users at lists.ceph.com/msg12203.html

And not so recent:
http://www.spinics.net/lists/ceph-users/msg04152.html
http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021

And I'm not going to use BTRFS for mainly RBD-backed VM images
(fragmentation city), never mind the other stability issues that crop up
here every so often.

> September 6 2014 1:27 PM, "Christian Balzer" <chibi at gol.com> wrote: 
> > On Fri, 5 Sep 2014 09:42:02 +0000 Dan Van Der Ster wrote:
> > 
> >>> On 05 Sep 2014, at 11:04, Christian Balzer <chibi at gol.com> wrote:
> >>> 
> >>> On Fri, 5 Sep 2014 07:46:12 +0000 Dan Van Der Ster wrote:
> >>>> 
> >>>>> On 05 Sep 2014, at 03:09, Christian Balzer <chibi at gol.com> wrote:
> >>>>> 
> >>>>> On Thu, 4 Sep 2014 14:49:39 -0700 Craig Lewis wrote:
> >>>>> 
> >>>>>> On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster
> >>>>>> <daniel.vanderster at cern.ch> wrote:
> >>>>>> 
> > 
> > [snip]
> > 
> >>>>>>> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how
> >>>>>>> painful is the backfilling which results from an SSD failure?
> >>>>>>> Have you considered tricks like increasing the down out interval
> >>>>>>> so backfilling doesn't happen in this case (leaving time for the
> >>>>>>> SSD to be replaced)?
> >>>>>>> 
> >>>>>> 
> >>>>>> Replacing a failed SSD won't help your backfill. I haven't
> >>>>>> actually tested it, but I'm pretty sure that losing the journal
> >>>>>> effectively corrupts your OSDs. I don't know what steps are
> >>>>>> required to complete this operation, but it wouldn't surprise me
> >>>>>> if you need to re-format the OSD.
> >>>>>> 
> >>>>> This.
> >>>>> All the threads I've read about this indicate that journal loss
> >>>>> during operation means OSD loss. Total OSD loss, no recovery.
> >>>>> From what I gathered the developers are aware of this and it might
> >>>>> be addressed in the future.
> >>>>> 
> >>>> 
> >>>> I suppose I need to try it then. I don't understand why you can't
> >>>> just use ceph-osd -i 10 --mkjournal to rebuild osd 10's journal, for
> >>>> example.
> >>>> 
> >>> I think the logic is that if you shut down an OSD cleanly beforehand
> >>> you can just do that.
> >>> However from what I gathered there is no logic to re-issue
> >>> transactions that made it to the journal but not the filestore.
> >>> So a journal SSD failing mid-operation with a busy OSD would
> >>> certainly be in that state.
> >>> 
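(For the record, the clean-shutdown journal move would look roughly like
this, with osd.10 just as per your example and the paths/init syntax
varying by distro:

    /etc/init.d/ceph stop osd.10
    ceph-osd -i 10 --flush-journal   # write out anything still only in the journal
    # repoint /var/lib/ceph/osd/ceph-10/journal at the new device/partition
    ceph-osd -i 10 --mkjournal       # create a fresh, empty journal
    /etc/init.d/ceph start osd.10

That only works because the flush happens while everything is still
consistent; a journal SSD dying mid-flight gives you no such luxury.)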
> >> 
> >> I had thought that the journal write and the buffered filestore write
> >> happen at the same time.
> > 
> > Nope, definitely not.
> > 
> > That's why we have tunables like the ones at:
> > http://ceph.com/docs/master/rados/configuration/filestore-config-ref/#synchronization-intervals
> > 
> > And people (me included) tend to crank that up (to eleven ^o^).
> > 
> > The write-out to the filestore may start roughly at the same time as
> > the journal gets things, but it can and will fall behind.
> > 
> 
> filestore max sync interval is the period between the fsync/fdatasync's
> of the outstanding filestore writes, which were sent earlier. By the
> time the sync interval arrives, the OS may have already flushed those
> writes (sysctl's like vm.dirty_ratio, dirty_expire_centisecs, ... apply
> here). And even if the osd crashes and never calls fsync, then the OS
> will flush those anyway. Of course, if a power outage prevents the fsync
> from ever happening, then the journal entry replay is used to re-write
> the op. The other thing about filestore max sync interval is that
> journal entries are only free'd after the osd has fsync'd the related
> filestore write. That's why the journal size depends on the sync
> interval.
> 
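That matches my understanding. And since I mentioned cranking things up,
for reference the knobs in question live in the [osd] section of
ceph.conf, something along these lines (the values are merely what I
happen to run, not a recommendation):

    [osd]
    filestore min sync interval = 0.1
    filestore max sync interval = 30

And as you say, the journal then needs to be sized to absorb a full sync
interval at full write speed.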
> 
> >> So all the previous journal writes that
> >> succeeded are already on their way to the filestore. My (could be
> >> incorrect) understanding is that the real purpose of the journal is to
> >> be able to replay writes after a power outage (since the buffered
> >> filestore writes would be lost in that case). If there is no power
> >> outage, then filestore writes are still good regardless of a journal
> >> failure.
> > 
> > From Ceph's perspective a write is successful once it is in the journals
> > of all "size" replicas.
> 
> This is the key point - which I'm not sure about and don't feel like
> reading the code on a Saturday ;) Is a write ack'd after a successful
> journal write, or after the journal _and_ the buffered filestore writes?
> Is that documented somewhere?
> 
http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/

Search for "acknowledgement" if you don't want to read the full thing. ^o^
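You can also watch the two stages separately on a live OSD via the admin
socket, something like this (osd.10 again just as an example):

    ceph daemon osd.10 perf dump | python -mjson.tool \
      | grep -A2 -E 'journal_latency|apply_latency'

That the journal latency and the (usually much larger) apply latency are
tracked as two separate things at least fits the ack-on-journal-commit
model.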

> 
> > I think (hope) that what you wrote up there is true, but that
> > doesn't change the fact that journal data which hasn't even been sent
> > to the filestore yet is the crux here.
> > 
> >>> I'm sure (hope) somebody from the Ceph team will pipe up about this.
> >> 
> >> Ditto!
> > 
> > Guess it will be next week...
> > 
> >>>>> Now 200GB DC S3700s can write close to 400MB/s so a 1:4 or even 1:5
> >>>>> ratio is sensible. However these will be the ones limiting your max
> >>>>> sequential write speed if that is of importance to you. In nearly
> >>>>> all use cases you run out of IOPS (on your HDDs) long before that
> >>>>> becomes an issue, though.
> >>>> 
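(Put differently: 4-5 spinners at 100-150MB/s each add up to 400-750MB/s
of sequential writes at best, against the ~400MB/s of a single 200GB DC
S3700, so the SSD only becomes the ceiling right around that ratio, and
only for purely streaming writes.)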
> >>>> IOPS is definitely the main limit, but we also only have a single
> >>>> 10Gig-E NIC on these servers, so 4 drives that can write (even only
> >>>> 200MB/s) would be good enough.
> >>>> 
> >>> Fair enough. ^o^
> >>> 
> >>>> Also, we'll put the SSDs in the first four ports of an SAS2008 HBA
> >>>> which is shared with the other 20 spinning disks. Counting the
> >>>> double writes, the HBA will run out of bandwidth before these SSDs,
> >>>> I expect.
> >>>> 
> >>> Depends on what PCIe slot it is and so forth. A 2008 should give you
> >>> 4GB/s, enough to keep the SSDs happy at least. ^o^
> >>> 
> >>> A 2008 has only 8 SAS/SATA ports, so are you using port expanders on
> >>> your case backplane?
> >>> In that case you might want to spread the SSDs out over channels, as
> >>> in have 3 HDDs sharing one channel with one SSD.
> >> 
> >> We use a Promise VTrak J830sS, and now I'll go ask our hardware team
> >> if there would be any benefit to arranging the SSDs row- or column-wise.
> > 
> > Ah, a storage pod. So you have that and a real OSD head server,
> > something like a 1U machine or Supermicro Twin?
> > Looking at the specs of it I would assume 3 drives per expander, so
> > having one SSD mixed with 2 HDDs should definitely be beneficial.
> > 
> 
> We have 4 servers in a 3U rack, then each of those servers is connected
> to one of these enclosures with a single SAS cable. 
> 
> >> With the current config, when I dd to all drives in parallel I can
> >> write at 24*74MB/s = 1776MB/s.
> > 
> > That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 2.0
> > lanes, so as far as that bus goes, it can do 4GB/s.
> > And given your storage pod I assume it is connected with 2 mini-SAS
> > cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s SATA
> > bandwidth.
> 
> From above, we are only using 4 lanes -- so around 2GB/s is expected.
> 
Alright, that explains that then. Any reason for not using both ports?
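(Quick arithmetic: 4 lanes x 6Gb/s is 24Gb/s on the wire, and after 8b/10b
encoding that leaves roughly 2.4GB/s of payload, so the ~1.8GB/s you
measured over a single cable is in the right ballpark.)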

> > 
> > How fast can your "eco 5900rpm" drives write individually?
> > If it is significantly more than 74MB/s (I couldn't find any specs or
> > reviews of those drives on the net), I would really want to know where
> > that bottleneck is.
> 
> Around 120MB/s up to 4-5 drives. Then the per-drive speed starts
> decreasing to a low of 74MB/s when all 24 are used.
> 
With the HDDs the impact isn't that bad, since you're likely to be IOPS
(seek, seek and seek again) bound with them anyway.
But once the SSDs enter the picture, getting that 2nd cable in place (if
possible) might very well be worth it.
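Incidentally, once the second cable is in, the before/after comparison is
easiest with the same kind of parallel dd you quoted, something along
these lines (destructive on raw devices of course, and the device names
are just an example):

    for d in /dev/sd[b-y]; do
        dd if=/dev/zero of=$d bs=4M count=2048 oflag=direct &
    done
    wait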

[snip]
> >>>>> c) Configure the various backfill options to have only a small
> >>>>> impact. Journal SSDs will improve things compared to your current
> >>>>> situation. And if I recall correctly, you're using a replica size
> >>>>> of 3 to 4, so you can afford a more sedate recovery.
> >>>> 
> >>>> It's already at 1 backfill, 1 recovery, and the lowest queue
> >>>> priority (1/63) for recovery IOs.
> >>>> 
> >>> So how long does that take you to recover 1TB then in the case of a
> >>> single OSD failure?
> >> 
> >> Single OSD failures take us ~1 hour to backfill. The 24 OSD failure
> >> took ~2 hours to backfill.
> > 
> > Impressive, even given your huge cluster with 1128 OSDs.
> > However that's not really answering my question: how much data is on an
> > average OSD and thus gets backfilled in that hour?
> 
> That's true -- our drives have around 300TB on them. So I guess it will
> take longer - 3x longer - when the drives are 1TB full.
> 

On your slides, when the crazy user filled the cluster with 250 million
objects and thus 1PB of data, I recall seeing a 7 hour backfill time?

Anyway, I guess the lesson to take away from this is that size and
parallelism do indeed help, but even in a cluster like yours recovering
from a 2TB loss would likely be in the 10 hour range...

Again, see the "Best practice K/M-parameters EC pool" thread. ^.^
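And for the archives, the knobs being juggled in this sub-thread are, as
far as I know, these (the values just mirror what was discussed above,
not gospel):

    [osd]
    osd max backfills = 1
    osd recovery max active = 1
    # lowest priority, range is 1-63
    osd recovery op priority = 1

    [mon]
    # seconds before a down OSD is marked out; raise it to buy time for an SSD swap
    mon osd down out interval = 600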

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

