SSD journal deployment experiences

Hi Christian,

Let's keep debating until a dev corrects us ;)

September 6 2014 1:27 PM, "Christian Balzer" <chibi at gol.com> wrote: 
> On Fri, 5 Sep 2014 09:42:02 +0000 Dan Van Der Ster wrote:
> 
>>> On 05 Sep 2014, at 11:04, Christian Balzer <chibi at gol.com> wrote:
>>> 
>>> On Fri, 5 Sep 2014 07:46:12 +0000 Dan Van Der Ster wrote:
>>>> 
>>>>> On 05 Sep 2014, at 03:09, Christian Balzer <chibi at gol.com> wrote:
>>>>> 
>>>>> On Thu, 4 Sep 2014 14:49:39 -0700 Craig Lewis wrote:
>>>>> 
>>>>>> On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster
>>>>>> <daniel.vanderster at cern.ch> wrote:
>>>>>> 
> 
> [snip]
> 
>>>>>>> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful
>>>>>>> is the backfilling which results from an SSD failure? Have you
>>>>>>> considered tricks like increasing the down out interval so
>>>>>>> backfilling doesn't happen in this case (leaving time for the SSD
>>>>>>> to be replaced)?
>>>>>>> 
>>>>>> 
>>>>>> Replacing a failed SSD won't help your backfill. I haven't actually
>>>>>> tested it, but I'm pretty sure that losing the journal effectively
>>>>>> corrupts your OSDs. I don't know what steps are required to
>>>>>> complete this operation, but it wouldn't surprise me if you need to
>>>>>> re-format the OSD.
>>>>>> 
>>>>> This.
>>>>> All the threads I've read about this indicate that journal loss
>>>>> during operation means OSD loss. Total OSD loss, no recovery.
>>>>> From what I gathered the developers are aware of this and it might be
>>>>> addressed in the future.
>>>>> 
>>>> 
>>>> I suppose I need to try it then. I don't understand why you can't just
>>>> use ceph-osd -i 10 --mkjournal to rebuild osd 10's journal, for
>>>> example.
>>>> 
>>> I think the logic is if you shut down an OSD cleanly beforehand you can
>>> just do that.
>>> However from what I gathered there is no logic to re-issue transactions
>>> that made it to the journal but not the filestore.
>>> So a journal SSD failing mid-operation with a busy OSD would certainly
>>> be in that state.
>>> 
>> 
>> I had thought that the journal write and the buffered filestore write
>> happen at the same time.
> 
> Nope, definitely not.
> 
> That's why we have tunables like the ones at:
> http://ceph.com/docs/master/rados/configuration/filestore-config-ref/#synchronization-intervals
> 
> And people (me included) tend to crank that up (to eleven ^o^).
> 
> The write-out to the filestore may start roughly at the same time as the
> journal gets things, but it can and will fall behind.
> 

filestore max sync interval is the period between the fsync/fdatasync calls for the outstanding filestore writes, which were sent earlier. By the time the sync interval arrives, the OS may have already flushed those writes (sysctls like vm.dirty_ratio, dirty_expire_centisecs, ... apply here). And even if the OSD crashes and never calls fsync, the OS will flush them anyway. Of course, if a power outage prevents the fsync from ever happening, then the journal replay is used to re-write the op. The other thing about filestore max sync interval is that journal entries are only freed after the OSD has fsync'd the related filestore write. That's why the journal size depends on the sync interval.
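To make that last point concrete, here's a quick back-of-the-envelope sketch in Python of the sizing rule of thumb from the Ceph docs (I haven't verified it against the code, so treat the factor of 2 as an assumption):

    # Rule of thumb from the Ceph docs: the journal has to hold everything
    # written between two filestore syncs, with some headroom.
    def min_journal_size_mb(expected_throughput_mb_s, filestore_max_sync_interval_s):
        return 2 * expected_throughput_mb_s * filestore_max_sync_interval_s

    # e.g. an OSD whose disk streams ~100 MB/s, with the sync interval
    # cranked up to 30s, wants a journal of at least:
    print(min_journal_size_mb(100, 30))   # -> 6000 (MB)

So cranking the sync interval up to eleven also means sizing the journal up accordingly.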


>> So all the previous journal writes that
>> succeeded are already on their way to the filestore. My (could be
>> incorrect) understanding is that the real purpose of the journal is to
>> be able to replay writes after a power outage (since the buffered
>> filestore writes would be lost in that case). If there is no power
>> outage, then filestore writes are still good regardless of a journal
>> failure.
> 
> From Ceph's perspective a write is successful once it is on the journals
> of all replicas (i.e. "size" copies).

This is the key point - which I'm not sure about and don't feel like reading the code on a Saturday ;) Is a write ack'd after a successful journal write, or after the journal _and_ the buffered filestore writes? Is that documented somewhere?


> I think (hope) that what you wrote up there is true, but that doesn't
> change the fact that journal data that isn't even on the way to the
> filestore yet is the crux here.
> 
>>> I'm sure (hope) somebody from the Ceph team will pipe up about this.
>> 
>> Ditto!
> 
> Guess it will be next week...
> 
>>>>> Now 200GB DC 3700s can write close to 400MB/s so a 1:4 or even 1:5
>>>>> ratio is sensible. However these will be the ones limiting your max
>>>>> sequential write speed if that is of importance to you. In nearly all
>>>>> use cases you run out of IOPS (on your HDDs) long before that becomes
>>>>> an issue, though.
>>>> 
>>>> IOPS is definitely the main limit, but we also have only a single
>>>> 10Gig-E NIC on these servers, so 4 drives that can each write (even
>>>> only 200MB/s) would be good enough.
>>>> 
>>> Fair enough. ^o^
>>> 
>>>> Also, we'll put the SSDs in the first four ports of an SAS2008 HBA
>>>> which is shared with the other 20 spinning disks. Counting the double
>>>> writes, the HBA will run out of bandwidth before these SSDs, I expect.
>>>> 
>>> Depends on what PCIe slot it is and so forth. A 2008 should give you
>>> 4GB/s, enough to keep the SSDs happy at least. ^o^
>>> 
>>> A 2008 has only 8 SAS/SATA ports, so are you using port expanders on
>>> your case backplane?
>>> In that case you might want to spread the SSDs out over channels, as in
>>> have 3 HDDs sharing one channel with one SSD.
>> 
>> We use a Promise VTrak J830sS, and now I'll go ask our hardware team if
>> there would be any benefit to arranging the SSDs row-wise or column-wise.
> 
> Ah, a storage pod. So you have that and a real OSD head server, something
> like a 1U machine or Supermicro Twin?
> Looking at the specs of it I would assume 3 drives per expander, so having
> one SSD mixed with 2 HDDs should definitely be beneficial.
> 

We have 4 servers in a 3U rack, and each of those servers is connected to one of these enclosures with a single SAS cable. 

>> With the current config, when I dd to all drives in parallel I can write
>> at 24*74MB/s = 1776MB/s.
> 
> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 2.0 lanes,
> so as far as that bus goes, it can do 4GB/s.
> And given your storage pod I assume it is connected with 2 mini-SAS
> cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s SATA bandwidth.
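
For reference, a rough sketch of those bandwidth figures in Python (nominal per-lane numbers, so this is an estimate, not a measurement):

    # Nominal link speeds - rough numbers only.
    pcie2_lane_mb_s = 500                            # PCIe 2.0: ~500 MB/s usable per lane
    hba_bus_mb_s = 8 * pcie2_lane_mb_s               # SAS2008 in an x8 slot: ~4000 MB/s

    sas_lane_gb_s = 6                                # SAS/SATA at 6 Gb/s per lane
    two_cables_gb_s = 2 * 4 * sas_lane_gb_s          # 2 mini-SAS cables, 4 lanes each: 48 Gb/s
    one_cable_mb_s = 4 * sas_lane_gb_s * 1000 // 10  # 1 cable after 8b/10b: ~2400 MB/s

    observed_mb_s = 24 * 74                          # the parallel dd above: 1776 MB/s
    print(hba_bus_mb_s, two_cables_gb_s, one_cable_mb_s, observed_mb_s)

Since in our setup each enclosure hangs off a single cable, that ~2400 MB/s link is probably closer to what the parallel dd is hitting than the HBA's 4GB/s PCIe limit.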

