SSD journal deployment experiences

September 6 2014 4:01 PM, "Christian Balzer" <chibi at gol.com> wrote: 
> On Sat, 6 Sep 2014 13:07:27 +0000 Dan van der Ster wrote:
> 
>> Hi Christian,
>> 
>> Let's keep debating until a dev corrects us ;)
> 
> For the time being, I give the recent:
> 
> https://www.mail-archive.com/ceph-users at lists.ceph.com/msg12203.html
> 
> And not so recent:
> http://www.spinics.net/lists/ceph-users/msg04152.html
> http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021
> 
> And I'm not going to use BTRFS for mainly RBD backed VM images
> (fragmentation city), never mind the other stability issues that crop up
> here ever so often.


Thanks for the links... So until I learn otherwise, I'd better assume the OSD is lost when its journal fails, even though I haven't understood exactly why :(
I'm going to UTSL to understand the consistency handling better. An op state diagram would help, but I haven't found one yet.

BTW, do you happen to know: _if_ we re-use an OSD after its journal has failed, would any object inconsistencies be found by a scrub/deep-scrub?
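
In case it's useful to anyone else, here's a minimal sketch of what I'd run to check that: queue a deep-scrub on every PG that maps to the re-used OSD, then watch 'ceph health detail' for inconsistencies. It just shells out to the standard ceph CLI; the OSD id 12 is only an example.

#!/usr/bin/env python
# Sketch: deep-scrub all PGs whose acting set includes a given OSD.
# Assumes the 'ceph' CLI is in PATH and the client has the needed caps.
import json
import subprocess
import sys

osd = int(sys.argv[1]) if len(sys.argv) > 1 else 12  # example OSD id

# 'ceph pg dump' reports every PG, including its acting set, as JSON.
dump = json.loads(subprocess.check_output(
    ['ceph', 'pg', 'dump', '--format=json']))

for pg in dump['pg_stats']:
    if osd in pg['acting']:
        # Queue a deep scrub; any mismatches show up as inconsistent PGs.
        subprocess.check_call(['ceph', 'pg', 'deep-scrub', pg['pgid']])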

>> 
>> We have 4 servers in a 3U rack, then each of those servers is connected
>> to one of these enclosures with a single SAS cable.
>> 
>>>> With the current config, when I dd to all drives in parallel I can
>>>> write at 24*74MB/s = 1776MB/s.
>>> 
>>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 2.0
>>> lanes, so as far as that bus goes, it can do 4GB/s.
>>> And given your storage pod I assume it is connected with 2 mini-SAS
>>> cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s SATA
>>> bandwidth.
>> 
>> From above, we are only using 4 lanes -- so around 2GB/s is expected.
> 
> Alright, that explains that then. Any reason for not using both ports?
> 

Probably to minimize costs, and because the single 10Gig-E link is the bottleneck anyway.
The whole setup is suboptimal, since this hardware was not purchased for Ceph to begin with.
Hence the retrofitting of SSDs, etc...
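
Just to put rough numbers on why the 10Gig-E is the limit, here's a quick back-of-the-envelope in Python using the figures from above (the encoding overheads are approximate):

# Figures from the thread above, all approximate.
drives, per_drive_MBps = 24, 74
disk_MBps = drives * per_drive_MBps        # ~1776 MB/s aggregate from the disks

lanes, lane_Gbps = 4, 6                    # one mini-SAS cable: 4 lanes at 6Gb/s
sas_MBps = lanes * lane_Gbps * 1000 / 10.0 # ~2400 MB/s after 8b/10b encoding

net_MBps = 10 * 1000 / 8.0                 # ~1250 MB/s on a single 10Gig-E link

print(disk_MBps, sas_MBps, net_MBps)       # the network is the tightest of the three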

>>> Impressive, even given your huge cluster with 1128 OSDs.
>>> However that's not really answering my question, how much data is on an
>>> average OSD and thus gets backfilled in that hour?
>> 
>> That's true -- our drives have around 300GB on them. So I guess it will
>> take longer -- 3x longer -- when the drives are 1TB full.
> 
> On your slides, when the crazy user filled the cluster with 250 million
> objects and thus 1PB of data, I recall seeing a 7 hour backfill time?
> 

Yeah that was fun :) It was 250 million (mostly) 4k objects, so nowhere close to 1PB. The point was that to fill the cluster with RBD we'd need 250 million (4MB) objects. So object-count-wise this was a full cluster, but in terms of real volume it was more like 70TB IIRC (there were some other larger objects too).

In that case, the backfilling was CPU-bound, or perhaps wbthrottle-bound, I don't remember... It was just that there were many tiny tiny objects to synchronize.

> Anyway, I guess the lesson to take away from this is that size and
> parallelism does indeed help, but even in a cluster like yours recovering
> from a 2TB loss would likely be in the 10 hour range...

Bigger clusters probably backfill faster simply because there are more OSDs involved in the backfilling. In our cluster we initially get 30-40 backfills in parallel after 1 OSD fails. That's even with max backfills = 1. The backfilling sorta follows an 80/20 rule -- 80% of the time is spent backfilling the last 20% of the PGs, just because some OSDs randomly get more new PGs than the others.
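
A toy simulation (made-up PG and OSD counts, not our real cluster) shows where that tail comes from: with max backfills = 1, the target OSD that randomly receives the most re-mapped PGs is the one that finishes last.

import random

# Toy model with assumed numbers: when an OSD fails, its PGs are re-mapped
# to random target OSDs. With 'osd max backfills = 1' each target drains its
# queue serially, so the busiest target determines the tail of the recovery.
random.seed(1)
pgs_to_move, targets = 200, 40   # e.g. 200 re-mapped PGs spread over 40 OSDs
queue = [0] * targets
for _ in range(pgs_to_move):
    queue[random.randrange(targets)] += 1

print(sorted(queue))
# A few unlucky OSDs end up with roughly twice the average queue length,
# which is what stretches the last PGs out into the long 80/20-style tail.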

> Again, see the "Best practice K/M-parameters EC pool" thread. ^.^

Marked that one to read, again.

Cheers, dan

