Re: Possible improvements for a slow write speed (excluding independent SSD journals)

On Mon, 20 Apr 2015 13:17:18 -0400 J-P Methot wrote:

> On 4/20/2015 11:01 AM, Christian Balzer wrote:
> > Hello,
> >
> > On Mon, 20 Apr 2015 10:30:41 -0400 J-P Methot wrote:
> >
> >> Hi,
> >>
> >> This is similar to another thread running right now, but since our
> >> current setup is completely different from the one described in the
> >> other thread, I thought it may be better to start a new one.
> >>
> >> We are running Ceph Firefly 0.80.8 (soon to be upgraded to 0.80.9). We
> >> have 6 OSD hosts with 16 OSDs each (so a total of 96 OSDs). Each OSD
> >> is a Samsung SSD 840 EVO on which I can reach write speeds of roughly
> >> 400 MB/sec, plugged in as JBOD on a controller that can theoretically
> >> transfer at 6 Gb/s. All of that is linked to OpenStack compute nodes
> >> on two bonded 10 Gbps links (so a max transfer rate of 20 Gbps).
> >>
> > I sure as hell hope you're not planning to write all that much to this
> > cluster.
> > But then again you're worried about write speed, so I guess you do.
> > Those _consumer_ SSDs will be dropping like flies; there are a number
> > of threads about them here.
> >
> > They also might be of the kind that doesn't play well with O_DSYNC; I
> > can't recall for sure right now, check the archives.
> >   
> > Consumer SSDs universally tend to slow down quite a bit when not
> > TRIM'ed and/or subjected to prolonged writes, like those generated by
> > a benchmark.
> I see, yes it looks like these SSDs are not the best for the job. We 
> will not change them for now, but if they start failing, we will replace 
> them with better ones.
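
Before deciding on replacements, it is worth checking the O_DSYNC point
above directly: journal writes are basically small synchronous writes, and
a quick dd run against a scratch file on one of the OSD filesystems shows
how the 840 EVOs cope with that. The path below is just an example, adjust
it to your layout; in the threads mentioned, consumer Samsungs reportedly
drop to a few MB/s on this kind of test while DC-class drives do not.

  # rough O_DSYNC write check, run on an OSD node (creates a small test file)
  dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/dsync-test bs=4k count=20000 oflag=direct,dsync
  rm /var/lib/ceph/osd/ceph-0/dsync-test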

If the recent/current thread "replace dead SSD journal" is any indication,
you may get no warning from SMART about them failing, and certainly not at
the point one would expect based on wear leveling.
There were other threads about Samsung SSDs in this vein as well.
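
If you do keep them in for now, the wear and reallocation counters are
still worth watching, even if they may not predict failure reliably. A
minimal sketch, assuming 16 drives per node as sda..sdp (attribute names
vary by vendor; these are the ones Samsung drives usually expose):

  for dev in /dev/sd{a..p}; do
      echo "== $dev =="
      smartctl -A $dev | egrep -i 'wear_leveling|reallocated|lbas_written'
  done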

If you can afford it, replace half of them now and create a CRUSH rule that
puts the old and the new drives into separate failure domains. Intel DC
S3610s come to mind, but that REALLY depends on your write levels and
budget, of course.
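
A rough sketch of what the CRUSH side could look like, assuming you
re-equip whole hosts and the (made-up) names node4-node6 get the new
drives; whether you then point pools at one branch or write a rule that
spans both is up to you:

  # separate branch for the hosts with DC-grade SSDs
  ceph osd crush add-bucket dc-ssd root
  ceph osd crush move node4 root=dc-ssd
  ceph osd crush move node5 root=dc-ssd
  ceph osd crush move node6 root=dc-ssd
  # simple rule selecting hosts from that branch only
  ceph osd crush rule create-simple dc-ssd-rule dc-ssd host
  # assign it to a pool (rule id via 'ceph osd crush rule dump')
  ceph osd pool set <poolname> crush_ruleset <rule-id>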


> >> When I run rados bench from the compute nodes, I reach the network cap
> >> in read speed. However, write speeds are vastly inferior, reaching
> >> about 920 MB/sec. If I have 4 compute nodes running the write
> >> benchmark at the same time, I can see the number plummet to 350
> >> MB/sec. For our planned usage, we find it to be rather slow,
> >> considering we will run a high number of virtual machines in there.
> >>
> > If you'd researched the archives about SSD-based clusters, you'd be
> > aware that things are likely going to be CPU bound, especially with
> > Firefly or older.
> > Verify that with atop or the like: how busy are your CPUs and SSDs on
> > those nodes when running these benchmarks?
> >
> > Also, default rados bench parameters?
> >
> > You'll get a much more realistic result with fio, especially with fio
> > from inside VMs.
> I did fio tests from inside VMs in the past and they were pretty much in
> line with the results I get from rados bench. I am using rados bench
> default parameters.
> 
So 4 MB "blocks", optimized for Ceph.
While this is a good indicator of what your bandwidth/throughput is, your
VMs are potentially going to do very different IO (especially when it
comes to non-cacheable, small writes).
These are also the ones that will melt your CPUs, especially if
that's a single E5-2620 per node.
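
To get a feel for that, something along these lines run inside one of the
VMs is a lot more telling than 4 MB streams; a sketch only, adjust file
name, size and runtime to taste:

  # small synchronous random writes, the kind that hurt most
  fio --name=small-sync-writes --filename=/root/fio-test --size=2G \
      --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
      --direct=1 --fsync=1 --runtime=60 --time_based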

> The load numbers on the OSD nodes aren't that high. I'm getting a load
> of 3 on a six-core E5-2620, which is not that high for such a CPU.
> Regardless, I am currently upgrading to Hammer in the hope of at least
> making some progress.
>
Load is not a precise measurement, but it does suggest that your bottleneck
is indeed the SSDs and not the CPUs.
Again, verify that with atop for an encompassing picture or iostat for
your storage in particular.
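
During a benchmark run on an OSD node, even something as simple as this
will tell you which it is; look at %util and await for the SSDs and
whether any cores are pegged:

  atop 5          # everything at a glance, 5 second intervals
  iostat -x 5     # extended per-device statistics every 5 seconds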

> >> Of course, the first thing to do would be to transfer the journals to
> >> faster drives. However, these are SSDs we're talking about. We don't
> >> really have access to faster drives. I must find a way to get better
> >> write speeds. Thus, I am looking for suggestions as to how to make it
> >> faster.
> >>
> >> I have also thought of options myself like:
> >> -Upgrading to the latest stable Hammer version (would that really give
> >> me a big performance increase?)
> > If you're CPU bound right now, yes.
> > Don't expect miracles, though.
> > From what I've seen so far (no test cluster of the SSD kind myself) it
> > seems 50% faster, maybe more.
> >
> >> -CRUSH map modifications? (this is a long shot, but I'm still using
> >> the default CRUSH map; maybe there's a change there I could make to
> >> improve performance)
> >>
> > You have uniform OSDs, so not really.
> >
> >> Any suggestions as to anything else I can tweak would be strongly
> >> appreciated.
> >>
> >> For reference, here's part of my ceph.conf:
> >>
> >> [global]
> >> auth_service_required = cephx
> >> filestore_xattr_use_omap = true
> >> auth_client_required = cephx
> >> auth_cluster_required = cephx
> >> osd pool default size = 3
> >>
> >>
> >> osd pg bits = 12
> >> osd pgp bits = 12
> >> osd pool default pg num = 800
> >> osd pool default pgp num = 800
> >>
> > How many pools do you have?
> > For your cluster a total of 8192 would be about right.
> > PGs should be power-of-2 values if possible, so with 8-10 pools (of
> > equal expected data size) a value of 1024 per pool would be better.
> >
> > Regards,
> >
> > Christian
> Running 10 pools right now, but 2 are for test purposes. Once the two
> test pools are removed, I'm at around 8000 PGs, which is about what you
> suggested. I can increase the PG number to 1024 per pool, but I'm not
> convinced a change of 200 PGs will show a drastic improvement in speed.
> >   
That's a general optimization that makes things easier for the CRUSH
algorithm and thus improves data distribution; it's not about raw speed.
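
For reference, the arithmetic and the knobs involved: 8 pools x 1024 PGs
= 8192 PGs, times 3 replicas = 24576 PG copies over 96 OSDs, i.e. roughly
256 per OSD. pg_num can only ever be increased, and increasing it moves
data around, so do it one pool at a time and outside peak hours:

  ceph osd pool set <poolname> pg_num 1024
  ceph osd pool set <poolname> pgp_num 1024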

Christian

> >> [client]
> >> rbd cache = true
> >> rbd cache writethrough until flush = true
> >>
> >> [osd]
> >> filestore_fd_cache_size = 1000000
> >> filestore_omap_header_cache_size = 1000000
> >> filestore_fd_cache_random = true
> >> filestore_queue_max_ops = 5000
> >> journal_queue_max_ops = 1000000
> >> max_open_files = 1000000
> >> osd journal size = 10000
> >>
> >
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



