Re: Replication strategy, write throughput

On Tue, 8 Nov 2016 08:55:47 +0100 Andreas Gerstmayr wrote:

> 2016-11-07 3:05 GMT+01:00 Christian Balzer <chibi@xxxxxxx>:
> >
> > Hello,
> >
> > On Fri, 4 Nov 2016 17:10:31 +0100 Andreas Gerstmayr wrote:
> >
> >> Hello,
> >>
> >> I'd like to understand how replication works.
> >> In the paper [1] several replication strategies are described, and
> >> according to a (bit old) mailing list post [2] primary-copy is used.
> >> Therefore the primary OSD waits until the object is persisted and then
> >> updates all replicas in parallel.
> >>
> > Given that nobody who actually knows the answers to this has piped up, I
> > shall give it a go.
> >
> 
> Thanks for your response.
> 
> > The "persisted" above means "written to the journal" and the parallel bit
> > is quite the mystery to me as well.
> > As in, it's clearly not sequential, but one most likely gets kicked off
> > before the other and I for one am uncertain how much delay that entails.
> > But it's unlikely to be the culprit, see below.
> >
> >> Current cluster setup:
> >> Ceph jewel 10.2.3
> >> 6 storage nodes
> >> 24 HDDs each, journal on same disk [3]
> > Unrelated to the topic at hand, journal on same HDD is a recipe for pain.
> > Your fio result against a single HDD is fine (quite good actually), but
> > that is ignoring a lot of complexity in real (Ceph) life.
> >
> >> Frontend network: 10 Gbit/s
> >> Backend network: 2 x 10 Gbit/s bonded with layer3+4 hashing [4]
> > Bandwidth is one thing, latency another.
> > With some setups, and some switches in particular, LACP can add
> > noticeable latency.
> >
> >> CephFS with striping: 1M stripe unit, 10 stripe count, 10M object size
> >>
> > I know nothing about CephFS and how this striping relates to actual RADOS
> > striping, but having smaller stripe units will usually be better for
> > latency, not bandwidth.
> >
> >> My assumption was that there should be no difference whether I write
> >> to replication 2 or 3, because each storage node can accept 10 Gbit/s
> >> traffic from the frontend network and send 10 Gbit/s traffic
> >> simultaneous to two other storage nodes.
> >>
> > Theory, meet reality.
> > The biggest impact when increasing replication size is latency (even when
> > going from 2 to 3, thus my question marks about how parallel these things
> > really are).
> > So while that will affect your large sequential writes somewhat, it's not
> > what's causing your results.
> >
> >> Disk write capacity shouldn't be a problem either:
> >> 200 MB/s throughput * 6 nodes * 24 disks / 2 (journal) / 3 replicas = 4800 MB/s.
> >>
> > This is where your expectations and estimations start to fall apart.
> > The above is overly optimistic and simplistic.
> >
> > It assumes:
> > 1. Single, sequential writes over the whole cluster (the biggest flaw)
> > 2. Sequential writes within the OSD (they're not; they are individual 4MB,
> > or in your case maybe 1MB, file writes).
> > 3. No overhead from FS journaling and the resulting double writes and seeks.
> >
> 
> Yes, the writes of the OSD are 1 MB in size (checked with strace).
> Thanks for the hint, you're right, I didn't factor in the (double) seek latency.
> 
> > I personally feel happy when individual HDD based OSDs WITH SSD journals
> > can maintain about 60MB/s writes.
> >
> 
> Only 60 MB/s? Is this because the writes of the OSDs are in reality
> more random than sequential?
Basically, yes.

> Do you have a tip for a fio workload which matches the "real" OSD
> workload the best?
Not perfectly, no.

> 2 parallel jobs, with one job simulating the journal (sequential
> writes, ioengine=libaio, direct=1, sync=1, iodepth=128, bs=1MB) and the
> other job simulating the datastore (random writes of 1MB)?
>
To test against a single HDD?
Yes, something like that, though the first fio job would need to go against a
raw partition, and the iodepth isn't anywhere near that high with a journal;
in theory it's actually 1 (some Ceph developer please pipe up here).

The 2nd fio job needs to run against an actual FS, and the bs for both should
match your stripe unit size for sequential tests.
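
Rough, untested sketch of what I mean (the device and mountpoint are
placeholders, and the first command overwrites whatever partition you point
it at, so use a scratch disk):

  # journal-like load: sequential, direct, synced 1MB writes to a raw partition
  fio --name=journal-sim --filename=/dev/sdX2 --rw=write --bs=1M \
      --ioengine=libaio --direct=1 --sync=1 --iodepth=1 \
      --runtime=300 --time_based

  # datastore-like load: 1MB random writes to a file on a mounted FS
  fio --name=filestore-sim --directory=/mnt/osd-scratch --rw=randwrite --bs=1M \
      --size=10G --ioengine=libaio --iodepth=1 --end_fsync=1 \
      --runtime=300 --time_based

Run both at the same time (background the first one) so the journal-style and
datastore-style writes fight over the same spindle, which is what your
journal-on-OSD-disk setup does all day long.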

What this setup misses, especially in the 2nd part, is that Ceph operates on
individual files which it has to create on the fly, may create or delete
subdirectories and trees, updates a leveldb[*] on the same FS, etc.

 
[*] see /var/lib/ceph/osd/ceph-nn/current/omap/
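
To get a feel for that extra bookkeeping, peek at an OSD data directory on
one of your nodes (replace nn with an actual OSD id), e.g.:

  du -sh /var/lib/ceph/osd/ceph-nn/current/omap/
  ls /var/lib/ceph/osd/ceph-nn/current/ | head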

> > Looking at your results below, that number seems to be around 40MB/s
> > for you in the single replication test and thus should be the baseline
> > for your expectations.
> >
> 
> 40 MB/s = 5695.33 MB/s / 144 OSDs?
Bingo.

> So your assumption is that the bottleneck for replication 1 are the disks?
Precisely.

> That would explain the 1/2 throughput for replication 2 and 1/3 for
> replication 3.
Doesn't it just. ^o^

> Last time I checked, the disks were well utilized (i.e. they were busy
> almost 100% of the time), but that doesn't equate to "can't accept more
> I/O operations".
Well, if it is really 100% busy and the next journal write has to wait
until all the seeking and SYNCing is done, then Ceph will block at this
point, of course.

> The throughput (as seen by iostat -xz 1) was way
> below the maximum.
Around 40MB/s, by any chance?

> But as you noted, I have to factor in the (double)
> seek latency (and probably a different access pattern) as well.
>
Very different patterns.
 
> > As always, I recommend running atop or the like on all your nodes with a
> > high (5s or less) interval in huge terminals when doing such tests.
> > You'll see how busy your individual OSDs are and that some are going to be
> > near or at their max capacity.
> >
> > With that in mind, remember that Ceph doesn't guarantee equal distribution
> > of anything and that the slowest OSD will determine the speed of the whole
> > thing (eventually).
> > So even with 144 OSDs you'll likely have some OSDs that are busier
> > (receiving primary and secondary writes at the same time) than others,
> > slowing things down for everyone.
> >
> 
> Another very good point, thanks! I didn't assume a perfectly equal
> distribution to begin with, and knew that the slowest disk determines the
> speed, but didn't think about the case of a single OSD receiving
> primary and secondary writes at the same time.
> 
> > Again, IF you're testing against CephFS and IF the striping does what I
> > think it does, you're creating more I/O and thus busier OSDs than a default
> > setup would.
> >
> 
> Yes, I'm testing against CephFS.
> How do I create more I/O than in a default setup?
>
Make that "more distributed I/O".
As in, you keep 4 times more OSDs busy than with the 4MB default stripe
size.
Which would be a good thing for small writes in an overall not very busy
cluster, since they hit different disks.
For sequential writes at full speed, not so much.
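
To put rough numbers on that (assuming I read the CephFS layout semantics
right): with stripe_unit=1M, stripe_count=10 and object_size=10M, a 40MB
sequential write gets chopped into 1MB chunks that round-robin over a set of
10 objects until each holds 10MB, so ~10 primary OSDs (plus their replicas)
are written to at the same time. With the default 4MB objects and no
striping, the same 40MB still ends up in 10 objects, but strictly one after
the other, 4MB at a time, so at any given moment only one primary (and its
replicas) is busy with that client.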
 
Christian
> >>
> >> Results with 7 clients:
> > Details please.
> > As in, what test (fio/dd?) did you use (exact command line), against what
> > (CephFS, mounted how?).
> >
> >
> >> Replication 1: 5695.33 MB/s
> >> Replication 2: 3337.09 MB/s
> >> Replication 3: 1898.17 MB/s
> >>
> >> Replication 2 is about 1/2 of Replication 1, and Replication 3 is
> >> exact 1/3 of Replication 1.
> >> Any hints what the bottleneck is in this case?
> >>
> > As hopefully made clear above, primarily your OSD bandwidth.
> >
> 
> Sorry, I completely forgot to add the command (executed in parallel on 7 clients):
> 
> fio --name=job1 --rw=write --blocksize=64K --numjobs=1 --runtime=300
> --size=400G --randrepeat=0 --fallocate=none --refill_buffers
> --end_fsync=1 --directory=/media/ceph/repl1_stripeunit-1M_stripecount-10_objectsize-10M/benchmarkfiles/<hostname_of_client>
> --group_reporting --write_bw_log=job --write_iops_log=job
> 
> Tested against CephFS with default mount settings.
> I already repeated the tests with 1 MB as blocksize to match the
> stripe unit (the same I/O size the OSD writes to the disk) without any
> difference in throughput.
> 
> Thanks for your suggestions!
> Andreas
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


