Re: Replication strategy, write throughput

2016-11-07 3:05 GMT+01:00 Christian Balzer <chibi@xxxxxxx>:
>
> Hello,
>
> On Fri, 4 Nov 2016 17:10:31 +0100 Andreas Gerstmayr wrote:
>
>> Hello,
>>
>> I'd like to understand how replication works.
>> In the paper [1] several replication strategies are described, and
>> according to a (bit old) mailing list post [2] primary-copy is used.
>> Therefore the primary OSD waits until the object is persisted and then
>> updates all replicas in parallel.
>>
> Given that nobody who actually knows the answers to this has piped up, I
> shall give it a go.
>

Thanks for your response.

> The "persisted" above means "written to the journal" and the parallel bit
> is quite the mystery to me as well.
> As in, it's clearly not sequential, but one most likely gets kicked off
> before the other and I for one am uncertain how much delay that entails.
> But it's unlikely to be the culprit, see below.
>
>> Current cluster setup:
>> Ceph jewel 10.2.3
>> 6 storage nodes
>> 24 HDDs each, journal on same disk [3]
> Unrelated to the topic at hand, journal on same HDD is a recipe for pain.
> Your fio result against a single HDD is fine (quite good actually), but
> that is ignoring a lot of complexity in real (Ceph) life.
>
>> Frontend network: 10 Gbit/s
>> Backend network: 2 x 10 Gbit/s bonded with layer3+4 hashing [4]
> Bandwidth is one thing, latency another.
> With some setups and switches in particular LACP can add noticeable
> latency.
>
>> CephFS with striping: 1M stripe unit, 10 stripe count, 10M object size
>>
> I know nothing about CephFS and how this striping relates to actual RADOS
> striping, but having smaller stripe units will usually be better for
> latency, not bandwidth.
>
>> My assumption was that there should be no difference whether I write
>> to replication 2 or 3, because each storage node can accept 10 Gbit/s
>> traffic from the frontend network and send 10 Gbit/s traffic
>> simultaneous to two other storage nodes.
>>
> Theory, meet reality.
> The biggest impact when increasing replication size is latency (even when
> going from 2 to 3, thus my question marks about how parallel these things
> really are).
> So while that will affect your large sequential writes somewhat, it's not
> what's causing your results.
>
>> Disk write capacity shouldn't be a problem either:
>> 200 MB/s throughput * 6 nodes * 24 disks / 2 (journal) / 3 replicas = 4800 MB/s.
>>
> This is where your expectations and estimations start to fall apart.
> The above is overly optimistic and simplistic.
>
> It assumes:
> 1. Single, sequential writes over the whole cluster (the biggest flaw)
> 2. Sequential writes within the OSD (they're not, it's 4MB or in your case
> maybe 1MB individual file writes).
> 3. No overhead from FS journaling and the resulting double writes and seeks.
>

Yes, the writes of the OSD are 1 MB in size (checked with strace).
Thanks for the hint, you're right, I didn't account for the (double) seek latency.

> I personally feel happy when individual HDD based OSDs WITH SSD journals
> can maintain about 60MB/s writes.
>

Only 60 MB/s? Is this because the OSDs' writes are in reality more
random than sequential?
Do you have a tip for a fio workload that matches the "real" OSD
workload best?
Two parallel jobs, with one job simulating the journal (sequential
writes, ioengine=libaio, direct=1, sync=1, iodepth=128, bs=1M) and the
other job simulating the datastore (random writes of 1 MB)?
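
Something like the sketch below is what I have in mind (paths and file
sizes are just placeholders of mine, and it should only be pointed at a
scratch disk/filesystem):

# osd-sim.fio -- rough simulation of a filestore OSD with a co-located journal
[global]
ioengine=libaio
direct=1
bs=1M
runtime=300
time_based
fallocate=none
group_reporting

[journal]
# sequential O_SYNC writes into one file, standing in for the journal
rw=write
sync=1
iodepth=128
filename=/mnt/scratch/journal.img
size=5G

[datastore]
# scattered 1M writes into a larger file, standing in for the object files
rw=randwrite
iodepth=1
filename=/mnt/scratch/datastore.img
size=20G

Running both jobs together with "fio osd-sim.fio" on an otherwise idle
disk should give a rough per-OSD ceiling for this write pattern.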

> Looking at your results below, that number seems to be around 40MB/s
> for you in the single replication test and thus should be the baseline
> for your expectations.
>

40 MB/s ≈ 5695.33 MB/s / 144 OSDs?
So your assumption is that with replication 1 the bottleneck is the disks?
That would explain the 1/2 throughput for replication 2 and the 1/3 for
replication 3.
Last time I checked, the disks were well utilized (i.e. they were busy
almost 100% of the time), but that doesn't equate to "can't accept more
I/O operations". The throughput (as seen by iostat -xz 1) was way
below the maximum. But as you noted, I have to account for the (double)
seek latency (and probably a different access pattern) as well.
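
Next run I will watch the latency and queue columns rather than just
%util; roughly like this (the exact column names depend on the sysstat
version, so treat it as a sketch):

# on each storage node while the benchmark runs; w_await and avgqu-sz
# climbing while wMB/s stays flat would point at seek-bound rather than
# bandwidth-bound disks
iostat -xzm 1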

> As always, I recommend running atop or the like on all your nodes with a
> short (5s or less) interval in huge terminals when doing such tests.
> You'll see how busy your individual OSDs are and that some are going to be
> near or at their max capacity.
>
> With that in mind, remember that Ceph doesn't guarantee equal distribution
> of anything and that the slowest OSD will determine the speed of the whole
> thing (eventually).
> So even with 144 OSDs you'll likely have some OSDs that are busier
> (receiving primary and secondary writes at the same time) than others,
> slowing things down for everyone.
>

Another very good point, thanks! I already didn't assume a perfectly
equal distribution and knew that the slowest disk determines the speed,
but I didn't think about the case of a single OSD receiving primary and
secondary writes at the same time.
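
Before the next run I will also check how evenly the data and the
primaries are spread over the OSDs; roughly like this (the column
positions in pgs_brief are an assumption on my part, so the header
should be checked first):

# per-OSD fill level and PG count
ceph osd df

# count how many PGs each OSD serves as acting primary (assumes the
# acting primary is the last column of pgs_brief)
ceph pg dump pgs_brief | awk '$1 ~ /^[0-9]+\./ {print $NF}' | sort -n | uniq -c | sort -rn | head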

> Again, IF you're testing against CephFS and IF the striping does what I
> think it does, you're creating more I/O and thus busier OSDs than a default
> setup would do.
>

Yes, I'm testing against CephFS.
How would I create more I/O than in a default setup? Is it because with
a 1M stripe unit and a stripe count of 10 a single large write is spread
over ten different objects (and thus potentially ten different OSDs),
instead of the fewer, larger 4 MB objects a default layout would use?
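
(To double-check which layout the test directory really uses, I guess
the virtual xattr exposed by the CephFS client should work:)

# show the stripe_unit / stripe_count / object_size in effect on the test dir
getfattr -n ceph.dir.layout /media/ceph/repl1_stripeunit-1M_stripecount-10_objectsize-10M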

>>
>> Results with 7 clients:
> Details please.
> As in, what test (fio/dd?) did you use (exact command line), against what
> (CephFS, mounted how?).
>
>
>> Replication 1: 5695.33 MB/s
>> Replication 2: 3337.09 MB/s
>> Replication 3: 1898.17 MB/s
>>
>> Replication 2 is about 1/2 of Replication 1, and Replication 3 is
>> exactly 1/3 of Replication 1.
>> Any hints what the bottleneck is in this case?
>>
> As hopefully made clear above, primarily your OSD bandwidth.
>

Sorry, I completely forgot to add the command (executed in parallel on 7 clients):

fio --name=job1 --rw=write --blocksize=64K --numjobs=1 --runtime=300 \
    --size=400G --randrepeat=0 --fallocate=none --refill_buffers \
    --end_fsync=1 \
    --directory=/media/ceph/repl1_stripeunit-1M_stripecount-10_objectsize-10M/benchmarkfiles/<hostname_of_client> \
    --group_reporting --write_bw_log=job --write_iops_log=job

Tested against CephFS with default mount settings.
I already repeated the tests with a 1 MB blocksize to match the
stripe unit (the same I/O size the OSDs write to disk), with no
difference in throughput.

Thanks for your suggestions!
Andreas
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


