2016-11-07 3:05 GMT+01:00 Christian Balzer <chibi@xxxxxxx>:
>
> Hello,
>
> On Fri, 4 Nov 2016 17:10:31 +0100 Andreas Gerstmayr wrote:
>
>> Hello,
>>
>> I'd like to understand how replication works.
>> In the paper [1] several replication strategies are described, and
>> according to a (bit old) mailing list post [2] primary-copy is used.
>> Therefore the primary OSD waits until the object is persisted and then
>> updates all replicas in parallel.
>>
> Given that nobody who actually knows the answers to this has piped up, I
> shall give it a go.
>

Thanks for your response.

> The "persisted" above means "written to the journal" and the parallel bit
> is quite the mystery to me as well.
> As in, it's clearly not sequential, but one most likely gets kicked off
> before the other and I for one am uncertain how much delay that entails.
> But it's unlikely to be the culprit, see below.
>
>> Current cluster setup:
>> Ceph jewel 10.2.3
>> 6 storage nodes
>> 24 HDDs each, journal on same disk [3]
> Unrelated to the topic at hand, journal on same HDD is a recipe for pain.
> Your fio result against a single HDD is fine (quite good actually), but
> that is ignoring a lot of complexity in real (Ceph) life.
>
>> Frontend network: 10 Gbit/s
>> Backend network: 2 x 10 Gbit/s bonded with layer3+4 hashing [4]
> Bandwidth is one thing, latency another.
> With some setups and switches in particular, LACP can add noticeable
> latency.
>
>> CephFS with striping: 1M stripe unit, 10 stripe count, 10M object size
>>
> I know nothing about CephFS and how this striping relates to actual RADOS
> striping, but having smaller stripe units will usually be better for
> latency, not bandwidth.
>
>> My assumption was that there should be no difference whether I write
>> to replication 2 or 3, because each storage node can accept 10 Gbit/s
>> traffic from the frontend network and send 10 Gbit/s traffic
>> simultaneously to two other storage nodes.
>>
> Theory, meet reality.
> The biggest impact when increasing replication size is latency (even when
> going from 2 to 3, thus my question marks about how parallel these things
> really are).
> So while that will affect your large sequential writes somewhat, it's not
> what's causing your results.
>
>> Disk write capacity shouldn't be a problem either:
>> 200 MB/s throughput * 6 nodes * 24 disks / 2 (journal) / 3 replicas = 4800 MB/s.
>>
> This is where your expectations and estimations start to fall apart.
> The above is overly optimistic and simplistic.
>
> It assumes:
> 1. Single, sequential writes over the whole cluster (the biggest flaw)
> 2. Sequential writes within the OSD (they're not, it's 4 or in your case
> maybe 1MB individual file writes).
> 3. No overhead from FS journaling and the resulting double writes and seeks.
>

Yes, the writes of the OSD are 1 MB in size (checked with strace).
Thanks for the hint, you're right, I didn't count the (double) seek
latency in.

> I personally feel happy when individual HDD based OSDs WITH SSD journals
> can maintain about 60MB/s writes.
>

Only 60 MB/s? Is this because the writes of the OSDs are in reality more
random than sequential?
Do you have a tip for a fio workload that best matches the "real" OSD
workload? Something like 2 parallel jobs, with one job simulating the
journal (sequential writes, ioengine=libaio, direct=1, sync=1, iodepth=128,
bs=1MB) and the other job simulating the datastore (random writes of 1MB)?
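Something like the following job file is roughly what I have in mind (just
a sketch, untested; the paths, file counts, sizes and the fsync interval
are placeholders, and fio runs both jobs in parallel by default):

[global]
ioengine=libaio
bs=1M
runtime=300
time_based=1

[journal-sim]
# sequential 1M writes with O_DIRECT + O_SYNC, like the co-located journal
rw=write
direct=1
sync=1
iodepth=128
filename=/mnt/osd-disk/journal-sim.bin
size=5G

[datastore-sim]
# buffered 1M writes spread over many files with periodic fsyncs
# (a very rough stand-in for the filestore data writes)
rw=randwrite
iodepth=1
fsync=16
directory=/mnt/osd-disk/datastore-sim
nrfiles=50
size=20G

I'd run that against an otherwise idle OSD disk and compare the aggregate
throughput with what a real OSD manages. Does that look close enough, or
is the real access pattern even worse?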
> Looking at your results below, that number seems to be around 40MB/s
> for you in the single replication test and thus should be the baseline
> for your expectations.
>

40 MB/s = 5695.33 MB/s / 144 OSDs?
So your assumption is that the bottleneck for replication 1 is the disks?
That would explain the 1/2 throughput for replication 2 and 1/3 for
replication 3.
Last time I checked, the disks were well utilized (i.e. they were busy
almost 100% of the time), but that doesn't equate to "can't accept more
I/O operations". The throughput (as seen by iostat -xz 1) was way below
the maximum. But as you noted, I have to count in the (double) seek
latency (and probably a different access pattern) as well.

> As always, I recommend running atop or the likes on all your nodes with a
> high (5s or less) interval in huge terminals when doing such tests.
> You'll see how busy your individual OSDs are and that some are going to be
> near or at their max capacity.
>
> With that in mind, remember that Ceph doesn't guarantee equal distribution
> of anything and that the slowest OSD will determine the speed of the whole
> thing (eventually).
> So even with 144 OSDs you'll likely have some OSDs that are busier
> (receiving primary and secondary writes at the same time) than others,
> slowing things down for everyone.
>

Another very good point, thanks! I didn't assume a perfectly equal
distribution, and knew that the slowest disk determines the speed, but I
didn't think of the case of a single OSD receiving primary and secondary
writes at the same time.

> Again, IF you're testing against CephFS and IF the striping does what I
> think it does, you're creating more I/O and thus busier OSDs than a
> default setup would do.
>

Yes, I'm testing against CephFS.
How do I create more I/O than in a default setup?

>>
>> Results with 7 clients:
> Details please.
> As in, what test (fio/dd?) did you use (exact command line), against what
> (CephFS, mounted how?).
>
>
>> Replication 1: 5695.33 MB/s
>> Replication 2: 3337.09 MB/s
>> Replication 3: 1898.17 MB/s
>>
>> Replication 2 is about 1/2 of Replication 1, and Replication 3 is
>> exactly 1/3 of Replication 1.
>> Any hints what the bottleneck is in this case?
>>
> As hopefully made clear above, primarily your OSD bandwidth.
>

Sorry, I completely forgot to add the command (executed in parallel on 7
clients):

fio --name=job1 --rw=write --blocksize=64K --numjobs=1 --runtime=300 \
    --size=400G --randrepeat=0 --fallocate=none --refill_buffers \
    --end_fsync=1 \
    --directory=/media/ceph/repl1_stripeunit-1M_stripecount-10_objectsize-10M/benchmarkfiles/<hostname_of_client> \
    --group_reporting --write_bw_log=job --write_iops_log=job

Tested against CephFS with default mount settings.
I already repeated the tests with 1 MB as blocksize to match the stripe
unit (the same I/O size the OSD writes to the disk) without any difference
in throughput.

Thanks for your suggestions!

Andreas
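PS: In case it helps with reproducing the test setup: the striping of a
test directory can be set via the CephFS ceph.dir.layout xattrs before any
files are created in it, roughly like this (just a sketch; the values
correspond to the 1M/10/10M layout used above):

DIR=/media/ceph/repl1_stripeunit-1M_stripecount-10_objectsize-10M
setfattr -n ceph.dir.layout.stripe_unit  -v 1048576  "$DIR"
setfattr -n ceph.dir.layout.stripe_count -v 10       "$DIR"
setfattr -n ceph.dir.layout.object_size  -v 10485760 "$DIR"
getfattr -n ceph.dir.layout "$DIR"    # verify the resulting layout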