Re: Replication strategy, write throughput

Hello,

On Wed, 9 Nov 2016 21:56:08 +0100 Andreas Gerstmayr wrote:

> Hello,
> 
> >> 2 parallel jobs with one job simulating the journal (sequential
> >> writes, ioengine=libaio, direct=1, sync=1, iodepth=128, bs=1MB) and the
> >> other job simulating the datastore (random writes of 1MB)?
> >>
> > To test against a single HDD?
> > Yes, something like that, though the first fio job would need to go against
> > a raw partition and the iodepth isn't anywhere near that high with a journal;
> > in theory it's actually 1 (some Ceph developer please pipe up here).
> >
> 
> I took that number (iodepth=128) from https://github.com/ceph/ceph/blob/master/src/os/filestore/FileJournal.cc#L111
>  From the io_setup manpage: "The io_setup() system call creates an asynchronous I/O context suitable for concurrently processing nr_events operations."
>
I claim total ignorance with regard to the Ceph code and how the above
actually translates to disk activity; that's why I tried (and failed) to
summon someone who knows for certain.
Either way, sequential (and circular, i.e. writing to the same space over
and over again) is the key point here.
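
For what it's worth, a minimal two-job sketch along those lines might look
like the following (device, directory and sizes are placeholders, the journal
iodepth is dropped to 1 as per the above, and it's a rough approximation
rather than a faithful model of what an OSD actually does):
---
# Journal simulation: sequential, circular 1MB writes to a raw partition.
# /dev/sdX2 is a placeholder -- this overwrites the partition!
fio --name=journal --filename=/dev/sdX2 --rw=write --bs=1M \
    --ioengine=libaio --direct=1 --sync=1 --iodepth=1 \
    --size=5G --time_based --runtime=60 &

# Datastore simulation: random 1MB writes against an actual filesystem on
# the same disk. /mnt/osd-test is a placeholder; bs follows the 1MB from
# above and should match your stripe unit for sequential tests.
fio --name=datastore --directory=/mnt/osd-test --rw=randwrite --bs=1M \
    --ioengine=libaio --direct=1 --size=10G --time_based --runtime=60 &

wait
---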
 
> > The 2nd fio needs to run against an actual FS, the bs for both should
> > match your stripe unit size for sequential tests.
> >
> > What this setup misses, especially the 2nd part is that Ceph operates on
> > individual files which it has to create on the fly for the first time, may
> > create or delete sub directories and trees, updates a leveldb[*] on the
> > same FS, etc...
> >
> >
> > [*] see /var/lib/ceph/osd/ceph-nn/current/omap/
> >
> 
> Good point, thanks.
> 
> >> Last time I checked the disks were well utilized (i.e. they were busy
> >> almost 100% the time), but that doesn't equate to "can't accept more
> >> I/O operations".
> > Well, if it is really 100% busy and the next journal write has to wait
> > until all the seeking and syncing is done, then Ceph will of course block
> > at this point.
> >
> >> The throughput (as seen by iostat -xz 1) was way
> >> below the maximum.
> > Around 40MB/s per chance?
> 
> I repeated the test with 7 clients x 1 thread, replication 1,
> CephFS stripe unit=4MB, stripe count=1, object size=4MB
> 
> Output from iostat -xzm 5:
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            12,33    0,00   20,33   24,38    0,00   42,96
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdb               0,00     0,60    0,00  143,60     0,00    62,74   894,74    10,03   60,23    0,00   60,23   5,69  81,72
> sdg               0,00    74,80    0,00  270,00     0,00   102,43   776,97    83,25  308,53    0,00  308,53   3,63  97,96
> sdk               0,00     2,00    0,00  181,40     0,00    79,05   892,52    16,36   89,97    0,00   89,97   5,07  91,96
> sdd               0,00     4,00    0,00  244,20     0,00   114,64   961,40   165,66  662,56    0,00  662,56   4,09  99,84
> sdl               0,00     0,60    0,00  185,60     0,00    84,05   927,42    61,15  441,91    0,00  441,91   4,31  79,94
> sde               0,00     0,80    0,00  183,60     0,00    82,76   923,14    54,01  520,53    0,00  520,53   4,64  85,28
> sdj               0,00     1,80    0,00  242,20     0,00   111,45   942,42   119,56  493,59    0,00  493,59   4,00  96,98
> sdi               0,00     4,00    0,00  192,60     0,00    90,05   957,57   109,19  450,13    0,00  450,13   4,69  90,42
> sdp               0,00     2,80    0,00  170,80     0,00    74,78   896,67    10,72   58,13    0,00   58,13   5,48  93,68
> sds               0,00     2,00    0,00  178,00     0,00    80,92   931,05    48,59  273,00    0,00  273,00   5,08  90,44
> sdn               0,00     0,40    0,00  178,60     0,00    77,97   894,09    10,04   66,59    0,00   66,59   4,98  89,02
> sdr               0,00    64,20    0,00  205,60     0,00    83,93   835,99    49,16  218,47    0,00  218,47   4,65  95,64
> sdu               0,00     1,80    0,00  194,20     0,00    87,82   926,11    53,98  177,14    0,00  177,14   5,11  99,32
> sdx               0,00     1,20    0,00  175,00     0,00    78,73   921,42    33,68  131,99    0,00  131,99   5,47  95,78
> sda               0,00     2,20    0,00  218,40     0,00    97,16   911,07    39,80  182,23    0,00  182,23   4,51  98,48
> sdm               0,00    74,00    1,00  244,60     0,01    86,52   721,50    49,40  180,41   54,80  180,93   3,94  96,84
> sdq               0,00     0,60    0,00  163,80     0,00    73,04   913,18    17,03   62,75    0,00   62,75   5,77  94,52
> sdh               0,00    97,00    1,00  211,40     0,01    71,28   687,35    67,05  238,17   53,20  239,05   4,43  94,14
> sdf               0,00     1,00    0,00  162,80     0,00    73,02   918,55    27,24  167,31    0,00  167,31   5,40  87,96
> sdo               0,00     2,00    0,00  244,40     0,00   111,54   934,68    91,99  522,59    0,00  522,59   3,91  95,58
> sdc               0,00     0,80    0,40  203,20     0,00    90,49   910,26    25,71  126,16   46,00  126,32   4,75  96,80
> sdt               0,00     1,00    0,00  165,60     0,00    74,71   923,98    31,09  188,86    0,00  188,86   4,94  81,86
> sdw               0,00     3,40    0,00  223,40     0,00   104,10   954,37   144,31  532,19    0,00  532,19   4,46  99,64
> sdv               0,00     2,20    0,00  242,00     0,00   109,82   929,40    71,02  293,66    0,00  293,66   4,03  97,60
> 
> On average each disk was busy writing 87 MB/s. Average queue length per disk is 57.
> Similar output on the other 5 servers.
> Aggregated client throughput is 5682 MB/s (no change in throughput compared to the other striping configuration).
> 
> When I divide that number by 2 (because of the journal), I get the 43.5 MB/s.
> So this is my effective write speed per disk? 

Yes, at least for this "full blast" sequential write approach.
Which reminds me of something I usually bring up much earlier in this
kind of discussion:
Knowing the bandwidth/write speed of your cluster is a worthy goal, but in
nearly all use cases you tend to run out of IOPS (or get unacceptable
latency, especially w/o journal SSDs) long before you hit the maximum,
sustained write speed limit.
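
As a rough cross-check of that per-disk figure against your aggregate
client number (taking the 144 OSDs across your 6 servers):
---
5682 MB/s client writes / 144 OSDs    ~= 39.5 MB/s of client data per disk
39.5 MB/s x 2 (journal + datastore)   ~= 79 MB/s of raw writes per disk
---
which is in the same ballpark as the ~87 MB/s iostat shows, the remainder
presumably being leveldb and FS metadata overhead.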

> Guess I should try the bluestore asap (to avoid the double writes).

I wouldn't get my hopes up with regard to the S(oon) in ASAP.
My estimate is that this will be declared "production ready" sometime next
year and that anybody who actually values their data (like me) won't touch
it before 2018.


> 
> Same benchmark test with replication 3 (and same striping settings as above):
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            12,91    0,00   26,38   28,04    0,00   32,67
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdb               0,00     0,60    0,00  193,60     0,00    90,83   960,82    91,92  491,66    0,00  491,66   4,48  86,78
> sdg               0,00    57,20    0,20  213,40     0,00    83,04   796,23    58,43  278,98   67,00  279,18   4,46  95,22
> sdk               0,00     6,00    0,00  223,20     0,00   108,08   991,69   103,20  423,60    0,00  423,60   4,45  99,26
> sdd               0,00    44,60    0,00  237,20     0,00    94,67   817,39    97,37  395,30    0,00  395,30   4,19  99,38
> sdl               0,00     1,80    0,00  286,40     0,00   137,32   981,94   177,65  635,85    0,00  635,85   3,49 100,00
> sde               0,00     1,20    0,00  183,60     0,00    82,56   920,90    24,12  151,46    0,00  151,46   5,21  95,60
> sdj               0,00    10,00    0,00  232,20     0,00   115,36  1017,51   145,47  593,01    0,00  593,01   4,31 100,00
> sdi               0,00     5,00    0,00  210,00     0,00   101,46   989,49   117,92  501,59    0,00  501,59   4,71  98,88
> sdp               0,00     5,00    0,00  257,60     0,00   124,37   988,76   186,43  651,64    0,00  651,64   3,88  99,96
> sds               0,00     1,20    0,00  249,00     0,00   118,78   976,91   107,23  529,87    0,00  529,87   3,99  99,42
> sdn               0,00     3,40    0,00  282,60     0,00   135,38   981,12   212,53  768,44    0,00  768,44   3,54  99,98
> sdr               0,00    41,60    0,00  265,20     0,00   112,67   870,09    96,91  430,14    0,00  430,14   3,75  99,56
> sdu               0,00    55,60    0,00  187,20     0,00    72,29   790,89    23,08  173,73    0,00  173,73   5,09  95,26
> sdx               0,40     2,20    0,80  222,20     0,15   109,18  1004,08   194,31  757,00  573,50  757,66   4,48 100,00
> sda               0,00     1,40    0,00  266,60     0,00   129,20   992,49   145,07  567,22    0,00  567,22   3,75 100,00
> sdm               0,00    58,60    0,00  225,20     0,00    89,67   815,43    75,26  333,63    0,00  333,63   4,41  99,28
> sdq               0,00     2,20    0,00  239,60     0,00   117,24  1002,14   142,16  627,56    0,00  627,56   4,17  99,98
> sdh               0,00    70,20    0,00  234,80     0,00    93,04   811,55    66,73  228,26    0,00  228,26   4,23  99,26
> sdf               0,00     2,40    0,00  271,00     0,00   131,08   990,63   221,43  968,74    0,00  968,74   3,69 100,00
> sdo               0,00     2,00    0,00  256,00     0,00   124,71   997,67   193,33  770,51    0,00  770,51   3,91 100,00
> sdc               0,00     2,60    0,00  267,40     0,00   126,37   967,88   141,07  640,00    0,00  640,00   3,73  99,86
> sdt               0,00     1,60    0,00  203,20     0,00    95,73   964,83    74,25  456,10    0,00  456,10   4,48  91,10
> sdw               0,00     2,80    0,00  267,20     0,00   128,19   982,56   127,02  482,74    0,00  482,74   3,74  99,98
> sdv               0,00     8,60    0,00  200,80     0,00    94,03   959,08    57,78  291,25    0,00  291,25   4,89  98,28
> 
> Average throughput: 108 MB/s, average queue length: 119.
> Aggregated client throughput is 1903 MB/s (also no measurable change).
> The average throughput divided by 2 (because of the journal) would be 54 MB/s,
> and therefore 144 OSDs * 54 MB/s = 7776 MB/s should be the baseline?
> 
Yes, since with replication 3 you REALLY will get every last OSD 100% busy
eventually, as your next paragraph clearly shows.
You want your OSDs to be less than 70% busy during normal operation at
all times.
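
Spelled out with your numbers (back-of-the-envelope again):
---
108 MB/s x 144 OSDs        ~= 15552 MB/s of raw disk writes
15552 MB/s / 2 (journal)   ~=  7776 MB/s of filestore writes
 7776 MB/s / 3 (replicas)  ~=  2592 MB/s of client-visible writes
---
and the 1903 MB/s you measured sits below that, which fits with the
leveldb/FS overhead mentioned earlier and the blocked ops you describe
below.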

> I noticed that during benchmarks with replication 3 I get lots of blocked ops
> (I don't get that much replication 1), which disappear after the benchmark has finished:
> 15155 requests are blocked > 32 sec; 143 osds have slow requests;
> 
Because with 3 times more writes you're more likely to get into the
previously discussed "queue full, journal blocked" situations.

As a rule of thumb, if your cluster suffers blocked operations from tests
like this or more precisely something like "rados bench", you want to
re-think the design, in particular considering journal SSDs.
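
For example, something along these lines against a throwaway pool, while
keeping an eye on the cluster status (pool name and runtime are
placeholders):
---
# sustained writes of 4MB objects for 60s with 32 concurrent ops;
# --no-cleanup keeps the objects around for later read tests
rados bench -p testpool 60 write -t 32 --no-cleanup

# in another terminal, watch for slow/blocked requests piling up
ceph -w
ceph health detail
---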

> Logs of a random OSD tell me:
> 
> 2016-11-09 20:35:57.322967 7f5684c65700  0 log_channel(cluster) log [WRN] : 195 slow requests, 5 included below; oldest blocked for > 70.047596 secs
> 2016-11-09 20:35:57.322979 7f5684c65700  0 log_channel(cluster) log [WRN] : slow request 60.405079 seconds old, received at 2016-11-09 20:34:56.917801: osd_repop(client.65784.1:22895980 5.49e 5:7939450d:::1000000a2e2.00000429:head v 1191'192668) currently started
> 2016-11-09 20:35:57.322985 7f5684c65700  0 log_channel(cluster) log [WRN] : slow request 30.160712 seconds old, received at 2016-11-09 20:35:27.162168: osd_repop(client.65781.1:23303524 5.614 5:286a4a86:::1000000996a.00000fab:head v 1191'191953) currently started
> 
> All slow requests have to do with "osd_repop".
> 
> Looking at a single server:
> About 1800 network segments get retransmitted per second, of which about 1400 are TCPFastRetrans.
> 
> If I take a look at netstat -s:
> 229373047157 segments send out
> 802376626 segments retransmited
> Only 0.35% of the segments get retransmitted.
> 
I have to look at busy internet-facing servers here to see that kind of
retransmit level; our Ceph clusters all use IPoIB (64k MTU) and even the
busiest one is an order of magnitude lower than this:
---
    44115975570 segments send out
    766216 segments retransmited
---

And of course SSD journals or pure SSD OSDs in the cache tier.
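
For reference, that ratio is easy enough to pull out of "netstat -s"
directly, something like this (a quick sketch matching the counter names
shown above):
---
netstat -s | awk '/segments send out/     {out=$1}
                  /segments retransmited/ {re=$1}
                  END {printf "%.3f%% retransmitted\n", re/out*100}'
---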

> 
> Can I deduce that the disks are saturated, hence the blocked ops, and hence the network traffic pattern described above?
> 
Pretty much, yes.

> 
> > Make that "more distributed I/O".
> > As in, you keep 4 times more OSDs busy than with the 4MB default stripe
> > size.
> > Which would be a good thing for small writes, so they hit different disks,
> > in an overall not very busy cluster.
> > For sequential writes at full speed, not so much.
> 
> Isn't more distributed I/O always favorable? Or is the problem the 4x overhead (1MB vs 4MB)?
>

As I said, in general yes. And since you're basically opening the
floodgates, you're not seeing a difference between stripe sizes.

Think of a less loaded scenario: you're writing 40MB sequentially.
With a 4MB stripe size that will involve 10 primary OSDs (hitting the same
OSD more than once is of course a possibility) and the respective number of
replica OSDs.
So up to 30/144 OSDs get busy, a ratio where your chances of avoiding OSD
contention (writes going to the same OSD, or the same OSD being the target
of both primary and secondary writes) are pretty favorable.
With a 1MB stripe size this will get up to 120/144 OSDs busy, which would be
nice if it were perfectly distributed, but at your cluster size the chances
of contention have just gone up.

Conversely, if you had a 40MB database file with lots of small writes at
random locations all over it, the 1MB stripe of course has the advantage,
especially when it comes to latency.
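
(If you want to experiment with that, the CephFS layout can be set per
directory via xattrs before any files are created in it; a rough sketch,
with the path and values as examples only:)
---
# 1MB stripe unit / 1MB objects for new files below this directory
setfattr -n ceph.dir.layout.stripe_unit -v 1048576 /mnt/cephfs/dbdir
setfattr -n ceph.dir.layout.stripe_count -v 1       /mnt/cephfs/dbdir
setfattr -n ceph.dir.layout.object_size  -v 1048576 /mnt/cephfs/dbdir
---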

Christian

> 
> Thanks for your helpful advice!
> Andreas
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


