On Tue, Nov 21, 2017 at 10:46 AM, Christian Balzer <chibi@xxxxxxx> wrote:
On Tue, 21 Nov 2017 09:21:58 +0200 Rudi Ahlers wrote:
> On Mon, Nov 20, 2017 at 2:36 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> > On Mon, 20 Nov 2017 14:02:30 +0200 Rudi Ahlers wrote:
> >
> > > We're planning on installing 12 virtual machines with some heavy loads.
> > >
> > > the SSD drives are INTEL SSDSC2BA400G4
> > >
> > Interesting, where did you find those?
> > Or did you have them lying around?
> >
> > I've been unable to get DC S3710 SSDs for nearly a year now.
> >
>
> In South Africa, one of our suppliers had some in stock. They're still
> fairly new, about 2 months old now.
>
>
Odd, oh well.
>
>
> > > The SATA drives are ST8000NM0055-1RM112
> > >
> > Note that these (while fast) have an internal flash cache, limiting them to
> > something like 0.2 DWPD.
> > Probably not an issue with the WAL/DB on the Intels, but something to keep
> > in mind.
> >
>
>
> I don't quite understand what you mean; could you please explain?
>
See the other mails in this thread after the one above.
In short, probably nothing to worry about.
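If you want to keep an eye on the actual write volume anyway, smartctl will
show it. Attribute names and units vary by vendor and model, so treat this
as illustrative only (/dev/sdX is a placeholder):

smartctl -A /dev/sdX | grep -iE 'Total_LBAs_Written|Host_Writes'

Check the drive's documentation for the unit of that counter before
converting it to bytes and comparing against the rated endurance.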
>
>
> > > Please explain your comment, "b) will find a lot of people here who don't
> > > approve of it."
> > >
> > Read the archives.
> > Converged clusters are complex and debugging Ceph when tons of other
> > things are going on at the same time on the machine even more so.
> >
>
>
> Ok, so I have 4 physical servers and need to set up a highly redundant
> cluster. How else would you have done it? There is no budget for a SAN, let
> alone a highly available SAN.
>
As I said, I'd be fine doing it with Ceph, if that were a good match.
It's easy to starve resources with hyperconverged clusters.
Since you're using Proxmox, DRBD would be an obvious alternative,
especially if you're not planning on growing this cluster.
You only mentioned 3 servers so far; is the fourth one non-Ceph?
From what I have read, DRBD isn't very stable?
The 4th one will be for backups.
>
>
> >
> > > I don't have access to the switches right now, but they're new so whatever
> > > default config ships from factory would be active. Though iperf shows
> > > 10.5 GBytes / 9.02 Gbits/sec throughput.
> > >
> > Didn't think it was the switches, but for completeness' sake and all that.
> >
> > > What speeds would you expect?
> > > "Though with your setup I would have expected something faster, but NOT
> > the
> > > theoretical 600MB/s 4 HDDs will do in sequential writes."
> > >
> > What I wrote.
> > A 7200RPM HDD, even one of these, cannot sustain writes much over 170MB/s,
> > even in the most optimal circumstances.
> > So your cluster can NOT exceed about 600MB/s sustained writes with the
> > effective bandwidth of 4 HDDs.
> > Smaller writes/reads that can be cached by RAM, DB, onboard caches on the
> > HDDs of course can and will be faster.
> >
> > But again, you're missing the point: even if you get 600MB/s writes out of
> > your cluster, the number of 4k IOPS will be much more relevant to your VMs.
> >
> >
> hdparm shows about 230MB/s:
>
> root@virt2:~# hdparm -Tt /dev/sda
>
> /dev/sda:
> Timing cached reads: 20250 MB in 2.00 seconds = 10134.81 MB/sec
> Timing buffered disk reads: 680 MB in 3.00 seconds = 226.50 MB/sec
>
That's a read test, and a very optimized sequential one at that.
>
>
> 600MB/s would be super nice, but in reality even 400MB/s would be nice.
Do you really need to write that amount of data in a short time?
Typical VMs are IOPS bound, as pointed out several times.
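If you want a number that reflects what the VMs will actually see, a 4k
random write test is far more telling than hdparm. A minimal fio run against
a file on the RBD-backed storage would look something like the following
(fio must be installed; the path is only a placeholder):

fio --name=4k-randwrite --filename=/mnt/rbd-test/fio.dat --size=4G \
    --rw=randwrite --bs=4k --ioengine=libaio --direct=1 \
    --iodepth=32 --runtime=60 --time_based --group_reporting

The IOPS figure it reports is the one worth comparing between setups.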
We have 10 physical servers which are quite busy, and two of them are slow in terms of disk speed, so I am looking at getting better performance.
> Would it not be achievable?
>
Maybe, but you need to find out what, if anything, makes your cluster
slower than this.
iostat, atop, etc can help with that.
How busy are your CPUs, HDDs and SSDs when you run that benchmark?
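For example, something like this in a second terminal while the benchmark
runs (from the sysstat package), watching %util and await for the HDDs and
the SSDs:

iostat -x 5

A device sitting near 100% util for the whole run is your bottleneck.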
The CPU and RAM are fairly "idle" during any of my tests.
>
>
> > >
> > >
> > > On this, "If an OSD has no fast WAL/DB, it will drag the overall speed
> > > down. Verify and if so fix this and re-test.": how?
> > >
> > No idea, I don't do bluestore.
> > You noticed the lack of a WAL/DB for sda, go and fix it.
> > If in doubt, by destroying and re-creating.
> >
> > And if you're looking for a less invasive procedure, docs and the ML
> > archive, but AFAIK there is nothing but re-creation at this time.
> >
>
>
> I use Proxmox, which set up a DB device but not a WAL device.
>
Again, I don't do bluestore.
But AFAIK, WAL will live on the fastest device, which is the SSD you've
put the DB on, unless specified separately.
So nothing to be done here.
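(For the record, if you ever do want to specify both explicitly when
re-creating an OSD, the ceph-disk sequence should be roughly the one below.
osd.4, /dev/sda and /dev/sde are placeholders, and this of course destroys
the data on that OSD:

systemctl stop ceph-osd@4
ceph osd out osd.4
ceph osd crush remove osd.4
ceph auth del osd.4
ceph osd rm osd.4
ceph-disk zap /dev/sda
ceph-disk prepare --bluestore /dev/sda --block.db /dev/sde --block.wal /dev/sde

Activation is normally triggered by udev once prepare finishes.)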
I have re-created the Ceph OSDs with both a DB and a WAL device this time, and performance is slightly better:
root@virt2:~# ceph-disk list | grep /dev/sdf | grep osd
/dev/sdb1 ceph data, active, cluster ceph, osd.5, block /dev/sdb2, block.db /dev/sdf1, block.wal /dev/sdf2
/dev/sdd1 ceph data, active, cluster ceph, osd.7, block /dev/sdd2, block.db /dev/sdf3, block.wal /dev/sdf4
root@virt2:~# ceph-disk list | grep /dev/sde | grep osd
/dev/sda1 ceph data, active, cluster ceph, osd.4, block /dev/sda2, block.db /dev/sde1, block.wal /dev/sde2
/dev/sdc1 ceph data, active, cluster ceph, osd.6, block /dev/sdc2, block.db /dev/sde3, block.wal /dev/sde4
root@virt2:~# rados bench -p Data 10 seq
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 311 295 1179.73 1180 0.0498938 0.0520793
2 16 622 606 1211.78 1244 0.0358 0.0511329
3 16 934 918 1223.8 1248 0.0587524 0.0506744
Total time run: 3.420127
Total reads made: 986
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1153.17
Average IOPS: 288
Stddev IOPS: 9
Max IOPS: 312
Min IOPS: 295
Average Latency(s): 0.053413
Max latency(s): 0.284069
Min latency(s): 0.0166523
root@virt2:~# rados bench -p Data 10 rand
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 381 365 1459.69 1460 0.00267135 0.04159
2 15 715 700 1399.75 1340 0.0934119 0.0441607
3 15 1079 1064 1418.44 1456 0.00258879 0.0435526
4 16 1448 1432 1431.77 1472 0.134513 0.0435446
5 16 1862 1846 1476.56 1656 0.017519 0.042301
6 16 2192 2176 1450.44 1320 0.00885603 0.0427858
7 16 2558 2542 1452.35 1464 0.00184139 0.0429065
8 16 2996 2980 1489.78 1752 0.0103593 0.04178
9 16 3385 3369 1497.12 1556 0.00866541 0.041612
10 16 3744 3728 1490.99 1436 0.00246718 0.0420014
Total time run: 10.204271
Total reads made: 3744
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1467.62
Average IOPS: 366
Stddev IOPS: 33
Max IOPS: 438
Min IOPS: 330
Average Latency(s): 0.0427017
Max latency(s): 0.453643
Min latency(s): 0.00143035
root@virt2:~# rados bench -p Data 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_virt2_20816
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 106 90 359.981 360 0.211947 0.164055
2 16 202 186 371.956 384 0.101829 0.161727
3 16 312 296 394.616 440 0.142682 0.157926
4 16 414 398 397.946 408 0.17893 0.157207
5 16 515 499 399.147 404 0.138521 0.157384
6 16 609 593 395.281 376 0.197496 0.159185
7 16 703 687 392.521 376 0.148057 0.160965
8 16 796 780 389.952 372 0.360846 0.161464
9 16 907 891 395.951 444 0.0697599 0.160687
10 16 989 973 389.153 328 0.164584 0.161334
Total time run: 10.125151
Total writes made: 990
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 391.105
Stddev Bandwidth: 35.6302
Max bandwidth (MB/sec): 444
Min bandwidth (MB/sec): 328
Average IOPS: 97
Stddev IOPS: 8
Max IOPS: 111
Min IOPS: 82
Average Latency(s): 0.163488
Stddev Latency(s): 0.0623322
Max latency(s): 0.451163
Min latency(s): 0.0416428
As noted, the IOPS are still very, very low. What could cause that?
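Would it be worth rerunning the write bench with a 4k block size to measure
small writes directly? If I read the docs right, something like

rados bench -p Data 10 write -t 16 -b 4096 --no-cleanup

should report an IOPS figure much closer to what the VMs experience, even
though the MB/s number will look tiny.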
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com