Re: Slow performance into windows VM

K K <nnex@xxxxxxx> · Tue, 19 Jul 2016 06:17:33 +0300

robocopy in Windows has the /MT:N flag, where N is the thread count. With /MT:24 I get 20-30 MB/s when copying from one VM instance to another. That is after disabling scrubbing during working hours.

Tuesday, 12 July 2016, 5:44 +05:00 from Christian Balzer <chibi@xxxxxxx>:

Hello,

scrub settings will only apply to new scrubs, not running ones, as you

found out.

On Mon, 11 Jul 2016 15:37:49 +0300 K K wrote:

> 

> I tested the Windows instance with CrystalDiskMark. The result is:

>

Again, when running a test like this, check with atop/iostat how your

OSDs/HDDs are doing.

> Sequential Read : 43.049 MB/s

> Sequential Write : 45.181 MB/s

> Random Read 512KB : 78.660 MB/s

> Random Write 512KB : 39.292 MB/s

> Random Read 4KB (QD=1) : 3.511 MB/s [ 857.3 IOPS]

> Random Write 4KB (QD=1) : 1.380 MB/s [ 337.0 IOPS]

> Random Read 4KB (QD=32) : 32.220 MB/s [ 7866.1 IOPS]

> Random Write 4KB (QD=32) : 12.564 MB/s [ 3067.4 IOPS]

> Test : 4000 MB [D: 97.5% (15699.7/16103.1 GB)] (x3)

> 

These numbers aren't all that bad, with your network and w/o SSD journals

the 4KB ones are pretty much on par.
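
The 4KB figures quoted above can be cross-checked with a little arithmetic. A minimal sketch, assuming CrystalDiskMark reports decimal MB/s and a 4096-byte block; the function names are made up for illustration. At QD=1 only one request is in flight, so the inverse of the IOPS figure approximates the per-request latency.

```python
# Convert the reported 4KB throughput into IOPS and, for QD=1, into an
# approximate per-request latency.
# Assumptions: MB/s is decimal (10^6 bytes) and the block size is 4096 bytes.
BLOCK_SIZE = 4096  # bytes per request in the "4KB" tests


def iops(mb_per_s, block_size=BLOCK_SIZE):
    """Requests per second implied by a throughput figure."""
    return mb_per_s * 1_000_000 / block_size


def qd1_latency_ms(mb_per_s, block_size=BLOCK_SIZE):
    """Average time per request in ms when only one request is in flight."""
    return 1000.0 / iops(mb_per_s, block_size)


# Throughput values copied from the quoted benchmark result.
results = {
    "random read 4KB QD=1": 3.511,
    "random write 4KB QD=1": 1.380,
    "random read 4KB QD=32": 32.220,
    "random write 4KB QD=32": 12.564,
}

for name, mbps in results.items():
    print(f"{name}: {iops(mbps):.1f} IOPS")

print(f"QD=1 read latency:  {qd1_latency_ms(3.511):.2f} ms")
print(f"QD=1 write latency: {qd1_latency_ms(1.380):.2f} ms")
```

A per-request latency in the low milliseconds at QD=1 is what caps single-threaded copies, which is also why a higher queue depth (or more copy threads) raises the throughput.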

You may get better read performance by permanently enabling read-ahead, as

per:

http://docs.ceph.com/docs/hammer/rbd/rbd-config-ref/

Windows may have native settings to do that, but I know zilch about that.
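
For reference, a sketch of what "permanently enabling read-ahead" might look like on the client side. The three option names come from the RBD config reference linked above; setting "rbd readahead disable after bytes" to 0 keeps read-ahead enabled instead of turning it off after the first 50 MB read. The values for the other two options are placeholders chosen for illustration, not recommendations, and the helper name is hypothetical.

```python
import configparser
import io


def render_client_readahead(max_bytes=4 * 1024 * 1024, trigger_requests=10):
    """Render a [client] section for ceph.conf as text.

    max_bytes and trigger_requests are illustrative values only.
    """
    cfg = configparser.ConfigParser()
    cfg["client"] = {
        # Number of sequential reads needed before read-ahead kicks in.
        "rbd readahead trigger requests": str(trigger_requests),
        # Upper bound on the size of a single read-ahead request.
        "rbd readahead max bytes": str(max_bytes),
        # 0 = never disable read-ahead for an open image.
        "rbd readahead disable after bytes": "0",
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


print(render_client_readahead())
```

The VM has to reopen the image (for example after a stop/start of the instance) before changed client options take effect.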

Christian

> >Monday, 11 July 2016, 12:38 +05:00 from Christian Balzer <chibi@xxxxxxx>:

> >

> >

> >Hello,

> >

> >On Mon, 11 Jul 2016 09:54:59 +0300 K K wrote:

> >

> >> 

> >> > I hope the fastest of these MONs (CPU and storage) has the lowest IP

> >> > number and thus is the leader.

> >> No, the lowest IP has the slowest CPU. But Zabbix didn't show any load on any of the MONs.

> >

> >In your use case and configuration no surprise, but again, the lowest IP

> >will be leader by default and thus the busiest. 

> >

> >> > Also what Ceph, OS, kernel version?

> >> 

> >> ubuntu 16.04 kernel 4.4.0-22

> >> 

> >Check the ML archives, I remember people having performance issues with the

> >4.4 kernels.

> >

> >Still don't know your Ceph version, is it the latest Jewel?

> >

> >> > Two GbE ports, given the "frontend" up there with the MON description I

> >> > assume that's 1 port per client (front) and cluster (back) network?

> >> yes, one GbE for ceph client, one GbE for back network.

> >OK, so (from a single GbE client) 100MB/s at most.

> >

> >> > Is there any other client than that Windows VM on your Ceph cluster?

> >> Yes, one other instance, but it has no load.

> >OK.

> >

> >> > Is Ceph understanding this now?

> >> > Other than that, the queue options aren't likely to do much good with pure

> >> >HDD OSDs.

> >> 

> >> I can't find those parameters in the running config:

> >> ceph --admin-daemon /var/run/ceph/ceph-mon.block01.asok config show|grep "filestore_queue"

> >

> >These are OSD parameters, you need to query an OSD daemon. 

> >

> >> "filestore_queue_max_ops": "3000",

> >> "filestore_queue_max_bytes": "1048576000",

> >> "filestore_queue_max_delay_multiple": "0",

> >> "filestore_queue_high_delay_multiple": "0",

> >> "filestore_queue_low_threshhold": "0.3",

> >> "filestore_queue_high_threshhold": "0.9",

> >> > That should be 512, 1024 really with one RBD pool.

> >> 

> >> Yes, I know. Today I added a scbench pool with 128 PGs for testing.

> >> Here is the output of ceph status and the OSD tree:

> >> ceph status

> >> cluster 830beb43-9898-4fa9-98c1-ee04c1cdf69c

> >> health HEALTH_OK

> >> monmap e6: 3 mons at {block01=10.30.9.21:6789/0,object01=10.30.9.129:6789/0,object02=10.30.9.130:6789/0}

> >> election epoch 238, quorum 0,1,2 block01,object01,object02

> >> osdmap e6887: 18 osds: 18 up, 18 in

> >> pgmap v9738812: 1280 pgs, 3 pools, 17440 GB data, 4346 kobjects

> >> 35049 GB used, 15218 GB / 50267 GB avail

> >> 1275 active+clean

> >> 3 active+clean+scrubbing+deep

> >> 2 active+clean+scrubbing

> >>

> >Check the ML archives and restrict scrubs to off-peak hours as well as

> >tune things to keep their impact low.

> >

> >Scrubbing is a major performance killer, especially on non-SSD journal

> >OSDs and with older Ceph versions and/or non-tuned parameters:

> >---

> >osd_scrub_end_hour = 6

> >osd_scrub_load_threshold = 2.5

> >osd_scrub_sleep = 0.1

> >---
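
To illustrate what such a scrub window means, here is a minimal sketch of the hour check. It assumes the window starts at hour 0 (a start-hour option is not shown in the settings above, so begin_hour is an assumption) and that a window whose start is later than its end wraps around midnight. This illustrates the idea only and is not Ceph's actual implementation.

```python
def scrub_allowed(hour, begin_hour=0, end_hour=6):
    """Return True if new scrubs may start at the given local hour (0-23).

    begin_hour=0 is an assumed default; end_hour=6 matches the
    osd_scrub_end_hour value quoted above.
    """
    if begin_hour < end_hour:
        # Plain window inside one day, e.g. 00:00 to 06:00.
        return begin_hour <= hour < end_hour
    # Window wraps around midnight, e.g. 22:00 to 06:00.
    return hour >= begin_hour or hour < end_hour


for h in (3, 6, 12, 23):
    print(h, scrub_allowed(h))
```

As noted at the top of this mail, such settings only affect scrubs that have not started yet.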

> >

> >> client io 5030 kB/s rd, 1699 B/s wr, 19 op/s rd, 0 op/s wr

> >> 

> >> ID WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
> >> -1 54.00000 root default
> >> -2 27.00000     host cn802
> >>  0  3.00000         osd.0       up  1.00000          1.00000
> >>  2  3.00000         osd.2       up  1.00000          1.00000
> >>  4  3.00000         osd.4       up  1.00000          1.00000
> >>  6  3.00000         osd.6       up  0.89995          1.00000
> >>  8  3.00000         osd.8       up  1.00000          1.00000
> >> 10  3.00000         osd.10      up  1.00000          1.00000
> >> 12  3.00000         osd.12      up  0.89999          1.00000
> >> 16  3.00000         osd.16      up  1.00000          1.00000
> >> 18  3.00000         osd.18      up  0.90002          1.00000
> >> -3 27.00000     host cn803
> >>  1  3.00000         osd.1       up  1.00000          1.00000
> >>  3  3.00000         osd.3       up  0.95316          1.00000
> >>  5  3.00000         osd.5       up  1.00000          1.00000
> >>  7  3.00000         osd.7       up  1.00000          1.00000
> >>  9  3.00000         osd.9       up  1.00000          1.00000
> >> 11  3.00000         osd.11      up  0.95001          1.00000
> >> 13  3.00000         osd.13      up  1.00000          1.00000
> >> 17  3.00000         osd.17      up  0.84999          1.00000
> >> 19  3.00000         osd.19      up  1.00000          1.00000

> >> > Wrong way to test this, test it from a monitor node, another client node

> >> > (like your openstack nodes).

> >> > In your 2 node cluster half of the reads or writes will be local, very

> >> > much skewing your results.

> >> I have also tested from a compute node and got the same result: 80-100MB/sec.

> >> 

> >That's about as good as it gets (not 148MB/s, though!).

> >But rados bench is not the same as real client I/O.
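
The GbE ceiling mentioned here can be checked with simple arithmetic. A minimal sketch, assuming 1 Gbit/s = 1000 Mbit/s and ignoring Ethernet, TCP, and Ceph protocol overhead, so real payload rates are somewhat lower than these figures; the function names are made up for illustration.

```python
def mbit_to_mbyte(mbit_per_s):
    """Convert a link rate in Mbit/s to MB/s (8 bits per byte)."""
    return mbit_per_s / 8.0


def possible_over_link(mb_per_s, link_mbit=1000):
    """True if a throughput figure fits within the raw rate of the link."""
    return mb_per_s <= mbit_to_mbyte(link_mbit)


print("raw GbE line rate:", mbit_to_mbyte(1000), "MB/s")
print("iperf 950-970 Mbit/s:", mbit_to_mbyte(950), "to", mbit_to_mbyte(970), "MB/s")
print("100 MB/s over one GbE port:", possible_over_link(100))
print("148 MB/s over one GbE port:", possible_over_link(148))
```

A peak of 148 MB/s therefore has to include I/O that never crossed the client network, which fits a benchmark started on a storage node itself.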

> >

> >> > Very high max latency, telling us that your cluster ran out of steam at

> >> some point.

> >> 

> >> I am copying data from my Windows instance right now.

> >

> >Re-do any testing when you've stopped all scrubbing.

> >

> >> > I'd de-frag anyway, just to rule that out.

> >> 

> >> 

> >> >When doing your tests or normal (busy) operations from the client VM, run

> >> > atop on your storage nodes and observe your OSD HDDs. 

> >> > Do they get busy, around 100%?

> >> 

> >> Yes, high IO load (600-800 IOPS). But that is very strange for SATA HDDs. Each HDD has its own OSD daemon and is presented to the OS as a hardware RAID0 volume (each block node has a hardware RAID controller). Example:

> >

> >Your RAID controller and its HW cache are likely to help with that speed,

> >also all of these are reads, most likely the scrubs above, not a single

> >write to be seen.

> >

> >> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >>            1.44    0.00    3.56   17.56    0.00   77.44
> >>
> >> Device: rrqm/s wrqm/s    r/s  w/s     rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> >> sdb       0.00   0.00   0.00 0.00      0.00  0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
> >> sdc       0.00   0.00 649.00 0.00  82912.00  0.00   255.51     8.30 12.74   12.74    0.00  1.26 81.60
> >> sdd       0.00   0.00   0.00 0.00      0.00  0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
> >> sde       0.00   0.00   0.00 0.00      0.00  0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
> >> sdf       0.00   0.00 761.00 0.00  94308.00  0.00   247.85     8.66 11.26   11.26    0.00  1.18 90.00
> >> sdg       0.00   0.00 761.00 0.00  97408.00  0.00   256.00     7.80 10.22   10.22    0.00  1.01 76.80
> >> sdh       0.00   0.00 801.00 0.00 102344.00  0.00   255.54     8.05 10.05   10.05    0.00  0.96 76.80
> >> sdi       0.00   0.00   0.00 0.00      0.00  0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
> >> sdj       0.00   0.00 537.00 0.00  68736.00  0.00   256.00     5.54 10.26   10.26    0.00  0.98 52.80

> >> 

> >> 

> >> > Check with iperf or NPtcp that your network to the clients from the

> >> > storage nodes is fully functional. 

> >> The network has been tested with iperf: 950-970 Mbit/s between all nodes in the clusters (OpenStack and Ceph).

> >

> >Didn't think it was that, one thing off the list to check.

> >

> >Christian

> >

> >Monday, 11 July 2016, 10:58 +05:00 from Christian Balzer <chibi@xxxxxxx>:

> >> >

> >> >

> >> >Hello,

> >> >

> >> >On Mon, 11 Jul 2016 07:35:02 +0300 K K wrote:

> >> >

> >> >> 

> >> >> Hello, guys

> >> >> 

> >> >> I am facing poor performance in a Windows Server 2012 R2 instance running

> >> >> on RBD (OpenStack cluster). The RBD disk is 17 TB in size. My Ceph cluster

> >> >> consists of:

> >> >> - 3 monitor nodes (Celeron G530/6 GB RAM, DualCore E6500/2 GB RAM,

> >> >> Core2Duo E7500/2 GB RAM). Each node has a 1 Gbit connection to the frontend subnet

> >> >> of the Ceph cluster.

> >> >

> >> >I hope the fastest of these MONs (CPU and storage) has the lowest IP

> >> >number and thus is the leader.

> >> >

> >> >Also what Ceph, OS, kernel version?

> >> >

> >> >> - 2 block nodes (Xeon E5620/32 GB RAM/2*1 Gbit NIC). Each node has

> >> >> 2*500 GB HDDs for the operating system and 9*3 TB SATA HDDs (WD SE). In total, 18

> >> >> OSD daemons on 2 nodes.

> >> >

> >> >Two GbE ports, given the "frontend" up there with the MON description I

> >> >assume that's 1 port per client (front) and cluster (back) network?

> >> >

> >> >>Journals are placed on the same HDDs as the RADOS data. I

> >> >> know that it is better to use separate SSDs for that purpose.

> >> >Indeed...

> >> >

> >> >>When I tested

> >> >> the new Windows instance, performance was good (read/write around

> >> >> 100 MB/s). But after I copied 16 TB of data to the Windows instance, read

> >> >> performance dropped to 10 MB/s. The data on the VM is images and video.

> >> >> 

> >> >100MB/s would be absolutely perfect with the setup you have, assuming no

> >> >contention (other clients).

> >> >

> >> >Is there any other client than that Windows VM on your Ceph cluster?

> >> >

> >> >> ceph.conf on client side:

> >> >> [global]

> >> >> auth cluster required = cephx

> >> >> auth service required = cephx

> >> >> auth client required = cephx

> >> >> filestore xattr use omap = true

> >> >> filestore max sync interval = 10

> >> >> filestore queue max ops = 3000

> >> >> filestore queue commiting max bytes = 1048576000

> >> >> filestore queue commiting max ops = 5000

> >> >> filestore queue max bytes = 1048576000

> >> >> filestore queue committing max ops = 4096

> >> >> filestore queue committing max bytes = 16 MiB

> >> >                                            ^^^

> >> >Is Ceph understanding this now?

> >> >Other than that, the queue options aren't likely to do much good with pure

> >> >HDD OSDs.

> >> >

> >> >> filestore op threads = 20

> >> >> filestore flusher = false

> >> >> filestore journal parallel = false

> >> >> filestore journal writeahead = true

> >> >> journal dio = true

> >> >> journal aio = true

> >> >> journal force aio = true

> >> >> journal block align = true

> >> >> journal max write bytes = 1048576000

> >> >> journal_discard = true

> >> >> osd pool default size = 2 # Write an object n times.

> >> >> osd pool default min size = 1

> >> >> osd pool default pg num = 333

> >> >> osd pool default pgp num = 333

> >> >That should be 512, 1024 really with one RBD pool.

> >> > http://ceph.com/pgcalc/
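
The suggested PG counts follow from the usual rule of thumb behind the calculator linked above. A minimal sketch, assuming a target of about 100 PGs per OSD, division by the replica count, and rounding up to the next power of two; the real calculator applies a slightly more nuanced rounding rule, and the function name is hypothetical.

```python
def suggested_pg_count(num_osds, replicas, target_pgs_per_osd=100):
    """Rule-of-thumb total PG count for a cluster with one main pool.

    target_pgs_per_osd=100 is the commonly quoted target and is an
    assumption here, not something stated in this thread.
    """
    raw = num_osds * target_pgs_per_osd / replicas
    power = 1
    while power < raw:
        power *= 2  # round up to the next power of two
    return power


# 18 OSDs and size = 2, as in the cluster described in this thread.
print(suggested_pg_count(18, 2))
```

With 18 OSDs and two replicas the raw value is 900, which rounds up to 1024.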

> >> >

> >> >> osd crush chooseleaf type = 1

> >> >> 

> >> >> [client]

> >> >> rbd cache = true

> >> >> rbd cache size = 67108864

> >> >> rbd cache max dirty = 50331648

> >> >> rbd cache target dirty = 33554432

> >> >> rbd cache max dirty age = 2

> >> >> rbd cache writethrough until flush = true

> >> >> 

> >> >> 

> >> >> rados bench run from a block node shows:

> >> >Wrong way to test this, test it from a monitor node, another client node

> >> >(like your openstack nodes).

> >> >In your 2 node cluster half of the reads or writes will be local, very

> >> >much skewing your results.

> >> >

> >> >> rados bench -p scbench 120 write --no-cleanup

> >> >

> >> >Default tests with 4MB "blocks", what are the writes or reads from your

> >> >client VM like?

> >> >

> >> >> Total time run: 120.399337

> >> >> Total writes made: 3538

> >> >> Write size: 4194304

> >> >> Object size: 4194304

> >> >> Bandwidth (MB/sec): 117.542

> >> >> Stddev Bandwidth: 9.31244

> >> >> Max bandwidth (MB/sec): 148 

> >> >                          ^^^

> >> >That wouldn't be possible from an external client.

> >> >

> >> >> Min bandwidth (MB/sec): 92

> >> >> Average IOPS: 29

> >> >> Stddev IOPS: 2

> >> >> Max IOPS: 37

> >> >> Min IOPS: 23

> >> >> Average Latency(s): 0.544365

> >> >> Stddev Latency(s): 0.35825

> >> >> Max latency(s): 5.42548

> >> >Very high max latency, telling us that your cluster ran out of steam at

> >> >some point.

> >> >

> >> >> Min latency(s): 0.101533

> >> >> 

> >> >> rados bench -p scbench 120 seq

> >> >> Total time run: 120.880920

> >> >> Total reads made: 1932

> >> >> Read size: 4194304

> >> >> Object size: 4194304

> >> >> Bandwidth (MB/sec): 63.9307

> >> >> Average IOPS 15

> >> >> Stddev IOPS: 3

> >> >> Max IOPS: 25

> >> >> Min IOPS: 5

> >> >> Average Latency(s): 0.999095

> >> >> Max latency(s): 8.50774

> >> >> Min latency(s): 0.0391591

> >> >> 

> >> >> rados bench -p scbench 120 rand

> >> >> Total time run: 121.059005

> >> >> Total reads made: 1920

> >> >> Read size: 4194304

> >> >> Object size: 4194304

> >> >> Bandwidth (MB/sec): 63.4401

> >> >> Average IOPS: 15

> >> >> Stddev IOPS: 4

> >> >> Max IOPS: 26

> >> >> Min IOPS: 1

> >> >> Average Latency(s): 1.00785

> >> >> Max latency(s): 6.48138

> >> >> Min latency(s): 0.038925
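
The bandwidth lines in the three runs above can be reproduced from the operation counts, the 4194304-byte operation size, and the run times. A minimal sketch, assuming the "MB" in this output means 2^20 bytes (the numbers only match under that assumption); the function name is made up for illustration.

```python
MIB = 1024 * 1024  # rados bench "MB" is taken to be 2^20 bytes here


def bench_bandwidth(ops, op_size_bytes, seconds):
    """Average bandwidth in MB/s from operation count, size, and run time."""
    return ops * op_size_bytes / MIB / seconds


# (operations, run time in seconds) copied from the quoted output.
runs = {
    "write": (3538, 120.399337),
    "seq": (1932, 120.880920),
    "rand": (1920, 121.059005),
}

for name, (ops, secs) in runs.items():
    bw = bench_bandwidth(ops, 4194304, secs)
    print(f"{name}: {bw:.3f} MB/s, {ops / secs:.1f} ops/s")
```

With 4 MB operations the IOPS figure is simply the bandwidth divided by four, which is why this test says little about the small-block I/O a Windows guest typically issues.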

> >> >> 

> >> >> Fragmentation on the XFS partitions is no more than 1%.

> >> >I'd de-frag anyway, just to rule that out.

> >> >

> >> >When doing your tests or normal (busy) operations from the client VM, run

> >> >atop on your storage nodes and observe your OSD HDDs. 

> >> >Do they get busy, around 100%?

> >> >

> >> >Check with iperf or NPtcp that your network to the clients from the

> >> >storage nodes is fully functional. 

> >> >

> >> >Christian

> >> >-- 

> >> >Christian Balzer        Network/Systems Engineer 

> >> > chibi@xxxxxxx Global OnLine Japan/Rakuten Communications

> >> > http://www.gol.com/

> >> 

> >

> >


> 

-- 

Christian Balzer        Network/Systems Engineer                

chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications

http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com