Re: Slow performance into windows VM

Hello,

On Mon, 11 Jul 2016 09:54:59 +0300 K K wrote:

> 
> > I hope the fastest of these MONs (CPU and storage) has the lowest IP
> > number and thus is the leader.
> No, the lowest IP has the slowest CPU. But Zabbix doesn't show any load on any of the MONs.

In your use case and configuration that's no surprise, but again, the lowest
IP will be the leader by default and thus the busiest.
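If you want to check, the current leader shows up as quorum_leader_name in:
---
ceph quorum_status --format json-pretty | grep quorum_leader_name
---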

> > Also what Ceph, OS, kernel version?
> 
> ubuntu 16.04 kernel 4.4.0-22
> 
Check the ML archives, I remember people having performance issues with the
4.4 kernels.

Still don't know your Ceph version, is it the latest Jewel?
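A quick way to check on each node:
---
ceph --version
---
and for the running daemons (which can differ from the installed packages
after an upgrade) the admin socket "version" command, e.g.
"ceph --admin-daemon /var/run/ceph/ceph-osd.<id>.asok version".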

> > Two GbE ports, given the "frontend" up there with the MON description I
> > assume that's 1 port per client (front) and cluster (back) network?
> yes, one GbE for ceph client, one GbE for back network.
OK, so (from a single GbE client) 100MB/s at most.
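(A single GbE link is 125MB/s on the wire; after Ethernet/IP/TCP overhead
roughly 117MB/s of payload is left, so 100-110MB/s is what you realistically
see.)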

> > Is there any other client than that Windows VM on your Ceph cluster?
> Yes, there is one other instance, but it has no load.
OK.

> > Is Ceph understanding this now?
> > Other than that, the queue options aren't likely to do much good with pure
> >HDD OSDs.
> 
> I can't find those parameters in the running config:
> ceph --admin-daemon /var/run/ceph/ceph-mon.block01.asok config show|grep "filestore_queue"

These are OSD parameters, you need to query an OSD daemon. 
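For example on one of your block nodes (adjust the OSD id to one that
actually lives there):
---
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep filestore_queue
---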

> "filestore_queue_max_ops": "3000",
> "filestore_queue_max_bytes": "1048576000",
> "filestore_queue_max_delay_multiple": "0",
> "filestore_queue_high_delay_multiple": "0",
> "filestore_queue_low_threshhold": "0.3",
> "filestore_queue_high_threshhold": "0.9",
> > That should be 512, 1024 really with one RBD pool.
> 
> Yes, I know. Today, for testing, I added an scbench pool with 128 PGs.
> Here is the output of status and osd tree:
> ceph status
> cluster 830beb43-9898-4fa9-98c1-ee04c1cdf69c
> health HEALTH_OK
> monmap e6: 3 mons at {block01=10.30.9.21:6789/0,object01=10.30.9.129:6789/0,object02=10.30.9.130:6789/0}
> election epoch 238, quorum 0,1,2 block01,object01,object02
> osdmap e6887: 18 osds: 18 up, 18 in
> pgmap v9738812: 1280 pgs, 3 pools, 17440 GB data, 4346 kobjects
> 35049 GB used, 15218 GB / 50267 GB avail
> 1275 active+clean
> 3 active+clean+scrubbing+deep
> 2 active+clean+scrubbing
>
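As for the PG count, the usual rule of thumb is (number of OSDs * 100) /
replica size, rounded up to the next power of two: 18 * 100 / 2 = 900, so
1024 PGs for your main RBD pool (see http://ceph.com/pgcalc/).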
Check the ML archives, restrict scrubs to off-peak hours and tune things to
keep their impact low.

Scrubbing is a major performance killer, especially on non-SSD journal
OSDs and with older Ceph versions and/or non-tuned parameters:
---
osd_scrub_end_hour = 6
osd_scrub_load_threshold = 2.5
osd_scrub_sleep = 0.1
---
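These can also be injected at runtime without restarting the OSDs:
---
ceph tell osd.* injectargs '--osd_scrub_sleep 0.1 --osd_scrub_load_threshold 2.5'
---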

> client io 5030 kB/s rd, 1699 B/s wr, 19 op/s rd, 0 op/s wr
> 
> ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY 
> -1 54.00000 root default 
> -2 27.00000 host cn802 
> 0 3.00000 osd.0 up 1.00000 1.00000 
> 2 3.00000 osd.2 up 1.00000 1.00000 
> 4 3.00000 osd.4 up 1.00000 1.00000 
> 6 3.00000 osd.6 up 0.89995 1.00000 
> 8 3.00000 osd.8 up 1.00000 1.00000 
> 10 3.00000 osd.10 up 1.00000 1.00000 
> 12 3.00000 osd.12 up 0.89999 1.00000 
> 16 3.00000 osd.16 up 1.00000 1.00000 
> 18 3.00000 osd.18 up 0.90002 1.00000 
> -3 27.00000 host cn803 
> 1 3.00000 osd.1 up 1.00000 1.00000 
> 3 3.00000 osd.3 up 0.95316 1.00000 
> 5 3.00000 osd.5 up 1.00000 1.00000 
> 7 3.00000 osd.7 up 1.00000 1.00000 
> 9 3.00000 osd.9 up 1.00000 1.00000 
> 11 3.00000 osd.11 up 0.95001 1.00000 
> 13 3.00000 osd.13 up 1.00000 1.00000 
> 17 3.00000 osd.17 up 0.84999 1.00000 
> 19 3.00000 osd.19 up 1.00000 1.00000
> > Wrong way to test this, test it from a monitor node, another client node
> > (like your openstack nodes).
> > In your 2 node cluster half of the reads or writes will be local, very
> > much skewing your results.
> I have also tested from a compute node with the same result, 80-100MB/sec.
> 
That's about as good as it gets (not 148MB/s, though!).
But rados bench is not the same as real client I/O.
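For something closer to real client I/O, run fio inside the VM against the
RBD-backed disk, along these lines (adjust the filename; on the Windows guest
add --ioengine=windowsaio and point it at the big data drive):
---
fio --name=seqread --rw=read --bs=4M --size=8G --direct=1 --filename=testfile
---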

> > Very high max latency, telling us that your cluster ran out of steam at
> some point.
> 
> I am copying data from my Windows instance right now.

Re-do any testing when you've stopped all scrubbing.
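The quick way to do that while testing:
---
ceph osd set noscrub
ceph osd set nodeep-scrub
---
and "ceph osd unset ..." afterwards (already running scrubs may still finish,
but no new ones will start).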

> > I'd de-frag anyway, just to rule that out.
> 
> 
> >When doing your tests or normal (busy) operations from the client VM, run
> > atop on your storage nodes and observe your OSD HDDs. 
> > Do they get busy, around 100%?
> 
> Yes, high IO load (600-800 IOPS). But this is very strange for SATA HDDs. Each HDD has its own OSD daemon and is presented to the OS as a single-disk hardware RAID0 (each block node has a hardware RAID controller). Example:

Your RAID controller and its HW cache are likely helping with that speed.
Also, all of these are reads, most likely the scrubs above; not a single
write to be seen.
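You can confirm that by checking which PGs are scrubbing and on which OSDs
they live:
---
ceph pg dump | grep -i scrub
---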

> avg-cpu: %user %nice %system %iowait %steal %idle
> 1.44 0.00 3.56 17.56 0.00 77.44
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sdc 0.00 0.00 649.00 0.00 82912.00 0.00 255.51 8.30 12.74 12.74 0.00 1.26 81.60
> sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sdf 0.00 0.00 761.00 0.00 94308.00 0.00 247.85 8.66 11.26 11.26 0.00 1.18 90.00
> sdg 0.00 0.00 761.00 0.00 97408.00 0.00 256.00 7.80 10.22 10.22 0.00 1.01 76.80
> sdh 0.00 0.00 801.00 0.00 102344.00 0.00 255.54 8.05 10.05 10.05 0.00 0.96 76.80
> sdi 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sdj 0.00 0.00 537.00 0.00 68736.00 0.00 256.00 5.54 10.26 10.26 0.00 0.98 52.80
> 
> 
> > Check with iperf or NPtcp that your network to the clients from the
> > storage nodes is fully functional. 
> The network has been tested with iperf: 950-970Mbit between all nodes in the clusters (OpenStack and Ceph).

Didn't think it was that, one thing off the list to check.

Christian

Monday, 11 July 2016, 10:58 +05:00 from Christian Balzer
<chibi@xxxxxxx>:
> >
> >
> >Hello,
> >
> >On Mon, 11 Jul 2016 07:35:02 +0300 K K wrote:
> >
> >> 
> >> Hello, guys
> >> 
> >> I'm facing poor performance in a Windows 2k12r2 instance running
> >> on RBD (OpenStack cluster). The RBD disk is 17TB in size. My Ceph cluster
> >> consists of:
> >> - 3 monitor nodes (Celeron G530/6GB RAM, DualCore E6500/2GB RAM,
> >> Core2Duo E7500/2GB RAM). Each node has a 1Gbit link to the frontend subnet
> >> of the Ceph cluster
> >
> >I hope the fastest of these MONs (CPU and storage) has the lowest IP
> >number and thus is the leader.
> >
> >Also what Ceph, OS, kernel version?
> >
> >> - 2 block nodes (Xeon E5620/32GB RAM/2*1Gbit NIC). Each node has
> >> 2*500GB HDD for the operating system and 9*3TB SATA HDD (WD SE). 18
> >> OSD daemons total on the 2 nodes.
> >
> >Two GbE ports, given the "frontend" up there with the MON description I
> >assume that's 1 port per client (front) and cluster (back) network?
> >
> >> Journals are placed on the same HDDs as the RADOS data. I
> >> know that a separate SSD would be better for that purpose.
> >Indeed...
> >
> >> When I tested a
> >> new Windows instance, performance was good (read/write around
> >> 100MB/sec). But after I copied 16TB of data to the Windows instance, read
> >> performance dropped to 10MB/sec. The data on the VM is images and video.
> >> 
> >100MB/s would be absolutely perfect with the setup you have, assuming no
> >contention (other clients).
> >
> >Is there any other client than that Windows VM on your Ceph cluster?
> >
> >> ceph.conf on client side:
> >> [global]
> >> auth cluster required = cephx
> >> auth service required = cephx
> >> auth client required = cephx
> >> filestore xattr use omap = true
> >> filestore max sync interval = 10
> >> filestore queue max ops = 3000
> >> filestore queue commiting max bytes = 1048576000
> >> filestore queue commiting max ops = 5000
> >> filestore queue max bytes = 1048576000
> >> filestore queue committing max ops = 4096
> >> filestore queue committing max bytes = 16 MiB
> >                                            ^^^
> >Is Ceph understanding this now?
> >Other than that, the queue options aren't likely to do much good with pure
> >HDD OSDs.
> >
> >> filestore op threads = 20
> >> filestore flusher = false
> >> filestore journal parallel = false
> >> filestore journal writeahead = true
> >> journal dio = true
> >> journal aio = true
> >> journal force aio = true
> >> journal block align = true
> >> journal max write bytes = 1048576000
> >> journal_discard = true
> >> osd pool default size = 2 # Write an object n times.
> >> osd pool default min size = 1
> >> osd pool default pg num = 333
> >> osd pool default pgp num = 333
> >That should be 512, 1024 really with one RBD pool.
> >http://ceph.com/pgcalc/
> >
> >> osd crush chooseleaf type = 1
> >> 
> >> [client]
> >> rbd cache = true
> >> rbd cache size = 67108864
> >> rbd cache max dirty = 50331648
> >> rbd cache target dirty = 33554432
> >> rbd cache max dirty age = 2
> >> rbd cache writethrough until flush = true
> >> 
> >> 
> >> rados bench show from block node show:
> >Wrong way to test this, test it from a monitor node, another client node
> >(like your openstack nodes).
> >In your 2 node cluster half of the reads or writes will be local, very
> >much skewing your results.
> >
> >> rados bench -p scbench 120 write --no-cleanup
> >
> >Default tests with 4MB "blocks", what are the writes or reads from your
> >client VM like?
> >
> >> Total time run: 120.399337
> >> Total writes made: 3538
> >> Write size: 4194304
> >> Object size: 4194304
> >> Bandwidth (MB/sec): 117.542
> >> Stddev Bandwidth: 9.31244
> >> Max bandwidth (MB/sec): 148 
> >                          ^^^
> >That wouldn't be possible from an external client.
> >
> >> Min bandwidth (MB/sec): 92
> >> Average IOPS: 29
> >> Stddev IOPS: 2
> >> Max IOPS: 37
> >> Min IOPS: 23
> >> Average Latency(s): 0.544365
> >> Stddev Latency(s): 0.35825
> >> Max latency(s): 5.42548
> >Very high max latency, telling us that your cluster ran out of steam at
> >some point.
> >
> >> Min latency(s): 0.101533
> >> 
> >> rados bench -p scbench 120 seq
> >> Total time run: 120.880920
> >> Total reads made: 1932
> >> Read size: 4194304
> >> Object size: 4194304
> >> Bandwidth (MB/sec): 63.9307
> >> Average IOPS 15
> >> Stddev IOPS: 3
> >> Max IOPS: 25
> >> Min IOPS: 5
> >> Average Latency(s): 0.999095
> >> Max latency(s): 8.50774
> >> Min latency(s): 0.0391591
> >> 
> >> rados bench -p scbench 120 rand
> >> Total time run: 121.059005
> >> Total reads made: 1920
> >> Read size: 4194304
> >> Object size: 4194304
> >> Bandwidth (MB/sec): 63.4401
> >> Average IOPS: 15
> >> Stddev IOPS: 4
> >> Max IOPS: 26
> >> Min IOPS: 1
> >> Average Latency(s): 1.00785
> >> Max latency(s): 6.48138
> >> Min latency(s): 0.038925
> >> 
> >> On XFS partitions fragmentation no more than 1%
> >I'd de-frag anyway, just to rule that out.
> >
> >When doing your tests or normal (busy) operations from the client VM, run
> >atop on your storage nodes and observe your OSD HDDs. 
> >Do they get busy, around 100%?
> >
> >Check with iperf or NPtcp that your network to the clients from the
> >storage nodes is fully functional. 
> >
> >Christian
> >-- 
> >Christian Balzer        Network/Systems Engineer 
> >chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
> >http://www.gol.com/
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



