Regarding OpenNebula: it is working, but we find the network functionality less than flexible. We would prefer the orchestration layer to let each primary group build its own internal network infrastructure and then automatically provide NAT from one or more public IP addresses (think AWS and Azure). This doesn't seem to be implemented at this time, so it will likely require manual intervention per group of users. Otherwise we like the software and find it much more lightweight than OpenStack. We need a tool that can be managed by a very small team, and OpenNebula meets that goal.
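For what it's worth, the manual per-group workaround we have in mind looks roughly like the sketch below. This is only an illustration: the group, VLAN, interface, and address values are placeholders, and the NAT from the public IPs would still have to be configured by hand on a gateway outside of OpenNebula.

    # grp-physics.tmpl -- one virtual network per primary group (values are placeholders)
    NAME    = "grp-physics-net"
    VN_MAD  = "802.1Q"
    PHYDEV  = "eth1"
    VLAN_ID = "101"
    AR = [ TYPE = "IP4", IP = "192.168.101.10", SIZE = "200" ]

    onevnet create grp-physics.tmpl
    onevnet chgrp grp-physics-net physics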
Thanks for checking out this data for our test cluster. It isn't production, so yes, we are throwing spaghetti at the wall to make sure we are able to handle issues as they come up.
We had already planned to increase the PG count and have done so (thanks).
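For the record, the sizing followed the usual rule of thumb and the bump itself is just the standard pool commands; the exact numbers below are illustrative rather than what we necessarily ended up with:

    # rule of thumb: total PGs ~= (OSD count x 100) / replica count (k+m for EC pools),
    # rounded to a power of two; here (12 x 100) / 3 ~= 400 -> 512, split across pools by expected data
    ceph osd pool set cephfs-cold pg_num 256
    ceph osd pool set cephfs-cold pgp_num 256
    ceph osd pool set cephfs-hot pg_num 128
    ceph osd pool set cephfs-hot pgp_num 128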
Here is our OSD tree. As this is a test cluster, we are currently sharing the OSD disks between the cache tier (replica 3) and the data pool (erasure coded); more hardware is on the way so we can test using SSDs.
We have been reviewing atop, iostat, sar, and our SNMP monitoring (not granular enough) and have confirmed that the disks on this particular node are under a higher load than the others. We will likely take the time to deploy Graphite, since it will help with another project as well. One speculation discussed this morning is a bad cache battery on the PERC card in ceph-mon1, which could explain the +10 ms latency we see on all four drives. (That wouldn't be Ceph at all in this case.)
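In case it is useful, this is roughly how we are checking that theory on ceph-mon1. It assumes the MegaCli64 binary and/or Dell OpenManage are installed; exact paths and flags vary by PERC model:

    # per-disk latency: watch the await / %util columns for the four OSD disks
    iostat -x 5

    # OSD-level view of the same thing (the commit/apply latency table further down)
    ceph osd perf

    # PERC BBU state, and whether the controller has dropped from write-back to write-through
    /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL
    /opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -Cache -LALL -aALL

    # or, via Dell OpenManage
    omreport storage battery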
ID WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 3.12685 root default
-2 1.08875     host ceph-mon1
 0 0.27219         osd.0           up  1.00000          1.00000
 1 0.27219         osd.1           up  1.00000          1.00000
 2 0.27219         osd.2           up  1.00000          1.00000
 4 0.27219         osd.4           up  1.00000          1.00000
-3 0.94936     host ceph-mon2
 3 0.27219         osd.3           up  1.00000          1.00000
 5 0.27219         osd.5           up  1.00000          1.00000
 7 0.27219         osd.7           up  1.00000          1.00000
 9 0.13280         osd.9           up  1.00000          1.00000
-4 1.08875     host ceph-mon3
 6 0.27219         osd.6           up  1.00000          1.00000
 8 0.27219         osd.8           up  1.00000          1.00000
10 0.27219         osd.10          up  1.00000          1.00000
11 0.27219         osd.11          up  1.00000          1.00000
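Once the new SSDs arrive, the rough plan is to give them their own CRUSH root and point the hot and metadata pools at it. This is only a sketch under that assumption: the bucket and rule names are made up, the new OSDs would still need to be moved under the ssd root, and the rule id comes from "ceph osd crush rule dump":

    ceph osd crush add-bucket ssd root
    ceph osd crush add-bucket ceph-mon1-ssd host        # repeat per node
    ceph osd crush move ceph-mon1-ssd root=ssd
    ceph osd crush rule create-simple ssd_rule ssd host
    ceph osd pool set cephfs-hot crush_ruleset <ssd_rule_id>
    ceph osd pool set cephfs_metadata crush_ruleset <ssd_rule_id>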
From: Christian Balzer <chibi@xxxxxxx>
Sent: Wednesday, March 15, 2017 8:31 PM
To: ceph-users@xxxxxxxxxxxxxx
Cc: Rhian Resnick
Subject: Re: Odd latency numbers

Hello,

On Wed, 15 Mar 2017 16:49:00 +0000 Rhian Resnick wrote:

> Morning all,
>
> We are starting to apply load to our test CephFS system and are noticing some odd latency numbers. We are using erasure coding for the cold data pools and replication for our cache tiers (not on SSD yet). We noticed the following high latency on one node and it seems to be slowing down writes and reads on the cluster.
>
The pg dump below was massive overkill at this point in time, whereas a "ceph osd tree" would have probably shown us the topology (where is your tier, where are your EC pool(s)?). Same for a "ceph osd pool ls detail".

So if we were to assume that node is your cache tier (replica 1?), then the latencies would make sense. But that's guesswork, so describe your cluster in more detail.

And yes, a single slow OSD (stealthily failing drive, etc.) can bring a cluster to its knees. This is why many people here tend to get every last bit of info with collectd and feed it into carbon and graphite/grafana, etc. This will immediately indicate culprits and allow you to correlate this with other data, like actual disk/network/CPU load, etc.

For the time being run atop on that node and see if you can reduce the issue to something like "all disks are busy all the time" or "CPU meltdown".

> Our next step is to break out mds, mgr, and mons to different machines, but we wanted to start the discussion here.
>
If your nodes (not a single iota of HW/NW info from you) are powerful enough, breaking out stuff isn't likely to help or a necessity.

More below.

> Here is a bunch of information you may find useful.
>
> ceph.conf
>
> [global]
> fsid = XXXXX
> mon_initial_members = ceph-mon1, ceph-mon2, ceph-mon3
> mon_host = 10.141.167.238,10.141.160.251,10.141.161.249
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
>
> cluster network = 10.85.8.0/22
> public network = 10.141.0.0/16
>
> # we tested this with bluestore and xfs and have the same results
> [osd]
> enable_experimental_unrecoverable_data_corrupting_features = bluestore
>
I suppose this is not production in any shape or form.

> Status
>
>     cluster 8f6ba9d6-314d-4725-bcfa-340e500697f0
>      health HEALTH_OK
>      monmap e2: 3 mons at {ceph-mon1=10.141.167.238:6789/0,ceph-mon2=10.141.160.251:6789/0,ceph-mon3=10.141.161.249:6789/0}
>             election epoch 12, quorum 0,1,2 ceph-mon2,ceph-mon3,ceph-mon1
>       fsmap e30: 1/1/1 up {0=ceph-mon3=up:active}, 2 up:standby
>         mgr active: ceph-mon3 standbys: ceph-mon2, ceph-mon1
>      osdmap e100: 12 osds: 12 up, 12 in
>             flags sortbitwise,require_jewel_osds,require_kraken_osds
>       pgmap v119525: 124 pgs, 6 pools, 471 GB data, 1141 kobjects
>             970 GB used, 2231 GB / 3202 GB avail
>                  124 active+clean
>   client io 11962 B/s rd, 11 op/s rd, 0 op/s wr
>
At first glance there seem to be way too few PGs here, even given the low number of OSDs.

> Pool space usage
>
Irrelevant.

> GLOBAL:
>     SIZE      AVAIL     RAW USED     %RAW USED
>     3202G     2231G         970G         30.31
> POOLS:
>     NAME                ID     USED       %USED     MAX AVAIL     OBJECTS
>     rbd                 0           0         0          580G           0
>     cephfs-hot          1      76137M     11.35          580G      466451
>     cephfs-cold         2        397G     25.48         1161G      650158
>     cephfs_metadata     3      47237k         0          580G       52275
>     one-hot             4           0         0          580G           0
>     one                 5           0         0         1161G           0
>
An aside, how happy are you with OpenNebula and Ceph? I found that the lack of a migration network option in ON is a show stopper for us.
> OSD Performance and Latency
>
> osd     commit_latency(ms)     apply_latency(ms)
>   9                      1                     1
>   8                      1                     1
>   0                     13                    13
>  11                      1                     1
>   1                     38                    38
>  10                      2                     2
>   2                     21                    21
>   3                      2                     2
>   4                     20                    20
>   5                      1                     1
>   6                      1                     1
>   7                      1                     1
>
I found these counters to be less than reliable, or at least relevant, unless there is constant activity and they are read frequently as well.

For example, on a cluster where the HDD based OSDs are basically dormant write-wise most of the time while the SSD based cache tier is very busy:
---
16      76     124
17      66      99
18       0       1
19       0       0
---
The first two are HDD OSDs, the second two are SSD OSDs in the cache tier. And I can assure you that the HDD based OSDs (which have journal SSDs and are really RAID10s behind a 4GB HW cache raid controller) are not that slow.

[snip]

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
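To make the collectd suggestion above concrete, a minimal sketch of a collectd configuration for one OSD node (assuming collectd 5.5+ with the ceph and write_graphite plugins, and a placeholder carbon host) could look like:

    LoadPlugin ceph
    LoadPlugin write_graphite

    <Plugin ceph>
      # one Daemon block per admin socket on the node
      <Daemon "osd.0">
        SocketPath "/var/run/ceph/ceph-osd.0.asok"
      </Daemon>
      <Daemon "osd.1">
        SocketPath "/var/run/ceph/ceph-osd.1.asok"
      </Daemon>
    </Plugin>

    <Plugin write_graphite>
      <Node "carbon">
        Host "graphite.example.com"   # placeholder host
        Port "2003"
        Protocol "tcp"
      </Node>
    </Plugin>

From there, carbon/graphite (or Grafana on top) gives the per-OSD latency history that makes a single slow disk stand out immediately.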