Hello,

On Wed, 15 Mar 2017 16:49:00 +0000 Rhian Resnick wrote:

> Morning all,
>
> We are starting to apply load to our test cephfs system and are noticing
> some odd latency numbers. We are using erasure coding for the cold data
> pools and replication for our cache tiers (not on SSD yet). We noticed
> the following high latency on one node and it seems to be slowing down
> writes and reads on the cluster.
>

The pg dump below was massive overkill at this point in time, whereas a
"ceph osd tree" would have probably shown us the topology (where is your
tier, where are your EC pool(s)?). Same for a "ceph osd pool ls detail".

So if we were to assume that node is your cache tier (replica 1?), then
the latencies would make sense. But that's guesswork, so describe your
cluster in more detail.

And yes, a single slow OSD (stealthily failing drive, etc.) can bring a
cluster to its knees. This is why many people here tend to get every last
bit of info with collectd and feed it into carbon and graphite/grafana,
etc. This will immediately indicate culprits and allow you to correlate
this with other data, like actual disk/network/CPU load.

For the time being run atop on that node and see if you can reduce the
issue to something like "all disks are busy all the time" or "CPU
meltdown".

> Our next step is to break out mds, mgr, and mons to different machines,
> but we wanted to start the discussion here.
>

If your nodes (not a single iota of HW/NW info from you) are powerful
enough, breaking things out isn't likely to help, nor is it a necessity.
More below.

> Here is a bunch of information you may find useful.
>
> ceph.conf
>
> [global]
> fsid = XXXXX
> mon_initial_members = ceph-mon1, ceph-mon2, ceph-mon3
> mon_host = 10.141.167.238,10.141.160.251,10.141.161.249
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
>
> cluster network = 10.85.8.0/22
> public network = 10.141.0.0/16
>
> # we tested this with bluestore and xfs and have the same results
> [osd]
> enable_experimental_unrecoverable_data_corrupting_features = bluestore
>

I suppose this is not production in any shape or form.

> Status
>
>     cluster 8f6ba9d6-314d-4725-bcfa-340e500697f0
>      health HEALTH_OK
>      monmap e2: 3 mons at {ceph-mon1=10.141.167.238:6789/0,ceph-mon2=10.141.160.251:6789/0,ceph-mon3=10.141.161.249:6789/0}
>             election epoch 12, quorum 0,1,2 ceph-mon2,ceph-mon3,ceph-mon1
>       fsmap e30: 1/1/1 up {0=ceph-mon3=up:active}, 2 up:standby
>         mgr active: ceph-mon3 standbys: ceph-mon2, ceph-mon1
>      osdmap e100: 12 osds: 12 up, 12 in
>             flags sortbitwise,require_jewel_osds,require_kraken_osds
>       pgmap v119525: 124 pgs, 6 pools, 471 GB data, 1141 kobjects
>             970 GB used, 2231 GB / 3202 GB avail
>                  124 active+clean
>   client io 11962 B/s rd, 11 op/s rd, 0 op/s wr
>

At first glance there seem to be way too few PGs here, even given the low
number of OSDs.

> Pool space usage
>

Irrelevant.

> GLOBAL:
>     SIZE      AVAIL     RAW USED     %RAW USED
>     3202G     2231G     970G         30.31
> POOLS:
>     NAME                ID     USED       %USED     MAX AVAIL     OBJECTS
>     rbd                 0      0          0         580G          0
>     cephfs-hot          1      76137M     11.35     580G          466451
>     cephfs-cold         2      397G       25.48     1161G         650158
>     cephfs_metadata     3      47237k     0         580G          52275
>     one-hot             4      0          0         580G          0
>     one                 5      0          0         1161G         0
>

An aside: how happy are you with OpenNebula and Ceph? I found that the
lack of a migration network option in ON is a show stopper for us.
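Coming back to the topology and slow-OSD questions above, this is roughly
where I would start (osd.1 is just a placeholder, substitute whichever
OSD/node you suspect; the daemon/iostat commands need to be run on the
node hosting that OSD):

---
# Topology and pool layout: tiering, EC profiles, sizes, pg_num
ceph osd tree
ceph osd pool ls detail
ceph osd df

# The same latency counters as in your listing, refreshed every 2 seconds
watch -n 2 ceph osd perf

# On the suspect node: per-OSD internals and raw disk utilization
ceph daemon osd.1 perf dump
iostat -x 2          # or atop, as mentioned above
---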
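And regarding the PG count: the usual rule of thumb is to aim for roughly
100 PGs per OSD once replication is factored in, rounded up to the next
power of two. Assuming your replicated pools are size 3 (a guess, you did
not post "ceph osd pool ls detail"), the back-of-the-envelope math for 12
OSDs looks like this:

---
echo $(( 12 * 100 / 3 ))     # -> 400, so roughly 512 PGs in total
---

That total then gets split across your 6 pools, weighted by how much data
you expect each one to hold (and using k+m instead of the replica size for
the EC pool). You have 124 PGs now; note that pg_num can be increased
later but not decreased, so don't massively overshoot either.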
> OSD Performance and Latency
>
> osd  commit_latency(ms)  apply_latency(ms)
>   9                   1                  1
>   8                   1                  1
>   0                  13                 13
>  11                   1                  1
>   1                  38                 38
>  10                   2                  2
>   2                  21                 21
>   3                   2                  2
>   4                  20                 20
>   5                   1                  1
>   6                   1                  1
>   7                   1                  1
>

I found these counters to be less than reliable, or at least less than
relevant, unless there is constant activity and they are read frequently
as well.

For example, on a cluster where the HDD based OSDs are basically dormant
write-wise most of the time while the SSD based cache tier is very busy:
---
 16       76     124
 17       66      99
 18        0       1
 19        0       0
---
The first two are HDD OSDs, the second two are SSD OSDs in the cache tier.
And I can assure you that the HDD based OSDs (which have journal OSD and
are really RAID10s behind a 4GB HW cache RAID controller) are not that
slow.

[snip]

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com