Re: Uneven pg distribution cause high fs_apply_latency on osds with more pgs

"shadow_lin"<shadow_lin@xxxxxxx> · Fri, 9 Mar 2018 12:26:12 +0800

Thanks for your advice.
I will try to reweight osds of my cluster.

Why is Ceph so sensitive to unbalanced PG distribution under high load?
`ceph osd df` output: https://pastebin.com/ur4Q9jsA
`ceph osd perf` output: https://pastebin.com/87DitPhV

No OSD has a very high PG count compared to the others. When the
write test load is low, everything seems fine, but during the high write load
test, some of the OSDs with more PGs can have 3-10 times the fs_apply_latency
of the others.
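
To put numbers on the spread, here is a small sketch using the figures from the original post quoted below (96 OSDs, 4096 PGs, size=2, busiest OSD at 104 PGs, least busy at 58). It assumes write load on an OSD is roughly proportional to the number of PG replicas it holds, which is a simplification. The function names are only for illustration.

```python
# Figures taken from the original post quoted further down in this thread.
NUM_OSDS = 96
NUM_PGS = 4096
REPLICA_SIZE = 2


def mean_pgs_per_osd(num_pgs, size, num_osds):
    """Average number of PG replicas held by each OSD."""
    return num_pgs * size / num_osds


def relative_load(pg_count, mean):
    """PG count as a multiple of the mean (1.0 means exactly average)."""
    return pg_count / mean


mean = mean_pgs_per_osd(NUM_PGS, REPLICA_SIZE, NUM_OSDS)
for label, pgs in [("max", 104), ("min", 58)]:
    print(f"{label}: {pgs} PGs = {relative_load(pgs, mean):.2f}x the mean of {mean:.1f}")
print(f"max/min ratio: {104 / 58:.2f}")
```

Under that assumption the busiest OSD carries about 1.22x the average load and the least busy about 0.68x, so the two ends differ by a factor of roughly 1.8 even though no single OSD looks extreme.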

My guess is that the heavily loaded OSDs slow the whole cluster down to the
rate they can handle (because I have only one pool spanning all OSDs).
So the other OSDs see a lower load and have good latency.

Is this expected under high load (indicating that the load is too high
for the current cluster to handle)?
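
One hedged way to reason about this is a textbook M/M/1 queue, where mean response time is service time divided by (1 - utilization). This is not a model of filestore or of Ceph's internals, and the 10 ms service time below is a made-up value; the sketch only shows how latency grows nonlinearly as a disk approaches saturation.

```python
def mm1_response_time(service_time_ms, utilization):
    """Mean time in system for an M/M/1 queue: S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)


SERVICE_TIME_MS = 10.0  # hypothetical per-request disk service time

for util in (0.50, 0.80, 0.95):
    latency = mm1_response_time(SERVICE_TIME_MS, util)
    print(f"utilization {util:.0%}: ~{latency:.0f} ms")
```

In this simplified model, going from 50% to 95% utilization is less than a 2x increase in load but a 10x increase in latency, which is at least the same order as the 3-10x gap described above.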

How does Luminous solve the uneven PG distribution problem? I read that
there is a pg-upmap exception table in the OSDMap in Luminous 12.2.x. It is
said that using it makes it possible to achieve a perfect PG distribution
among OSDs.
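
As I understand the idea, an exception table lets individual PGs be remapped from one OSD to another regardless of what CRUSH computed. The toy sketch below illustrates that idea only: it greedily moves PG replicas from the fullest OSD to the emptiest and records each move as a (pgid, from_osd, to_osd) entry. It is not how Ceph's balancer is implemented, it ignores CRUSH failure domains entirely, and `plan_upmap_items` and the sample mapping are hypothetical.

```python
from collections import Counter


def plan_upmap_items(pg_to_osds, all_osds, max_moves=1000):
    """Greedily build (pgid, from_osd, to_osd) exceptions until the PG
    counts per OSD differ by at most one. Toy model only."""
    mapping = {pg: list(osds) for pg, osds in pg_to_osds.items()}
    counts = Counter({osd: 0 for osd in all_osds})
    for osds in mapping.values():
        counts.update(osds)

    moves = []
    while len(moves) < max_moves:
        fullest = max(counts, key=counts.get)
        emptiest = min(counts, key=counts.get)
        if counts[fullest] - counts[emptiest] <= 1:
            break
        # Pick a PG on the fullest OSD that does not already use the emptiest.
        candidate = next(
            (pg for pg, osds in mapping.items()
             if fullest in osds and emptiest not in osds),
            None,
        )
        if candidate is None:
            break  # no legal move left in this toy model
        osds = mapping[candidate]
        osds[osds.index(fullest)] = emptiest
        counts[fullest] -= 1
        counts[emptiest] += 1
        moves.append((candidate, fullest, emptiest))
    return moves, dict(counts)


# Hypothetical 4-OSD cluster with 6 PGs at size=2.
sample = {
    "1.0": [0, 1], "1.1": [0, 1], "1.2": [0, 2],
    "1.3": [0, 1], "1.4": [1, 2], "1.5": [0, 3],
}
moves, counts = plan_upmap_items(sample, all_osds=[0, 1, 2, 3])
for pgid, src, dst in moves:
    print(f"pg {pgid}: osd.{src} -> osd.{dst}")
print(counts)
```

On a real Luminous cluster the corresponding mechanisms are the `ceph osd pg-upmap-items` command and the mgr balancer module in upmap mode, both of which require all clients to be Luminous or newer.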

2018-03-09 

shadow_lin 

  From: David Turner <drakonstein@xxxxxxxxx>
  Sent: 2018-03-09 06:45
  Subject: Re: [ceph-users] Uneven pg distribution cause high
  fs_apply_latency on osds with more pgs
  To: "shadow_lin"<shadow_lin@xxxxxxx>
  Cc: "ceph-users"<ceph-users@xxxxxxxxxxxxxx>

  PGs being unevenly distributed is a common occurrence in
  Ceph.  Luminous started taking some steps towards correcting this, but
  you're on Jewel.  There are a lot of threads in the ML archives about
  fixing PG distribution.  Generally every method comes down to increasing
  the weight on OSDs with too few PGs and decreasing the weight on the OSDs with
  too many PGs.  There are many schools of thought on the best way to
  implement this; the right one for your environment depends on your client
  IO patterns and workloads.  Looking into `ceph osd reweight-by-pg` might
  be a good place for you to start, as you are only looking at 1 pool in your
  cluster.  If you have more pools, you generally need
  `ceph osd reweight-by-utilization`.
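
The general idea of nudging weights toward the mean can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm behind `ceph osd reweight-by-pg`: `propose_reweights` and `max_change` are hypothetical names, and it assumes the override weight set with `ceph osd reweight` stays within 0.0 to 1.0, so an underfull OSD already at 1.0 cannot be raised further.

```python
def propose_reweights(pg_counts, current=None, max_change=0.05):
    """Suggest override reweights that nudge each OSD toward the mean PG count.

    pg_counts: {osd_id: number of PGs on that OSD}
    current:   {osd_id: current override weight}, defaults to 1.0 for all
    max_change: largest adjustment allowed per OSD in one pass
    """
    mean = sum(pg_counts.values()) / len(pg_counts)
    proposals = {}
    for osd, pgs in pg_counts.items():
        old = 1.0 if current is None else current.get(osd, 1.0)
        # Scale the weight by how far this OSD is from the mean.
        target = old * mean / pgs if pgs else 1.0
        # Limit the step so data movement stays gradual.
        step = max(-max_change, min(max_change, target - old))
        # Clamp to the assumed valid range for override weights.
        proposals[osd] = round(min(1.0, max(0.0, old + step)), 4)
    return proposals


# Hypothetical three-OSD example using PG counts in the range from this thread.
print(propose_reweights({0: 104, 1: 85, 2: 58}))
```

Small steps repeated over several passes are the usual design choice here, since each weight change triggers data movement and the resulting PG counts are not perfectly predictable.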

  On Wed, Mar 7, 2018 at 8:19 AM shadow_lin <shadow_lin@xxxxxxx> wrote:

    Hi list,
    The Ceph version is Jewel 10.2.10 and all OSDs are using filestore.
    The cluster has 96 OSDs and 1 pool with size=2 replication and 4096 PGs
    (based on the PG calculation method from the Ceph docs, targeting
    100 PGs per OSD).
    The OSD with the most PGs has 104, and 6 OSDs have more than 100 PGs.
    Most of the OSDs have roughly 70 to 99 PGs.
    The OSD with the fewest PGs has 58.

    During the write test, some of the OSDs have very high
    fs_apply_latency, around 1000-4000 ms, while the normal ones are
    around 100-600 ms. The OSDs with high latency are always the ones
    holding more PGs.

    iostat on the high-latency OSDs shows the HDDs at about 95%-96% %util,
    while the normal ones are at 40%-60% %util.

    I think the cause is that the OSDs with more PGs have to handle more
    write requests. Is this right?
    But even though the PG distribution is uneven, the variation is not
    that large. How can performance be so sensitive to it?

    Is there anything I can do to improve performance and reduce latency?

    How can I make the PG distribution more even?

    Thanks

    2018-03-07

    shadowlin

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com