Hi Mohamad!

On 31/12/2018 19:30, Mohamad Gebai wrote:
> On 12/31/18 4:51 AM, Marcus Murwall wrote:
>> What you say does make sense though, as I also get the feeling that the
>> osds are just waiting for something. Something that never happens, and
>> the request finally times out...
>
> So the OSDs are just completely idle? If not, try using strace and/or
> perf to get some insights into what they're doing.
>
> Maybe someone with better knowledge of EC internals will suggest
> something. In the meantime, you might want to look at the client side.
> Could the client be somehow saturated or blocked on something? (If the
> clients aren't blocked, you can use 'perf' or Mark's profiler [1] to
> profile them.)
>
> Try benchmarking with an iodepth of 1 and slowly increase it until you
> run into the issue, all while monitoring your resources. You might find
> something that causes the tipping point. Are you able to reproduce this
> using fio? Maybe this is just a client issue...
>
> Sorry for suggesting a bunch of things that are all over the place; I'm
> just trying to understand the state of the cluster (and clients). Are
> both the OSDs and the clients completely blocked, making no progress?
>
> Let us know what you find.

Just to not leave this thread dangling: thanks very much for your help
here. Unfortunately, we've decided to go with a replicated bucket data
pool in this case, and we've been unable to dig into this any further.
We've had no performance issues with the production pools since.

As for the dedicated benchmark EC pool, we've been swamped with other
issues, but if either of us finds some bandwidth (the mental kind) to
rerun the tests per your suggestions in the near future, we'll be sure
to report back.

Thanks again!

Cheers,
Florian
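
For anyone picking this thread up later: the iodepth sweep Mohamad describes
can be scripted. The following is only a rough sketch; the test file path,
block size, runtime, and the use of fio's libaio engine are assumptions and
not values taken from this thread.

    #!/usr/bin/env python3
    # Rough sketch: step fio's iodepth upward from 1 to look for the point
    # where the pool stops scaling. Target path and job parameters below
    # are placeholders, not values from this thread.
    import subprocess

    TARGET = "/mnt/bench/testfile"  # assumed test file on the cluster under test

    for iodepth in (1, 2, 4, 8, 16, 32):
        subprocess.run(
            [
                "fio",
                "--name=ec-iodepth-probe",
                "--filename=" + TARGET,
                "--rw=randwrite",
                "--bs=4M",
                "--size=4G",
                "--direct=1",
                "--ioengine=libaio",
                "--time_based",
                "--runtime=60",
                "--iodepth=%d" % iodepth,
                "--output=fio-iodepth-%d.log" % iodepth,
            ],
            check=True,
        )

Comparing throughput and latency between successive runs should show whether
there is a clear tipping point, and whether the client side is the bottleneck.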