Blocked requests/ops?

Hello,

On Wed May 27 21:20:49 2015, Christian Balzer wrote:

> 
> Hello,
> 
> On Wed, 27 May 2015 12:54:04 +0200 Xavier Serrano wrote:
> 
> > Hello,
> > 
> > Slow requests, blocked requests and blocked ops occur quite often
> > in our cluster; too often, I'd say: several times a day.
> > I must say we are running some tests, but we are far from pushing
> > the cluster to its limit (or at least, that's what I believe).
> > 
> > Every time a blocked request/operation happened, restarting the
> > affected OSD solved the problem.
> > 
> You should open a bug with that description and a way to reproduce it,
> even if it only reproduces sometimes.
> Slow disks (as opposed to an overloaded network) causing permanently
> blocked requests definitely shouldn't happen.
> 
I totally agree. I'll try to reproduce it and will definitely open a bug.
I'll let you know.


> > Yesterday, we wanted to see if it was possible to minimize the impact
> > that backfills and recovery have on normal cluster performance.
> > In our case, performance dropped from approximately 1000 cluster IOPS
> > to approximately 10 IOPS when some kind of recovery was running.
> > 
> > Thus, we reduced the parameters "osd max backfills" and "osd recovery
> > max active" to 1 (defaults are 10 and 15, respectively). Cluster
> > performance during recovery improved to 500-600 IOPS (approx),
> > and overall recovery time stayed approximately the same (surprisingly).
> > 
> There are some "sleep" values for recovery and scrub as well; these help
> a LOT with loaded clusters, too.
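
Thanks, we will look into those sleep values. For reference, this is
roughly how the limits we lowered can be applied (a sketch only; the
injectargs commands are the usual runtime route, and the scrub sleep
option shown is an assumption on our part, we have not tested it yet):

    # lower backfill/recovery concurrency on all OSDs at runtime
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
    # throttle scrubbing with a per-chunk sleep (seconds)
    ceph tell osd.* injectargs '--osd-scrub-sleep 0.1'

and persistently in ceph.conf:

    [osd]
    osd max backfills = 1
    osd recovery max active = 1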
> 
> > Since then, we have had no more slow/blocked requests/ops
> > (and our tests are still running). It is too soon to be sure, but
> > my guess is that the OSDs/disks in our cluster cannot cope with
> > all the I/O: network bandwidth is not an issue (10 GbE interconnection,
> > graphs show network usage is under control all the time), but the
> > spindles are not high-performance (WD Green). Eventually, this might
> > lead to slow/blocked requests/ops (which shouldn't occur that often).
> >
> Ah yes, I was going to comment on your HDDs earlier.
> As Dan van der Ster at CERN will happily admit, using green, slow HDDs
> with Ceph (and no SSD journals) is a bad idea.
> 
> You're likely to see a VAST improvement with even just 1 journal SSD (of
> sufficient speed and durability) for 10 of your HDDs; a 1:5 ratio would of
> course be better.

We do have SSDs, but we are not using them right now.
We have 4 SSDs per OSD host (24 SSDs at the moment).
The SSD model is Intel DC S3700 (400 GB).

We are testing different scenarios before making our final decision
(cache-tiering, journaling, separate pool,...).
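
For the journaling scenario, for example, the idea would be something
like this (a sketch only; device names are hypothetical, /dev/sdb
being one of the HDDs and /dev/sdu the shared journal SSD):

    # create an OSD with its journal on (a partition of) the SSD
    ceph-disk prepare /dev/sdb /dev/sdu
    ceph-disk activate /dev/sdb1

With 4 SSDs and 20 HDDs per host, that would give exactly the 1:5
journal ratio mentioned above.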


> However with 20 OSDs per node, you're likely to go from being
> bottlenecked by your HDDs to being CPU limited (when dealing with lots of
> small IOPS at least).
> Still, better than now for sure.
> 
This is very interesting, thanks for pointing it out!
What would you suggest using to identify the actual
bottleneck (disk, CPU, RAM, etc.)? Tools like Munin?
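
For example, would watching something simple like this (sysstat) be
enough to catch it, or is finer-grained tooling needed?

    # per-disk utilization, queue size and await, 5-second intervals
    iostat -x 5
    # per-core CPU usage, to spot a single saturated core
    mpstat -P ALL 5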

In addition, there are some kernel tunables that may help improve
overall performance. Maybe we are hitting some kernel limits
that constrain our results (for instance, we had to increase
fs.aio-max-nr in sysctl.d to 262144 to be able to use 20 disks per
host). Which tunables should we watch?
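
For reference, the change we made looks like this (the file name is
our own choice; the stock kernel default for fs.aio-max-nr is 65536):

    # /etc/sysctl.d/90-ceph-aio.conf
    # 20 OSDs per host exhausted the default AIO context limit
    fs.aio-max-nr = 262144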

Thank you very much again for your time.

Best regards,
- Xavier Serrano
- LCAC, Laboratori de Càlcul
- Departament d'Arquitectura de Computadors, UPC


> BTW, if your monitors are just used for that function, 128GB is total and
> utter overkill. 
> They will be fine with 16-32GB; your storage nodes will be much better
> served (pagecache for hot read objects) with more RAM.
> And with 20 OSDs per node 32GB is pretty close to the minimum I'd
> recommend anyway.
> 
>  
> > Reducing the I/O pressure caused by recovery and backfill undoubtedly
> > helped improve cluster performance during recovery; that was
> > expected. But we did not expect recovery time to stay the same...
> > The only explanation for this is that, during recovery, there are
> > lots of operations that fail due to a timeout, are retried several
> > times, etc.
> > 
> > So if the disks are the bottleneck, reducing those values may help
> > in normal cluster operation as well (when propagating the replicas,
> > for instance), so that slow/blocked requests/ops do not occur (or at
> > least occur less frequently).
> > 
> > Does this make sense to you? Any other thoughts?
> > 
> Very much so, see above for more thoughts.
> 
> Christian
> 


