Re: isolate_freepages_block and excessive CPU usage by OSD process

Christian Marie <christian@xxxxxxxxx> · Thu, 20 Nov 2014 08:20:13 +1100

On Wed, Nov 19, 2014 at 10:03:44PM +0400, Andrey Korolyov wrote:
> > We are using Mellanox ipoib drivers which do not do scatter-gather, so I'm
> > currently working on adding support for that (the hardware supports it). Are
> > you also using ipoib or have something else doing high order allocations? It's
> > a bit concerning for me if you don't as it would suggest that cutting down on
> > those allocations won't help.
> 
> So do I. On a test environment with regular tengig cards I was unable to
> reproduce the issue. Honestly, I thought that almost every contemporary
> driver for high-speed cards is working with scatter-gather, so I had not mlx
> in mind as a potential cause of this problem from very beginning.

Right, the drivers handle SG just fine, even in UD mode. It's just that as soon
as you switch to CM they turn off hardware IP checksums and SG support. The only
question I still need to answer before testing a patched driver is whether or
not the messages sent by Ceph are fragmented enough to save allocations. If not,
we could always patch Ceph as well, but this is beginning to snowball.
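
To put rough numbers on why SG matters here, this is a small sketch (not driver
code) of the page order that a single physically contiguous buffer needs. It
assumes 4 KiB pages, the function name is made up for illustration, and the
sizes ignore any header or skb overhead. 65520 is the maximum IPoIB
connected-mode MTU, used here only as an example size.

```python
def allocation_order(size_bytes, page_size=4096):
    """Smallest order such that 2**order pages hold size_bytes contiguously.

    Illustrative helper only; page_size=4096 is an assumption.
    """
    pages = -(-size_bytes // page_size)  # ceiling division
    order = 0
    while (1 << order) < pages:
        order += 1
    return order


if __name__ == "__main__":
    # Example payload sizes in bytes, without any protocol overhead.
    for size in (1500, 4096, 65520):
        print(size, "bytes -> order", allocation_order(size))
```

With SG the same payload could in principle be spread over separate order-0
pages, which is the allocation saving I'm hoping for.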

Here is the untested WIP patch for SG support in ipoib CM mode. I'm currently
talking to the original author of a larger patch about reviewing and splitting
it so that we can get both upstream:

https://gist.github.com/christian-marie/e8048b9c118bd3925957

> There are a couple of reports in ceph lists, complaining for OSD
> flapping/unresponsiveness without clear reason on certain (not always clear
> though) conditions which may have same root cause.

Possibly, though ipoib and Ceph seem to be a relatively rare combination.
Someone will likely find this thread if it is the same root cause.

> Wonder if numad-like mechanism will help there, but its usage is generally an
> anti-performance pattern in my experience.

We've played with zone_reclaim_mode and numad to no avail. The only thing we
haven't tried is striping, which I don't want to do anyway.

If these large allocations are indeed a reasonable thing to ask of the
compaction/reclaim subsystem, that seems like the best way forward. I have two
questions that follow from this conjecture:

Is compaction behaving badly or are we just asking for too many high order
allocations?
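
One way to get a feel for this is to watch how many free blocks exist at each
order in /proc/buddyinfo while the problem is happening. Below is a minimal
sketch of a parser for that file; the function names are my own, and the SAMPLE
text is made-up numbers for illustration, not output from our machines.

```python
# Made-up sample in the layout of /proc/buddyinfo (orders 0..10 per zone).
SAMPLE = """\
Node 0, zone      DMA      1      1      0      0      2      1      1      0      1      1      3
Node 0, zone   Normal   4096    512     64      8      0      0      0      0      0      0      0
"""


def parse_buddyinfo(text):
    """Return {(node, zone): [free block count per order, from order 0]}."""
    zones = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected layout: Node <n>, zone <name> <count0> <count1> ...
        if len(fields) < 5 or fields[0] != "Node":
            continue
        node = int(fields[1].rstrip(","))
        zones[(node, fields[3])] = [int(n) for n in fields[4:]]
    return zones


def free_blocks_at_or_above(counts, order):
    """Number of free blocks of at least the given order."""
    return sum(counts[order:])


if __name__ == "__main__":
    # On a live Linux box, read the text from /proc/buddyinfo instead.
    for (node, zone), counts in parse_buddyinfo(SAMPLE).items():
        print(node, zone, "free blocks of order >= 4:",
              free_blocks_at_or_above(counts, 4))
```

A zone with plenty of order-0 pages but nothing at the higher orders would
suggest fragmentation rather than an outright shortage of memory.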

Is this fixed in a later kernel? I haven't tested yet.