On Wed, Nov 19, 2014 at 4:21 AM, Christian Marie <christian@xxxxxxxxx> wrote:
>> Hello,
>>
>> I had found recently that the OSD daemons under certain conditions
>> (moderate vm pressure, moderate I/O, slightly altered vm settings) can
>> go into loop involving isolate_freepages and effectively hit Ceph
>> cluster performance.
>
> Hi! I'm the creator of the server fault issue you reference:
>
> http://serverfault.com/questions/642883/cause-of-page-fragmentation-on-large-server-with-xfs-20-disks-and-ceph
>
> I'd like to get to the bottom of this very much, I'm seeing a very similar
> pattern on 3.10.0-123.9.3.el7.x86_64, if this is fixed in later versions
> perhaps we could backport something.
>
> Here is some perf output:
>
> http://ponies.io/raw/compaction.png
>
> Looks pretty similar. I also have hundreds of MB logs and traces should we need
> some specific question answered.
>
> I've managed to reproduce many failed compactions with this:
>
> https://gist.github.com/christian-marie/cde7e80c5edb889da541
>
> I took some compaction stress test code and bolted on a little loop to mmap a
> large sparse file and read every PAGE_SIZEth byte.
>
> Run it once, compactions seem to do okay, run it again and they're really slow.
> This seems to be because my little trick to fill up cache memory only seems to
> work exactly half the time. Note that transhuge pages are only used to
> introduce fragmentation/pressure here, turning transparent huge pages off
> doesn't seem to make the slightest difference to the spinning-in-reclaim issue.
>
> We are using Mellanox ipoib drivers which do not do scatter-gather, so I'm
> currently working on adding support for that (the hardware supports it). Are
> you also using ipoib or have something else doing high order allocations? It's
> a bit concerning for me if you don't as it would suggest that cutting down on
> those allocations won't help.

So do I. On a test environment with regular tengig cards I was unable to reproduce the issue.
Honestly, I thought that almost every contemporary driver for high-speed cards works with scatter-gather, so I did not have mlx in mind as a potential cause of this problem from the very beginning. There are a couple of reports on the ceph lists complaining about OSD flapping/unresponsiveness without a clear reason under certain (not always clear, though) conditions, which may have the same root cause. I wonder if a numad-like mechanism would help there, but its usage is generally an anti-performance pattern in my experience.
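
For reference, the cache-filling trick described in the quoted message (mmap a large sparse file and read every PAGE_SIZEth byte) can be sketched roughly as below. This is a Python illustration only, not the code from the gist. The function name touch_sparse_file and its parameters are hypothetical, and whether each read actually populates the page cache depends on the filesystem; the quoted message itself notes that the trick only seemed to work half the time.

```python
import mmap
import os
import tempfile


def touch_sparse_file(size_bytes, directory=None):
    """Create a sparse temp file, mmap it, and read one byte per page.

    Hypothetical helper for illustration; returns the number of pages read.
    """
    page_size = mmap.PAGESIZE
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # Extend the file without writing data, so it stays sparse
        # on filesystems that support holes.
        os.ftruncate(fd, size_bytes)
        touched = 0
        with mmap.mmap(fd, size_bytes, access=mmap.ACCESS_READ) as mapping:
            for offset in range(0, size_bytes, page_size):
                # One read per page; the intent is to fault that page in.
                mapping[offset]
                touched += 1
        return touched
    finally:
        os.close(fd)
        os.unlink(path)
```

Calling touch_sparse_file with a size close to the amount of free memory, on the filesystem under test (via the directory argument), is the assumed way to apply cache pressure; the original reproducer combines a loop like this with compaction stress test code.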