On Sat, Nov 15, 2014 at 9:45 PM, Vlastimil Babka <vbabka@xxxxxxx> wrote:
> On 11/15/2014 06:10 PM, Andrey Korolyov wrote:
>> On Sat, Nov 15, 2014 at 7:32 PM, Vlastimil Babka <vbabka@xxxxxxx> wrote:
>>> On 11/15/2014 12:48 PM, Andrey Korolyov wrote:
>>>> Hello,
>>>>
>>>> I recently found that the OSD daemons, under certain conditions
>>>> (moderate VM pressure, moderate I/O, slightly altered VM settings),
>>>> can go into a loop involving isolate_freepages and effectively hurt
>>>> Ceph cluster performance. I found this thread
>>>
>>> Do you feel it is a regression compared to some older kernel version,
>>> or something like that?
>>
>> No, it's just rare but very concerning. The higher the pressure, the
>> higher the chance of hitting this particular issue, although the
>> absolute numbers are still very large (e.g. there is plenty of room
>> for cache memory). Some googling also turned up a similar question on
>> serverfault:
>> http://serverfault.com/questions/642883/cause-of-page-fragmentation-on-large-server-with-xfs-20-disks-and-ceph
>> but unfortunately there is no perf info there, so I cannot say whether
>> the issue is the same or not.
>
> Well, it would be useful to find out what is doing the high-order
> allocations. With 'perf record -g -a' and then 'perf report -g' you can
> determine the call stack. Order and allocation flags can be captured by
> enabling the page_alloc tracepoint.

Thanks, please give me some time to go through the testing iterations
and I'll collect the appropriate perf.data (the commands I plan to use
are sketched at the bottom of this mail).

>
>>>
>>>> https://lkml.org/lkml/2012/6/27/545, but it looks like the
>>>> significant decrease of bdi max_ratio did not help even a bit.
>>>> Although I have approximately half of physical memory available for
>>>> cache-like stuff, the problem with mm persists, so I would like to
>>>> try suggestions from other people. In the current testing iteration
>>>> I have decreased vfs_cache_pressure to 10 and raised vm.dirty_ratio
>>>> and vm.dirty_background_ratio to 15 and 10 respectively (because the
>>>> default values are too spiky for my workloads). The host kernel is a
>>>> linux-stable 3.10.
>>>
>>> Well, I'm glad to hear it's not 3.18-rc3 this time. But I would
>>> recommend trying it, or at least 3.17. Lots of patches have gone in
>>> since 3.10 to reduce compaction overhead (especially for transparent
>>> hugepages).
>>
>> Heh, I should say that I am limited to pushing knobs on 3.10, because
>> it has a well-known set of problems and any major version switch will
>> lead to months-long QA procedures, but I may try that if none of my
>> knob selections help. I am not a THP user; the problem is happening
>> with regular 4k pages and almost default VM settings. Also, it is
>> worth mentioning
>
> OK, that's useful to know. So it might be some driver (do you also have
> mellanox?) or maybe SLUB (do you have it enabled?) that is trying
> high-order allocations.

Yes, I am using the mellanox transport there and the SLUB allocator, as
SLAB had some issues with allocations and uneven node fill-up on the
two-node systems I am primarily using.
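If it would help to narrow down the SLUB side, I can also dump which
caches are configured with a higher page order on these hosts; with the
SLUB sysfs interface that should be roughly the following (just what I
intend to run, not verified output):

    # list SLUB caches sorted by their configured page order
    grep . /sys/kernel/slab/*/order | sort -t: -k2 -rn | head -n 20

I will attach that output together with the perf data.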
>
>> that the kernel messages are not complaining about allocation
>> failures, as in the case in the URL above; compaction just tightens up
>> to some limit
>
> Without the warnings, we need tracing/profiling to find out what's
> causing it.
>
>> and (after it has 'locked' the system for a couple of minutes,
>> reducing actual I/O and the resulting amount of memory operations) it
>> goes back to normal. A cache flush fixes this in just a moment, and so
>> should plenty of headroom in
>
> That could perhaps suggest poor coordination between reclaim and
> compaction, made worse by the fact that there are more parallel ongoing
> attempts and the watermark checking doesn't take that into account.
>
>> min_free_kbytes. Over a couple of days, depending on which nodes with
>> which settings the issue reappears on, I will be able to judge whether
>> my ideas were wrong.
>>
>>>
>>>> Non-default VM settings are:
>>>> vm.swappiness = 5
>>>> vm.dirty_ratio = 10
>>>> vm.dirty_background_ratio = 5
>>>> bdi_max_ratio was 100%, right now 20%; at a glance it looks like the
>>>> situation has worsened, because an unstable OSD host causes a
>>>> domino-like effect on other hosts, which start to flap too, and only
>>>> a cache flush via drop_caches helps.
>>>>
>>>> Unfortunately there is no slab info from the "exhausted" state due
>>>> to the sporadic nature of this bug; I will try to catch it next time.
>>>>
>>>> slabtop (normal state):
>>>>  Active / Total Objects (% used)    : 8675843 / 8965833 (96.8%)
>>>>  Active / Total Slabs (% used)      : 224858 / 224858 (100.0%)
>>>>  Active / Total Caches (% used)     : 86 / 132 (65.2%)
>>>>  Active / Total Size (% used)       : 1152171.37K / 1253116.37K (91.9%)
>>>>  Minimum / Average / Maximum Object : 0.01K / 0.14K / 15.75K
>>>>
>>>>    OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>>> 6890130 6889185  99%    0.10K 176670       39    706680K buffer_head
>>>>  751232  721707  96%    0.06K  11738       64     46952K kmalloc-64
>>>>  251636  226228  89%    0.55K   8987       28    143792K radix_tree_node
>>>>  121696   45710  37%    0.25K   3803       32     30424K kmalloc-256
>>>>  113022   80618  71%    0.19K   2691       42     21528K dentry
>>>>  112672   35160  31%    0.50K   3521       32     56336K kmalloc-512
>>>>   73136   72800  99%    0.07K   1306       56      5224K Acpi-ParseExt
>>>>   61696   58644  95%    0.02K    241      256       964K kmalloc-16
>>>>   54348   36649  67%    0.38K   1294       42     20704K ip6_dst_cache
>>>>   53136   51787  97%    0.11K   1476       36      5904K sysfs_dir_cache
>>>>   51200   50724  99%    0.03K    400      128      1600K kmalloc-32
>>>>   49120   46105  93%    1.00K   1535       32     49120K xfs_inode
>>>>   30702   30702 100%    0.04K    301      102      1204K Acpi-Namespace
>>>>   28224   25742  91%    0.12K    882       32      3528K kmalloc-128
>>>>   28028   22691  80%    0.18K    637       44      5096K vm_area_struct
>>>>   28008   28008 100%    0.22K    778       36      6224K xfs_ili
>>>>   18944   18944 100%    0.01K     37      512       148K kmalloc-8
>>>>   16576   15154  91%    0.06K    259       64      1036K anon_vma
>>>>   16475   14200  86%    0.16K    659       25      2636K sigqueue
>>>>
>>>> zoneinfo (normal state, attached)
>>>>
>>>
>>
>
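As for the profiling itself, this is roughly what I plan to run on an
affected host while a stall is in progress (kmem:mm_page_alloc is my
guess at the page_alloc tracepoint you meant, and the 60-second window
is arbitrary), so please tell me if different events or a longer
capture would be more useful:

    # system-wide call stacks plus page allocation order/gfp_flags
    perf record -g -a -e cycles -e kmem:mm_page_alloc -- sleep 60
    perf report -g
    # raw tracepoint samples, to inspect the requested orders and flags
    perf script | grep mm_page_alloc | head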