On Sat, Nov 15, 2014 at 9:45 PM, Vlastimil Babka <vbabka@xxxxxxx> wrote:
> On 11/15/2014 06:10 PM, Andrey Korolyov wrote:
>> On Sat, Nov 15, 2014 at 7:32 PM, Vlastimil Babka <vbabka@xxxxxxx> wrote:
>>> On 11/15/2014 12:48 PM, Andrey Korolyov wrote:
>>>> Hello,
>>>>
>>>> I recently found that the OSD daemons can, under certain conditions
>>>> (moderate VM pressure, moderate I/O, slightly altered VM settings),
>>>> go into a loop involving isolate_freepages and effectively hurt Ceph
>>>> cluster performance. I found this thread
>>>
>>> Do you feel it is a regression, compared to some older kernel version or something?
>>
>> No, it's rare but very concerning. The higher the pressure is, the
>> greater the chance of hitting this particular issue, although the
>> absolute numbers are still very large (e.g. plenty of room for cache
>> memory). Some googling also turned up a similar question on SF:
>> http://serverfault.com/questions/642883/cause-of-page-fragmentation-on-large-server-with-xfs-20-disks-and-ceph
>> but unfortunately there is no perf info there, so I cannot say whether
>> the issue is the same or not.
>
> Well, it would be useful to find out what's doing the high-order allocations.
> With 'perf record -g -a' and then 'perf report -g', determine the call stack. Order and
> allocation flags can be captured by enabling the page_alloc tracepoint.

Thanks, please give me some time to go through the testing iterations, so
I'll collect the appropriate perf.data.
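A rough sketch of how that data could be collected (assuming perf and
debugfs/ftrace are available on the 3.10 host; the 60-second windows and
output paths below are illustrative, not from the thread):

    # system-wide call-graph profile while the stall is reproducing
    perf record -g -a -- sleep 60
    perf report -g

    # allocation order and GFP flags via the kmem:mm_page_alloc tracepoint
    # (high volume on a busy host, so keep the window short)
    perf record -e kmem:mm_page_alloc -a -- sleep 60
    perf script > /tmp/page_alloc.txt

    # plain ftrace alternative for the same tracepoint
    echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/enable
    cat /sys/kernel/debug/tracing/trace_pipe > /tmp/page_alloc.trace

    # the compaction tracepoints, if present on this kernel, also show
    # isolate_freepages activity directly
    perf record -e 'compaction:*' -a -- sleep 60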
>
>>>
>>>> https://lkml.org/lkml/2012/6/27/545, but it looks like the
>>>> significant decrease of bdi max_ratio did not help even a bit.
>>>> Although I have approximately half of physical memory for cache-like
>>>> stuff, the problem with mm persists, so I would like to try
>>>> suggestions from other people. In the current testing iteration I
>>>> decreased vfs_cache_pressure to 10 and raised vm dirty_ratio and
>>>> background ratio to 15 and 10 correspondingly (because the default
>>>> values are too spiky for my workloads). The host kernel is a
>>>> linux-stable 3.10.
>>>
>>> Well, I'm glad to hear it's not 3.18-rc3 this time. But I would recommend trying
>>> it, or at least 3.17. A lot of patches to reduce compaction overhead
>>> (especially for transparent hugepages) went in since 3.10.
>>
>> Heh, I should say that I am limited to pushing knobs on 3.10, because it has
>> a well-known set of problems and any major version switch would lead to
>> months-long QA procedures, but I may try that if none of my knob
>> choices helps. I am not a THP user; the problem is happening with
>> regular 4k pages and almost default VM settings. Also it is worth mentioning
>
> OK, that's useful to know. So it might be some driver (do you also have
> mellanox?) or maybe SLUB (do you have it enabled?) that is trying high-order allocations.

Yes, I am using Mellanox transport there and the SLUB allocator, as SLAB
had some issues with allocations on uneven node fill-up on the two-head
systems which I am primarily using.

>
>> that the kernel messages are not complaining about allocation failures, as
>> in the case in the URL above; compaction just tightens up to some limit
>
> Without the warnings, that's why we need tracing/profiling to find out what's
> causing it.
>
>> and (after it has 'locked' the system for a couple of minutes, reducing actual
>> I/O and the derived amount of memory operations) it goes back to normal.
>> A cache flush fixes this in a moment, and so should a large enough
>
> That could perhaps suggest poor coordination between reclaim and compaction,
> made worse by the fact that there are more parallel ongoing attempts and the
> watermark checking doesn't take that into account.
>
>> min_free_kbytes. Over a couple of days, depending on which nodes with
>> which settings the issue reappears on, I will be able to judge whether my
>> ideas were wrong.
>>
>>>
>>>> Non-default VM settings are:
>>>> vm.swappiness = 5
>>>> vm.dirty_ratio=10
>>>> vm.dirty_background_ratio=5
>>>> bdi_max_ratio was 100%, right now 20%; at a glance it looks like the
>>>> situation has worsened, because an unstable OSD host causes a domino-like
>>>> effect on other hosts, which start to flap too, and only a cache flush
>>>> via drop_caches helps.
>>>>
>>>> Unfortunately there is no slab info from the "exhausted" state due to the
>>>> sporadic nature of this bug; I will try to catch it next time.
>>>>
>>>> slabtop (normal state):
>>>>  Active / Total Objects (% used)    : 8675843 / 8965833 (96.8%)
>>>>  Active / Total Slabs (% used)      : 224858 / 224858 (100.0%)
>>>>  Active / Total Caches (% used)     : 86 / 132 (65.2%)
>>>>  Active / Total Size (% used)       : 1152171.37K / 1253116.37K (91.9%)
>>>>  Minimum / Average / Maximum Object : 0.01K / 0.14K / 15.75K
>>>>
>>>>    OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>>> 6890130 6889185  99%    0.10K 176670       39    706680K buffer_head
>>>>  751232  721707  96%    0.06K  11738       64     46952K kmalloc-64
>>>>  251636  226228  89%    0.55K   8987       28    143792K radix_tree_node
>>>>  121696   45710  37%    0.25K   3803       32     30424K kmalloc-256
>>>>  113022   80618  71%    0.19K   2691       42     21528K dentry
>>>>  112672   35160  31%    0.50K   3521       32     56336K kmalloc-512
>>>>   73136   72800  99%    0.07K   1306       56      5224K Acpi-ParseExt
>>>>   61696   58644  95%    0.02K    241      256       964K kmalloc-16
>>>>   54348   36649  67%    0.38K   1294       42     20704K ip6_dst_cache
>>>>   53136   51787  97%    0.11K   1476       36      5904K sysfs_dir_cache
>>>>   51200   50724  99%    0.03K    400      128      1600K kmalloc-32
>>>>   49120   46105  93%    1.00K   1535       32     49120K xfs_inode
>>>>   30702   30702 100%    0.04K    301      102      1204K Acpi-Namespace
>>>>   28224   25742  91%    0.12K    882       32      3528K kmalloc-128
>>>>   28028   22691  80%    0.18K    637       44      5096K vm_area_struct
>>>>   28008   28008 100%    0.22K    778       36      6224K xfs_ili
>>>>   18944   18944 100%    0.01K     37      512       148K kmalloc-8
>>>>   16576   15154  91%    0.06K    259       64      1036K anon_vma
>>>>   16475   14200  86%    0.16K    659       25      2636K sigqueue
>>>>
>>>> zoneinfo (normal state, attached)
>>>
>>
>
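For reference, a minimal sketch of how the non-default knobs discussed
above are usually applied at runtime. The values shown are the ones from
the "Non-default VM settings" list plus the vfs_cache_pressure change from
the later testing iteration; the bdi device id 8:0 is an illustrative
assumption, not taken from the thread:

    # VM knobs as quoted above
    sysctl -w vm.swappiness=5
    sysctl -w vm.dirty_ratio=10
    sysctl -w vm.dirty_background_ratio=5
    sysctl -w vm.vfs_cache_pressure=10

    # per-backing-device writeback limit, in percent; 8:0 is an example device
    echo 20 > /sys/class/bdi/8:0/max_ratio

    # the emergency cache flush used in the thread (drops only clean page
    # cache, dentries and inodes; dirty data is untouched, hence the sync)
    sync
    echo 3 > /proc/sys/vm/drop_caches

To persist the sysctls across reboots they would normally go into
/etc/sysctl.conf or a file under /etc/sysctl.d/.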
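Since the "exhausted" state has so far been too sporadic to catch with
slabtop, a minimal watchdog sketch along these lines could snapshot the
relevant /proc files periodically so the state can be inspected after the
fact (the output directory, interval and file names are illustrative
assumptions):

    #!/bin/sh
    # Periodically snapshot MM state; all paths and the interval are examples.
    OUT=/var/log/mm-snapshots
    mkdir -p "$OUT"
    while sleep 30; do
        ts=$(date +%Y%m%d-%H%M%S)
        # compaction counters in /proc/vmstat hint at how often allocations
        # fall into direct compaction (isolate_freepages runs under this path)
        grep -E '^compact_' /proc/vmstat > "$OUT/vmstat-$ts"
        cat /proc/buddyinfo              > "$OUT/buddyinfo-$ts"
        cat /proc/zoneinfo               > "$OUT/zoneinfo-$ts"
        cat /proc/slabinfo               > "$OUT/slabinfo-$ts"   # root only
    done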