I have a question about Linux behaviour that I currently do not understand.

This is the situation: the server has 1T of memory, of which 700G is allocated to hugepages (hugepage size 1G). This leaves 300G of memory in small pages, to which I assume Linux applies its general memory management behaviour. Of the small-page memory, I see more than 250G classified as file memory, and only about 15G allocated to anon (anonymous memory).

The load on the server is caused by a postgres database instance with on average 80 active sessions, of which a varying number are performing read and write IO. Postgres performs buffered reads and writes using the pread64() and pwrite64() calls, and always performs IO using an IO size of 8KB. However, it should be noted that postgres can also use posix_fadvise() with POSIX_FADV_WILLNEED to make the OS read blocks ahead. There might also be independent asynchronous IO via direct path, but I have not been informed how exactly that works. That IO might target the same postgres files that the regular pread64() and pwrite64() calls operate on, but those asynchronous calls are not part of open source postgres.

The amount of IO taking place is also noteworthy: using the iotop utility I can see both total and actual disk reads and writes going up to 3 GB/s for reads and up to 500 MB/s for writes.

The question I have is why Linux chooses to swap, despite having lots of file memory which it reports (via MemAvailable) as available. I need more tools on this machine, but I do not have the impression that the swapping is severely affecting sessions, although top (with the swap field added) shows that every postgres database process has swapped-out memory. It also does not seem healthy to have pages continuously being swapped in and out.

Thank you.
Frits Hoogland

The filesystem is XFS, mount options noatime, inode64, nodiratime, nodev.
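As a rough way to put numbers on the per-process swap that top reports, the VmSwap field in /proc/<pid>/status can be read for each process. The following is a minimal sketch, not a tool that was run on this machine; the function names and the "postgres" name filter are assumptions made for illustration.

```python
import os
import re


def parse_vmswap_kb(status_text):
    """Return VmSwap in kB from /proc/<pid>/status text, or 0 if the field is absent."""
    match = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def swap_by_process(proc_root="/proc", name_filter="postgres"):
    """Map pid -> swapped-out kB for processes whose Name contains name_filter."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                text = f.read()
        except OSError:
            continue  # process exited or is not readable
        name = re.search(r"^Name:\s+(.*)$", text, re.MULTILINE)
        if name and name_filter in name.group(1):
            result[int(entry)] = parse_vmswap_kb(text)
    return result


if __name__ == "__main__":
    # Largest swap users first.
    for pid, kb in sorted(swap_by_process().items(), key=lambda kv: -kv[1]):
        print(f"{pid:>8} {kb:>12} kB")
```

Sampling this a few times would show whether the same processes keep the same amount swapped out, or whether the amounts move around.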
Operating system: Red Hat Enterprise Linux 8.9
Kernel: 4.18.0-513.5.1.el8_9.x86_64 #1 SMP

/proc/meminfo

MemTotal: 1055737556 kB
MemFree: 2213440 kB
MemAvailable: 298459084 kB
Buffers: 1340 kB
Cached: 286127736 kB
SwapCached: 837032 kB
Active: 29872372 kB
Inactive: 269357644 kB
Active(anon): 3788932 kB
Inactive(anon): 8167304 kB
Active(file): 26083440 kB
Inactive(file): 261190340 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 16777212 kB
SwapFree: 8386212 kB
Dirty: 5311136 kB
Writeback: 0 kB
AnonPages: 12670468 kB
Mapped: 138672 kB
Shmem: 64104 kB
KReclaimable: 11589948 kB
Slab: 13762220 kB
SReclaimable: 11589948 kB
SUnreclaim: 2172272 kB
KernelStack: 38240 kB
PageTables: 266476 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 324791060 kB
Committed_AS: 37860836 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 481952 kB
VmallocChunk: 0 kB
Percpu: 97792 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
HugePages_Total: 704
HugePages_Free: 366
HugePages_Rsvd: 2
HugePages_Surp: 0
Hugepagesize: 1048576 kB
Hugetlb: 738197504 kB
DirectMap4k: 17853212 kB
DirectMap2M: 302462976 kB
DirectMap1G: 752877568 kB

vmstat 1 10

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b    swpd    free  buff     cache   si   so      bi     bo     in     cs us sy id wa st
42  6 8424412 1743776  1340 299215616    8   11    6786    911      0      0 13  3 82  2  0
29 12 8428456 2678092  1340 298592000 1220 6092 2149152 249176 466196 366703 27  6 62  5  0
32  8 8427984 1701916  1340 299545600 1788 1832 2020852 316652 366820 309808 23  5 67  5  0
41  4 8430328 1702864  1340 299411136  960 4136 2228240 263160 433730 368056 24  6 64  6  0
49  4 8402344 1724772  1340 299495040 1392 6320 2435296 303464 463479 368963 23  7 64  6  0
33  5 8401056 1757348  1340 299520960 1788 4088 2107472 248576 395061 350817 23  9 63  5  0
30 11 8403788 1721484  1340 299539776  560 4012 2237708 229508 426055 384332 25  6 65  5  0
36 10 8409792 1904800  1340 299274848  364 6772 2192364 294444 428661 390878 25  6 64  6  0
33  8 8415368 1804560  1340 299320800  616 7112 2195100 272072 447890 398957 26  6 61  6  0
39  3 8386444 1885732  1340 299333088 2032 7172 2163180 266672 459675 419805 26  7 61  7  0

swapon -s

Filename   Type       Size      Used     Priority
/dev/dm-1  partition  16777212  8405472  -2

sysctl -a | grep ^vm

vm.admin_reserve_kbytes = 8192
vm.block_dump = 0
vm.compact_unevictable_allowed = 1
vm.compaction_proactiveness = 0
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 43200
vm.drop_caches = 0
vm.extfrag_threshold = 500
vm.force_cgroup_v2_swappiness = 0
vm.hugetlb_shm_group = 32022
vm.laptop_mode = 0
vm.legacy_va_layout = 0
vm.lowmem_reserve_ratio = 256 256 32 0 0
vm.max_map_count = 500000
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.min_free_kbytes = 71274
vm.min_slab_ratio = 5
vm.min_unmapped_ratio = 1
vm.mmap_min_addr = 4096
vm.mmap_rnd_bits = 28
vm.mmap_rnd_compat_bits = 8
vm.nr_hugepages = 704
vm.nr_hugepages_mempolicy = 704
vm.nr_overcommit_hugepages = 0
vm.numa_stat = 1
vm.numa_zonelist_order = Node
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 2
vm.overcommit_ratio = 97
vm.page-cluster = 3
vm.page_lock_unfairness = 5
vm.panic_on_oom = 0
vm.percpu_pagelist_fraction = 0
vm.stat_interval = 1
vm.swappiness = 1
vm.user_reserve_kbytes = 131072
vm.vfs_cache_pressure = 100
vm.watermark_boost_factor = 15000
vm.watermark_scale_factor = 10
vm.zone_reclaim_mode = 0
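The small-page figures quoted in the question can be derived from /proc/meminfo. The sketch below is illustrative only: it assumes MemTotal includes the hugetlb pool and that the kernel exposes the Hugetlb field (as the dump above does), and the function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key: value [kB]' lines into a dict of ints."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def small_page_summary(info):
    """Summarise memory outside the hugetlb pool; all values are in kB."""
    return {
        # Assumption: MemTotal counts the hugetlb pool, so subtract it.
        "non_hugepage_kb": info["MemTotal"] - info["Hugetlb"],
        "file_lru_kb": info["Active(file)"] + info["Inactive(file)"],
        "anon_lru_kb": info["Active(anon)"] + info["Inactive(anon)"],
        "swap_used_kb": info["SwapTotal"] - info["SwapFree"],
    }


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        summary = small_page_summary(parse_meminfo(f.read()))
    for name, kb in summary.items():
        print(f"{name:<16} {kb / 1024 / 1024:8.1f} GiB")
```

Running this at intervals next to vmstat would show whether the file LRU shrinks at all while swap-out is happening.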