On Sat, May 20, 2023, at 04:53, Guo Ren wrote:
> On Sat, May 20, 2023 at 4:20 AM Arnd Bergmann <arnd@xxxxxxxx> wrote:
>> On Thu, May 18, 2023, at 15:09, guoren@xxxxxxxxxx wrote:
>>
>> I've tried to run the same numbers for the debate about running
>> 32-bit vs 64-bit arm kernels in the past, but focused mostly on
>> slightly larger systems; I looked mainly at the 512MB case,
>> as that is the most cost-efficient DDR3 memory configuration
>> and fairly common.
>
> 512MB is extravagant, in my opinion. In the IPC market, 32/64MB is for
> 480P/720P/1080p, 128/256MB is for 1080p/2k, and 512/1024MB is for 4K.
> 512MB chips is less than 5% of the total (I guess). Even in 512MB
> chips, the additional memory is for the frame buffer, not the Linux
> system.

This depends a lot on the target application of course. For a phone
or NAS box, 512MB is probably the lower limit. What I observe in
arch/arm/ devicetree submissions, in board-db.org, and when looking at
industrial Arm board vendor websites is that 512MB is the most common
configuration, and I think 1GB is still more common than 256MB even
for 32-bit machines.

There is of course a difference between the number of individual
products and the number of machines shipped in a given configuration,
and I guess you have a good point that the cheapest ones are also the
ones that ship in the highest volume.

>> What I'd like to understand better in your example is where
>> the 14MB of memory went. I assume this is for 128MB of total
>> RAM, so we know that 1MB went into additional 'struct page'
>> objects (32 bytes * 32768 pages). It would be good to know
>> where the dynamic allocations went and if they are reclaimable
>> (e.g. inodes) or non-reclaimable (e.g. kmalloc-128).
>>
>> For the vmlinux size, is this already a minimal config
>> that one would run on a board with 128MB of RAM, or a
>> defconfig that includes a lot of stuff that is only relevant
>> for other platforms but also grows on 64-bit?
>
> It's not a minimal config, it's defconfig. So I say it's a rough
> measurement :)
>
> I admit I wanted a little bit to exaggerate it, but that's the
> starting point for cutting down memory usage for most people, right?
> During the past year, we have been convincing our customers to use
> s64lp64 + u32ilp32, but they can't tolerate even 1% additional memory
> cost in 64MB/128MB scenarios and then chose cortex-a7/a35, which can
> run 32-bit Linux. I think it's too early to talk about throwing 32-bit
> Linux into the garbage, not only for the reason of memory footprint
> but also for the ingrained opinion of the people. Changing their mind
> needs a long time.
>
>> What do you see in /proc/slabinfo, /proc/meminfo, and
>> 'size vmlinux' for the s64ilp32 and s64lp64 kernels here?
>
> Both s64ilp32 & s64lp64 use the same u32ilp32_rootfs.ext2 binary and
> the same opensbi binary.
> All are opensbi(2MB) + Linux(126MB) memory layout.
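
(A quick cross-check before looking at the numbers below: for 128MB of
RAM and 4KB pages, the 'struct page' difference alone should account
for roughly 1MB, assuming 64 bytes per page on lp64 vs 32 bytes on
ilp32 as in my estimate quoted above; anything beyond that has to come
from the kernel image or from dynamic allocations:

$ # pages * extra bytes per 'struct page', in KB
$ echo $(( (128 * 1024 * 1024 / 4096) * (64 - 32) / 1024 ))
1024
)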
>
> Here is the result:
>
> s64ilp32:
> [    0.000000] Virtual kernel memory layout:
> [    0.000000]   fixmap  : 0x9ce00000 - 0x9d000000   (2048 kB)
> [    0.000000]   pci io  : 0x9d000000 - 0x9e000000   (  16 MB)
> [    0.000000]   vmemmap : 0x9e000000 - 0xa0000000   (  32 MB)
> [    0.000000]   vmalloc : 0xa0000000 - 0xc0000000   ( 512 MB)
> [    0.000000]   lowmem  : 0xc0000000 - 0xc7e00000   ( 126 MB)
> [    0.000000] Memory: 97748K/129024K available (8699K kernel code,
> 8867K rwdata, 4096K rodata, 4204K init, 361K bss, 31276K reserved, 0K
> cma-reserved)

Ok, so it saves only a little bit on .text/.init/.bss/.rodata, but
there is a 4MB difference in rwdata, and a total of 10.4MB difference
in "reserved" size, which I think includes all of the above plus the
mem_map[] array. For comparison, the s64lp64 kernel reported:

  89380K/131072K available (8638K kernel code, 4979K rwdata, 4096K
  rodata, 2191K init, 477K bss, 41692K reserved, 0K cma-reserved)

Oddly, I don't see anywhere close to 8MB in a riscv64 defconfig build
(linux-next, gcc-13), so I don't know where that comes from:

$ size -A build/tmp/vmlinux | sort -k2 -nr | head
Total               13518684
.text                8896058   18446744071562076160
.rodata              2219008   18446744071576748032
.data                 933760   18446744071583039488
.bss                  476080   18446744071584092160
.init.text            264718   18446744071572553728
__ksymtab_strings     183986   18446744071579214312
__ksymtab_gpl         122928   18446744071579091384
__ksymtab             109080   18446744071578982304
__bug_table            98352   18446744071583973248

> KReclaimable:        644 kB
> Slab:               4536 kB
> SReclaimable:        644 kB
> SUnreclaim:         3892 kB
> KernelStack:         344 kB

These look like the only notable differences in meminfo; the
corresponding s64lp64 numbers are:

  KReclaimable:       1092 kB
  Slab:               6900 kB
  SReclaimable:       1092 kB
  SUnreclaim:         5808 kB
  KernelStack:         688 kB

The largest chunk here is 2MB in non-reclaimable slab allocations, or
a 50% growth of those. The kernel stacks are doubled as expected, but
that's only 344KB; similarly for the reclaimable slabs.

> # cat /proc/slabinfo
> slabinfo - version: 2.1
> # name            <active_objs> <num_objs> <objsize> <objperslab>
> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> :
> slabdata <active_slabs> <num_slabs> <sharedavail>
> ext4_groupinfo_1k     28     28    144   28    1 : tunables    0    0
> 0 : slabdata      1      1      0
> p9_req_t               0      0    104   39    1 : tunables    0    0

Did you perhaps miss a few lines while pasting these? It seems odd
that some caches only show up in the ilp32 case (proc_dir_entry,
jbd2_journal_handle, buffer_head, biovec_max, anon_vma_chain, ...)
and some others are only in the lp64 case (UNIX, ext4_prealloc_space,
files_cache, filp, ip_fib_alias, task_struct, uid_cache, ...).

Looking at the ones that are in both and have the largest size
increase (total KB, cache name, number of objects, object size),
I see

# lp64
  1788 kernfs_node_cache     14304    128
   590 shmem_inode_cache       646    936
   272 inode_cache             360    776
   153 ext4_inode_cache        105   1496
   250 dentry                 1188    216
   192 names_cache              48   4096
   199 radix_tree_node         350    584
   307 kmalloc-64             4912     64
    60 kmalloc-128             480    128
    47 kmalloc-192             252    192
   204 kmalloc-256             816    256
    72 kmalloc-512             144    512
   840 kmalloc-1k              840   1024

# ilp32
  1197 kernfs_node_cache     13938     88
   373 shmem_inode_cache       637    600
   174 inode_cache             360    496
    84 ext4_inode_cache         88    984
   177 dentry                 1196    152
    32 names_cache               8   4096
   100 radix_tree_node         338    304
   331 kmalloc-64             5302     64
   132 kmalloc-128            1056    128
    23 kmalloc-192             126    192
    16 kmalloc-256              64    256
   428 kmalloc-512             856    512
    88 kmalloc-1k               88   1024

So sysfs (kernfs_node_cache) has the largest chunk of the 2MB
non-reclaimable slab, grown 50% from 1.2MB to 1.8MB. In some cases,
this could be avoided entirely by turning off sysfs, but most users
can't do that.
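
(For reference, the per-cache totals in the two tables above can be
reproduced with something along these lines, assuming the standard
slabinfo 2.1 column layout shown in the quoted header:

$ # total KB per cache = <num_objs> * <objsize> / 1024, largest first
$ awk '$1 !~ /^#/ && NF > 10 { print int($3*$4/1024), $1, $3, $4 }' \
      /proc/slabinfo | sort -nr | head
)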
shmem_inode_cache is probably mostly devtmpfs; the other inode caches
are smaller and likely reclaimable.

It's interesting how the largest of the kmalloc-* caches ends up being
kmalloc-1k (840 1K objects) on lp64, but kmalloc-512 (856 512B objects)
on ilp32. My guess is that the majority of this is from a single
callsite that has an allocation growing just beyond 512B. This alone
seems significant enough to need further investigation; I would hope
we can completely avoid these by adding a custom slab cache. I don't
see this effect on an arm64 boot though; for me, the 512B allocations
are much higher than the 1K ones.

Maybe you can identify the culprit using the boot-time traces as
listed in https://elinux.org/Kernel_dynamic_memory_analysis#Dynamic

That might help everyone running a 64-bit kernel on low-memory
configurations, though it would of course slightly weaken your
argument for an ilp32 kernel ;-)

     Arnd
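
P.S. A rough sketch of the kind of tracing meant above, assuming the
kmem trace events are available in the config (the tracefs path and
field names may need adjusting on older kernels):

$ # match everything that ends up in kmalloc-1k (bytes_alloc is the
$ # rounded-up allocation size), then count the call sites
$ echo 'bytes_alloc == 1024' > /sys/kernel/tracing/events/kmem/kmalloc/filter
$ echo 1 > /sys/kernel/tracing/events/kmem/kmalloc/enable
$ grep -o 'call_site=[^ ]*' /sys/kernel/tracing/trace | sort | uniq -c | sort -nr | head

For allocations done before user space is up, the same event can be
enabled from the kernel command line with trace_event=kmem:kmalloc.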