On Wed, Nov 08, 2023 at 03:56:00PM +0000, Per Förlin wrote:
> Hi Linux XFS community,
>
> Please bear with me, I'm new to XFS :)
>
> I'm comparing how EXT4 and XFS behave on systems with a relatively
> small RAM vs storage ratio. The current focus is on FS repair memory
> consumption.
>
> I have been running some tests using the max_mem_specified option.
> The "-m" (max_mem_specified) parameter does not guarantee success,
> but it surely helps to reduce the memory load; in comparison to EXT4
> this is an improvement.
>
> My question concerns the relation between "-P" (disable prefetch)
> and "-m" (max_mem_specified).
>
> There is a difference in xfs_repair memory consumption between the
> following commands:
> 1. xfs_repair -P -m 500
> 2. xfs_repair -m 500
>
> 1) Exceeds the max_mem_specified limit
> 2) Stays below the max_mem_specified limit

Purely co-incidental, IMO. As the man page says:

       -m maxmem
              Specifies the approximate maximum amount of memory, in
              megabytes, to use for xfs_repair. xfs_repair has its own
              internal block cache which will scale out up to the
              lesser of the process's virtual address limit or about
              75% of the system's physical RAM. This option overrides
              these limits.

              NOTE: These memory limits are only approximate and may
              use more than the specified limit.

IOWs, behaviour is expected - the max_mem figure is just a starting
point guideline, and it only affects the size of the IO cache that
repair holds. We still need lots of memory to index free space, used
space, inodes, hold directory information, etc, so memory usage on any
filesystem with enough metadata in it to fill the internal buffer
cache will always go over this number....

> I expected disabled prefetch to reduce the memory load but instead
> the result is the opposite.
> The two commands 1) and 2) are being executed in the same system.
> My speculation:
> Does the prefetch facilitate and improve the calculation of the
> memory consumption and make it more accurate?

No, prefetching changes the way processing of the metadata occurs. It
also vastly changes the way IO is done and the buffer cache is
populated.

e.g. prefetching looks at metadata density and issues large IOs if the
density is high enough, and then chops them up into individual
metadata buffers in memory at prefetch IO completion. This avoids
doing lots of small IOs, greatly improving IO throughput and keeping
the processing pipeline busy. This comes at the cost of increased CPU
overhead and non-buffer cache memory footprint, but for slow IO
devices this can improve IO throughput (and hence repair times) by a
factor of up to 100x. Have a look at the difference in IO patterns
when you enable/disable prefetching...

When prefetching is turned off, the processing issues individual IOs
itself and doesn't do density-based scan optimisation. In some cases
this is faster (e.g. high speed SSDs) because it is more CPU
efficient, but it results in different IO patterns and buffer access
patterns.

The end result is that buffers have a very different lifetime when
prefetching is turned on compared to when it is off, and so there's a
very different buffer cache memory footprint between the two options.
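To make the density-based prefetch idea concrete, here is a
simplified, standalone sketch of it - this is not xfs_repair's actual
prefetch code, and the names, window size and density threshold are
invented purely for illustration:

/*
 * Illustrative sketch only: scan a window of block addresses; if
 * enough of the window is metadata, read the whole window in one
 * large IO and carve it into per-block buffers on completion,
 * otherwise fall back to small individual IOs.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLKSZ        4096
#define WINDOW       64      /* blocks per prefetch window (256kB) */
#define DENSITY_PCT  60      /* issue one large IO above this density */

/* Stub IO: a real implementation would pread() from the device. */
static int read_blocks(uint64_t blkno, int nblocks, unsigned char *dst)
{
	memset(dst, 0, (size_t)nblocks * BLKSZ);
	printf("IO: %d block(s) at %llu\n", nblocks,
	       (unsigned long long)blkno);
	return 0;
}

/* Stub cache insert: a real implementation would hash the buffer. */
static void cache_insert(uint64_t blkno, const unsigned char *data)
{
	(void)data;
	printf("  cached block %llu\n", (unsigned long long)blkno);
}

static void prefetch_window(uint64_t start, const int *is_meta)
{
	static unsigned char big[WINDOW * BLKSZ];
	int i, nmeta = 0;

	for (i = 0; i < WINDOW; i++)
		nmeta += is_meta[i];

	if (nmeta * 100 >= WINDOW * DENSITY_PCT) {
		/*
		 * Dense window: one large IO, chopped into per-block
		 * buffers at completion. Few IOs, but a whole window's
		 * worth of buffers lands in the cache at once.
		 */
		if (read_blocks(start, WINDOW, big) == 0)
			for (i = 0; i < WINDOW; i++)
				if (is_meta[i])
					cache_insert(start + i,
						big + (size_t)i * BLKSZ);
	} else {
		/* Sparse window: small individual IOs, metadata only. */
		for (i = 0; i < WINDOW; i++)
			if (is_meta[i] &&
			    read_blocks(start + i, 1, big) == 0)
				cache_insert(start + i, big);
	}
}

int main(void)
{
	int dense[WINDOW], sparse[WINDOW], i;

	for (i = 0; i < WINDOW; i++) {
		dense[i]  = (i % 4 != 3);  /* ~75% metadata: one big IO */
		sparse[i] = (i % 8 == 0);  /* ~12% metadata: small IOs  */
	}
	prefetch_window(0, dense);
	prefetch_window(WINDOW, sparse);
	return 0;
}

The point of the sketch is the trade-off described above: the dense
path does one IO per window instead of dozens, at the cost of
materialising many buffers in memory before the processing thread has
asked for them.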
> Here follows output with -P and without -P from the same system.
> I have extracted the part that actually differs.
> The full logs are available at the bottom of this email.
>
> # -P -m 500 #
> Phase 3 - for each AG...
> ...
> Active entries = 12336
> Hash table size = 1549
> Hits = 1
> Misses = 224301
> Hit ratio = 0.00
> MRU 0 entries = 12335 ( 99%)
> MRU 1 entries = 0 ( 0%)
> MRU 2 entries = 0 ( 0%)
> MRU 3 entries = 0 ( 0%)
> MRU 4 entries = 0 ( 0%)
> MRU 5 entries = 0 ( 0%)
> MRU 6 entries = 0 ( 0%)
> MRU 7 entries = 0 ( 0%)

Without prefetching, we have a single use for all buffers and the
metadata accessed is 20x larger than the size of the buffer cache
(220k vs 12k for the cache size). This is just showing that the
non-prefetch case is streaming buffers through the cache in processing
access order.

i.e. the MRU list indicates that nothing is being kept for long
periods or being accessed out of order, as all buffers are on list 0
(most recently used). i.e. nothing is aging out, which means buffers
are being used and reclaimed in the same order they are being
instantiated. If anything was being accessed out of order, we would
see buffers move down the aging lists....

> # -m 500 #
> Phase 3 - for each AG...
> ...
> Active entries = 12388
> Hash table size = 1549
> Hits = 220459
> Misses = 235388
> Hit ratio = 48.36

And there's the difference - two accesses per buffer for the prefetch
case. One for the IO dispatch to bring it into memory (the miss) and
one for processing (the hit).

> MRU 0 entries = 2 ( 0%)
> MRU 1 entries = 0 ( 0%)
> MRU 2 entries = 1362 ( 10%)
> MRU 3 entries = 68 ( 0%)
> MRU 4 entries = 10 ( 0%)
> MRU 5 entries = 6097 ( 49%)
> MRU 6 entries = 4752 ( 38%)
> MRU 7 entries = 96 ( 0%)

And the MRU list shows how the buffer accesses are not uniform - we
are seeing buffers of all different ages in the cache. This shows that
buffers are being aged 5-6 times before they are getting used, which
means the cache size is almost too small for prefetch to work
effectively....

Actually, the cache is too small - cache misses significantly
outnumber cache hits, meaning some buffers are being fetched from disk
twice because the prefetched buffers are aging out before the
processing thread gets to them. Give xfs_repair ~5GB of RAM and it
should only need to do a single IO pass in phase 3, and then phases 4
and 6 will hit the buffers in the cache and hence not need to do any
IO at all...

So to me, this is prefetch working as it should - it's bringing
buffers into cache in the optimal IO pattern rather than the
application level access pattern. The difference in memory footprint
compared to no prefetching is largely co-incidental and really not
something we are concerned about in any way...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
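As a back-of-the-envelope check on the "-m 500" (prefetch enabled)
cache statistics quoted in the mail above, using only the numbers
reported in that output:

    Hit ratio = Hits / (Hits + Misses)
              = 220459 / (220459 + 235388)
              = 220459 / 455847
              ~ 48.4%

With exactly one miss (the prefetch IO) and one hit (the processing
access) per buffer the ratio would be 50%; the roughly 15000 extra
misses (235388 - 220459 = 14929) are the buffers that aged out of the
undersized cache and had to be read from disk a second time, which is
the re-fetching behaviour described in the mail.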