On Wed, Jun 26, 2024 at 12:49:09PM +0800, alexjlzheng@xxxxxxxxx wrote:
> From: Jinliang Zheng <alexjlzheng@xxxxxxxxxxx>
>
> When the contents of the xfs_log_vec/xfs_log_iovec combination are
> written to iclog, xfs_log_iovec loses its meaning in continuing to exist
> in memory, because iclog already has a copy of its contents. We only
> need to keep xfs_log_vec that takes up very little memory to find the
> xfs_log_item that needs to be added to AIL after we flush the iclog into
> the disk log space.
>
> Because xfs_log_iovec dominates most of the memory in the
> xfs_log_vec/xfs_log_iovec combination, retaining xfs_log_iovec until
> iclog is flushed into the disk log space and releasing together with
> xfs_log_vec is a significant waste of memory.

Have you measured this? Please provide numbers and the workload that
generates them, because when I built this combined structure the
measured memory usage and performance came out decisively on the side
of "almost no difference in memory usage, major performance cost to
doing a second allocation"...

Here's the logic - the iovec array is largely "free" with the larger
data allocation.

------

Look at how the heap is structured - it is in power-of-2 slab sizes:

$ grep kmalloc /proc/slabinfo |tail -13
kmalloc-8k           949    976   8192    4    8 : tunables    0    0    0 : slabdata    244    244      0
kmalloc-4k          1706   1768   4096    8    8 : tunables    0    0    0 : slabdata    221    221      0
kmalloc-2k          3252   3312   2048   16    8 : tunables    0    0    0 : slabdata    207    207      0
kmalloc-1k         76110  96192   1024   32    8 : tunables    0    0    0 : slabdata   3006   3006      0
kmalloc-512        71753  98656    512   32    4 : tunables    0    0    0 : slabdata   3083   3083      0
kmalloc-256        71006  71520    256   32    2 : tunables    0    0    0 : slabdata   2235   2235      0
kmalloc-192        10304  10458    192   42    2 : tunables    0    0    0 : slabdata    249    249      0
kmalloc-128         8889   9280    128   32    1 : tunables    0    0    0 : slabdata    290    290      0
kmalloc-96         13583  13902     96   42    1 : tunables    0    0    0 : slabdata    331    331      0
kmalloc-64         63116  64640     64   64    1 : tunables    0    0    0 : slabdata   1010   1010      0
kmalloc-32        552726 582272     32  128    1 : tunables    0    0    0 : slabdata   4549   4549      0
kmalloc-16        444768 445440     16  256    1 : tunables    0    0    0 : slabdata   1740   1740      0
kmalloc-8          18178  18432      8  512    1 : tunables    0    0    0 : slabdata     36     36      0

IOWs, if we do a 260 byte allocation, we get the same sized memory
chunk as a 512 byte allocation because they come from the same slab
cache.

If we now look at structure sizes - the common ones are buffers and
inodes, so we'll look at them. For an inode, we typically log something
like this for an extent allocation (or free) on a mostly contiguous
inode (say, fewer than 10 extents):

vec 1: inode log format item
vec 2: inode core
vec 3: inode data fork

Each of these vectors has a 12 byte log op header built into it, and
some padding to round it out to 8 byte alignment:

vec 1: inode log format item: 12 + 56 + 4 (pad)
vec 2: inode core:            12 + 176 + 4 (pad)
vec 3: inode data fork:       12 + 16 (minimum) + 4 (pad)
                              12 + 336 (maximum for 512 byte inode)

If we are just logging the inode core, we are allocating
12 + 56 + 4 + 12 + 176 + 4 = 264 bytes. It should be obvious now that
this must be allocated from the 512 byte slab, and that means we have
another 248 bytes of unused space in that allocated region we can
actually use -for free-.

IOWs, the fact that we add 32 bytes for the 2 iovecs needed to index
this inode log item doesn't matter at all - it's free space on the
heap. Indeed, it's not until the inode data fork gets to a couple of
hundred bytes in length that we overflow the 512 byte slab and have to
use the 1kB slab. Again, we get the iovec array space for free.
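To make that rounding concrete, here is a rough userspace sketch of the
arithmetic. The numbers are the ones from the worked example above (264
bytes of padded inode log data, 2 x 16 byte iovecs), not values pulled
from the kernel headers, and the 96/192 byte kmalloc caches are ignored
because they don't matter at these sizes:

/*
 * Userspace sketch of the slab rounding argument. Sizes are taken
 * from the inode-core-only example: 264 bytes of logged data plus
 * 32 bytes for the two embedded iovecs.
 */
#include <stdio.h>

/* Round an allocation request up to the next power-of-2 slab size. */
static size_t kmalloc_bucket(size_t size)
{
	size_t bucket = 8;

	while (bucket < size)
		bucket <<= 1;
	return bucket;
}

static void report(const char *what, size_t size)
{
	size_t bucket = kmalloc_bucket(size);

	printf("%-22s %4zu bytes -> kmalloc-%zu (%zu bytes of slack)\n",
	       what, size, bucket, bucket - size);
}

int main(void)
{
	size_t payload = 264;	/* inode log format item + inode core */
	size_t iovecs = 2 * 16;	/* 2 iovecs embedded in the log vec */

	report("inode core only:", payload);
	report("inode core + iovecs:", payload + iovecs);
	return 0;
}

Both cases land in the same kmalloc-512 object - the embedded iovec
array just consumes slack the slab allocator has already handed us.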
If we are logging the entire inode with the data fork, then the size of
the data being logged is 264 + 12 + 336 + 4 = 616 bytes. This is well
over the 512 byte slab, so we are always going to be allocating from
the 1kB slab. We get the iovec array for free the moment we go over the
512 byte threshold again.

IOWs, all the separation of the iovec array does is slightly change the
data/attr fork size thresholds where we go from using the 512 byte slab
to the 1kB slab.

A similar pattern holds for the buffer log items. The minimum will be:

vec 1: buf log format item
vec 2: single 128 byte chunk

This requires 12 + 40 + 4 + 12 + 128 + 4 = 200 bytes. For two vectors,
we need 32 bytes for the iovec array, so a total of 232 bytes is
needed, and this will fit in a 256 byte slab with or without the iovec
array attached.

The same situation occurs as we increase the number of logged regions
or the size of the logged regions - in almost all cases we get the
iovec array for free, because we log 128 byte regions out of buffers
and they will put us into the next largest slab size regardless of the
memory used by the iovec array.

Hence we should almost always get the space for the iovec array for
free from the slab allocator, and separating it out doesn't actually
reduce slab cache memory usage. If anything, it increases it, because
now we are allocating the iovec array out of small slabs, and so
instead of being "free" that memory usage is now accounted to the
smaller slabs...

-----

Hence, before we go any further with this patch set, I'd like to see
numbers that quantify how much extra memory the embedded iovec array is
actually costing us. And from that, an explanation of why the above
"iovec array space should be cost-free" logic isn't working as
intended....

-Dave.

--
Dave Chinner
david@xxxxxxxxxxxxx