On Thu, Feb 1, 2024 at 4:23 PM Robin Murphy <robin.murphy@xxxxxxx> wrote:
>
> On 2024-02-01 9:06 pm, Pasha Tatashin wrote:
> > On Thu, Feb 1, 2024 at 3:56 PM Robin Murphy <robin.murphy@xxxxxxx> wrote:
> >>
> >> On 2024-02-01 7:30 pm, Pasha Tatashin wrote:
> >>> From: Pasha Tatashin <pasha.tatashin@xxxxxxxxxx>
> >>>
> >>> The magazine buffers can take gigabytes of kmem memory, dominating all
> >>> other allocations. For observability purposes create a named slab cache
> >>> so the iova magazine memory overhead can be clearly observed.
> >>>
> >>> With this change:
> >>>
> >>>> slabtop -o | head
> >>> Active / Total Objects (% used)    : 869731 / 952904 (91.3%)
> >>> Active / Total Slabs (% used)      : 103411 / 103974 (99.5%)
> >>> Active / Total Caches (% used)     : 135 / 211 (64.0%)
> >>> Active / Total Size (% used)       : 395389.68K / 411430.20K (96.1%)
> >>> Minimum / Average / Maximum Object : 0.02K / 0.43K / 8.00K
> >>>
> >>>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> >>> 244412 244239  99%    1.00K  61103        4    244412K iommu_iova_magazine
> >>>  91636  88343  96%    0.03K    739      124      2956K kmalloc-32
> >>>  75744  74844  98%    0.12K   2367       32      9468K kernfs_node_cache
> >>>
> >>> On this machine it is now clear that the magazines use 242M of kmem memory.
> >>
> >> Hmm, something smells there...
> >>
> >> In the "worst" case there should be a maximum of 6 * 2 *
> >> num_online_cpus() empty magazines in the iova_cpu_rcache structures,
> >> i.e., 12KB per CPU. Under normal use those will contain at least some
> >> PFNs, but mainly every additional magazine stored in a depot is full
> >> with 127 PFNs, and each one of those PFNs is backed by a 40-byte struct
> >> iova, i.e. ~5KB per 1KB magazine. Unless that machine has many thousands
> >> of CPUs, if iova_magazine allocations are the top consumer of memory
> >> then something's gone wrong.
> >
> > This is an upstream kernel + a few drivers, booted on an AMD EPYC
> > with 128 CPUs.
> >
> > It has allocation stacks like these:
> > init_iova_domain+0x1ed/0x230 iommu_setup_dma_ops+0xf8/0x4b0
> > amd_iommu_probe_finalize,
> > and also init_iova_domain() calls from Google's TPU drivers. 242M is
> > actually not that much, compared to the size of the system.
>
> Hmm, I did misspeak slightly (it's late and I really should have left
> this for tomorrow...) - that's 12KB per CPU *per domain*, but still that
> would seem to imply well over 100 domains if you have 242MB of magazine
> allocations while the iommu_iova cache isn't even on the charts... what
> the heck is that driver doing?

I am not sure what the driver is doing. However, I can check the actual
allocation sizes for each init_iova_domain() and report on that later.

> (I don't necessarily disagree with the spirit of the patch BTW, I just
> really want to understand the situation that prompted it, and make sure
> we don't actually have a subtle leak somewhere.)

Yes, observability is needed here: there have been several optimizations
that reduced the size of these magazines, and they can still be large.

For example, for a while each magazine was 1032 bytes instead of 1024,
which wasted almost half of the magazine memory because the allocations
were served from the 2K slab. This was fixed with:
b4c9bf178ace iommu/iova: change IOVA_MAG_SIZE to 127 to save memory

Also, an earlier optimization, "32e92d9f6f87 iommu/iova: Separate out
rcache init", reduced the cases in which magazines need to be allocated.
That also reduced the overhead on our systems by a factor of 10.
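For reference, here is a rough sketch of the magazine sizing behind that
IOVA_MAG_SIZE change, and of the dedicated cache this patch introduces.
It is simplified and the helper names are only approximate; the real
code in drivers/iommu/iova.c may differ:

#include <linux/errno.h>
#include <linux/slab.h>

/*
 * Simplified sketch: one header word plus 127 PFN slots is exactly
 * 1024 bytes on a 64-bit kernel, so a magazine fits in a 1K slab.
 * With 128 slots it was 1032 bytes and had to come from the 2K slab.
 */
#define IOVA_MAG_SIZE 127

struct iova_magazine {
	unsigned long size;
	unsigned long pfns[IOVA_MAG_SIZE];
};

/* Dedicated cache so the memory is attributed by name in slabtop. */
static struct kmem_cache *iova_magazine_cache;

static int __init iova_magazine_cache_init(void)
{
	iova_magazine_cache = kmem_cache_create("iommu_iova_magazine",
						sizeof(struct iova_magazine),
						0, 0, NULL);
	return iova_magazine_cache ? 0 : -ENOMEM;
}

static struct iova_magazine *iova_magazine_alloc(gfp_t flags)
{
	return kmem_cache_zalloc(iova_magazine_cache, flags);
}

With a dedicated cache, these allocations show up under
"iommu_iova_magazine" in /proc/slabinfo and slabtop instead of being
folded into the generic kmalloc-1k cache.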
Yet the magazines are still large, and I think it is time to improve
observability, both for future optimizations and to avoid future
regressions.

Pasha