On Wed, 21 Mar 2018, Matthew Wilcox wrote:

> On Wed, Mar 21, 2018 at 12:39:33PM -0500, Christopher Lameter wrote:
> > One other thought: If you want to improve the behavior for large scale
> > objects allocated through kmalloc/kmemcache then we would certainly be
> > glad to entertain those ideas.
> >
> > F.e. you could optimize the allcations > 2x PAGE_SIZE so that they do not
> > allocate powers of two pages. It would be relatively easy to make
> > kmalloc_large round the allocation to the next page size and then allocate
> > N consecutive pages via alloc_pages_exact() and free the remainder unused
> > pages or some such thing.

alloc_pages_exact() has O(n*log n) complexity with respect to the number
of requested pages. It would have to be reworked and optimized if it were
to be used for the dm-bufio cache. (It could be optimized down to O(log n)
if it didn't split the compound page into a lot of separate pages, but
split it into power-of-two clusters instead.)

> I don't know if that's a good idea.  That will contribute to fragmentation
> if the allocation is held onto for a short-to-medium length of time.
> If the allocation is for a very long period of time then those pages
> would have been unavailable anyway, but if the user of the tail pages
> holds them beyond the lifetime of the large allocation, then this is
> probably a bad tradeoff to make.

The problem with alloc_pages_exact() is that it exhausts all the
high-order pages and leaves many free low-order pages around. So you'll
end up with a system that has a lot of free memory, but with all
high-order pages missing. As there would be a lot of free memory, the
kswapd thread would not be woken up to free some high-order pages.

I think that using slab with a high order is better, because it at least
doesn't leave many low-order pages behind.

> I do see Mikulas' use case as interesting, I just don't know whether it's
> worth changing slab/slub to support it.  At first blush, other than the
> sheer size of the allocations, it's a good fit.

All I need is to increase the order of a specific slab cache - I think
it's better to implement an interface that allows doing it than to
duplicate the slab cache code.

BTW, it would be possible to open the file "/sys/kernel/slab/<cache>/order"
from the dm-bufio kernel driver and write the requested value there, but
that seems very dirty. It would be better to have a kernel interface for
that.

Mikulas
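
As an illustration of the O(log n) remark above, here is a minimal
user-space sketch (not kernel code) that counts how many free operations
the unused tail of a rounded-up allocation needs when it is returned as
naturally aligned power-of-two clusters rather than page by page. The
helper names order_for(), tail_clusters() and tail_single_pages() are
hypothetical and exist only for this example; it assumes the request is
first rounded up to a single 2^order block.

```c
#include <assert.h>

/* Hypothetical helper: smallest order with (1UL << order) >= nr_pages. */
static unsigned int order_for(unsigned long nr_pages)
{
	unsigned int order = 0;

	while ((1UL << order) < nr_pages)
		order++;
	return order;
}

/*
 * Count the chunks needed to return the unused tail
 * [nr_pages, 1 << order) as naturally aligned power-of-two clusters.
 * Each step frees the largest aligned cluster starting at 'pos', so the
 * loop runs at most 'order' times.
 */
static unsigned int tail_clusters(unsigned long nr_pages)
{
	unsigned long end, pos;
	unsigned int chunks = 0;

	assert(nr_pages > 0);	/* pos == 0 would never advance */

	end = 1UL << order_for(nr_pages);
	pos = nr_pages;
	while (pos < end) {
		pos += pos & -pos;	/* lowest set bit of pos */
		chunks++;
	}
	return chunks;
}

/* Count the frees needed when the tail is returned one page at a time. */
static unsigned long tail_single_pages(unsigned long nr_pages)
{
	return (1UL << order_for(nr_pages)) - nr_pages;
}
```

For a 33-page request rounded up to order 6 (64 pages), the 31-page tail
is returned as 5 clusters (1, 2, 4, 8 and 16 pages) instead of 31
single-page frees. This only addresses the cost of the split, not the
fragmentation concern discussed above.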