On Thu, Jul 13, 2023 at 07:41:53PM +0800, Qu Wenruo wrote:
> On 2023/7/13 19:26, David Sterba wrote:
> > On Thu, Jul 13, 2023 at 07:58:17AM +0800, Qu Wenruo wrote:
> >> On 2023/7/13 00:41, Christoph Hellwig wrote:
> >>> On Wed, Jul 12, 2023 at 02:37:40PM +0800, Qu Wenruo wrote:
> >>>> One of the biggest problems for the metadata folio conversion is that
> >>>> we still need the current page based solution (or folios with order 0)
> >>>> as a fallback solution when we can not get a high order folio.
> >>>
> >>> Do we?  btrfs by default uses a 16k nodesize (order 2 on x86), with
> >>> a maximum of 64k (order 4).  IIRC we should be able to get them pretty
> >>> reliably.
> >>
> >> If it can be done as reliably as order 0 with NOFAIL, I'm totally fine
> >> with that.
> >
> > I have mentioned my concerns about the allocation problems with higher
> > order than 0 in the past. The allocator gives some guarantees about not
> > failing for certain levels, currently it's 1 (mm/fail_page_alloc.c,
> > fail_page_alloc.min_order = 1).
> >
> > Per the comment in page_alloc.c:rmqueue():
> >
> > 2814         /*
> > 2815          * We most definitely don't want callers attempting to
> > 2816          * allocate greater than order-1 page units with __GFP_NOFAIL.
> > 2817          */
> > 2818         WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> >
> > For allocations with higher order, eg. 4 to match the default 16K nodes,
> > this increases pressure and can trigger compaction, see the logic around
> > PAGE_ALLOC_COSTLY_ORDER, which is 3.
> >
> >>> If not, the best thing is to just use a virtually contiguous allocation
> >>> as fallback, i.e. use vm_map_ram.
> >
> > So we can allocate order-0 pages and then map them to virtual addresses,
> > which needs manipulation of PTEs (page table entries) and requires
> > additional memory. This is what xfs does in
> > fs/xfs/xfs_buf.c:_xfs_buf_map_pages(); it needs some care with aliased
> > memory, so vm_unmap_aliases() is required and brings some overhead, and
> > at the end vm_unmap_ram() needs to be called, another overhead but
> > probably bearable.
> >
> > With all that in place there would be a contiguous memory range
> > representing the metadata, so a simple memcpy() can be done. Sure,
> > with higher overhead and decreased reliability due to potentially
> > failing memory allocations - for metadata operations.
> >
> > Compare that to what we have:
> >
> > Pages are allocated as order 0, so there's a much higher chance to get
> > them under pressure, and we don't increase the pressure otherwise. We
> > don't need any virtual mappings. The cost is that we have to iterate the
> > pages and do the partial copying ourselves, but this is hidden in helpers.
> >
> > We have a different usage pattern for the metadata buffers than xfs, so
> > the fact that it does something with vmapped contiguous buffers may not
> > be easily transferable to btrfs and could bring us new problems.
> >
> > The conversion to folios will happen eventually, though I don't want to
> > sacrifice reliability just for API convenience. First the conversion
> > should be done 1:1, with pages and folios both order 0, before switching
> > to some higher order allocations hidden behind API calls.
>
> In fact, I have another solution as a middle ground before bringing folios
> into the situation.
>
> Check if the pages are already physically contiguous.
> If so, everything can go without any cross-page handling.
>
> If not, we can either keep the current cross-page handling, or migrate
> to virtually contiguous mapped pages.
>
> Currently we already have around 50~66% of eb pages allocated physically
> contiguous.

Memory fragmentation becomes a problem over time on systems running for
weeks or months, and the contiguous ranges then become scarce. So if you
measure that on a system with a lot of memory and only for a short time,
then of course it will show a high rate of contiguous pages.
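
(For reference, a rough sketch of the contiguity check being discussed.
This is not actual btrfs code, the helper name is made up, and it assumes
the existing eb->pages[] array and num_extent_pages() in
fs/btrfs/extent_io.c:)

/*
 * Illustration only: return true if the order-0 pages backing an extent
 * buffer happen to be physically contiguous, i.e. their PFNs form one
 * consecutive run, so the buffer could be accessed as a single linear
 * range without the cross-page helpers.
 */
static bool eb_pages_contiguous(const struct extent_buffer *eb)
{
	unsigned long first_pfn = page_to_pfn(eb->pages[0]);
	int num_pages = num_extent_pages(eb);
	int i;

	for (i = 1; i < num_pages; i++) {
		if (page_to_pfn(eb->pages[i]) != first_pfn + i)
			return false;
	}

	return true;
}

Such a check could only be a fast path: the cross-page helpers (or a
vmapped fallback) would still have to cover the non-contiguous case.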
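
(And, for completeness, a similarly rough sketch of the vmapped fallback
described above, i.e. keeping order-0 pages and mapping them linearly the
way xfs does in _xfs_buf_map_pages(). The function names are made up, and
error handling as well as the vm_unmap_aliases() care are omitted:)

/*
 * Illustration only: map an array of order-0 pages into one virtually
 * contiguous range with vm_map_ram() (declared in <linux/vmalloc.h>),
 * and undo the mapping with vm_unmap_ram() when the buffer is torn down.
 */
static void *map_metadata_pages(struct page **pages, unsigned int nr_pages)
{
	/* NUMA_NO_NODE: no preference which node backs the mapping */
	return vm_map_ram(pages, nr_pages, NUMA_NO_NODE);
}

static void unmap_metadata_pages(void *vaddr, unsigned int nr_pages)
{
	vm_unmap_ram(vaddr, nr_pages);
}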