On Sun, Nov 17, 2024 at 8:22 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Mon, Nov 18, 2024 at 05:14:14PM +1300, Barry Song wrote:
> > On Mon, Nov 18, 2024 at 5:03 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > >
> > > On Sat, Nov 16, 2024 at 09:16:58AM +0000, Chen Ridong wrote:
> > > > 2. In the shrink_page_list function, if folioN is a THP (2M), it may be
> > > > split and added to the swap cache folio by folio. After adding to the
> > > > swap cache, it will submit IO to write the folio back to swap, which is
> > > > asynchronous. When shrink_page_list is finished, the isolated folios
> > > > list will be moved back to the head of the inactive lru. The inactive
> > > > lru may then look like this, with 512 folios having been moved to the
> > > > head of the inactive lru.
> > >
> > > I was hoping that we'd be able to stop splitting the folio when adding
> > > to the swap cache. Ideally, we'd add the whole 2MB and write it back
> > > as a single unit.
> >
> > This is already the case: adding to the swapcache doesn’t require splitting
> > THPs, but failing to allocate 2MB of contiguous swap slots will.
>
> Agreed, we need to understand why this is happening. As I've said a few
> times now, we need to stop requiring contiguity. Real filesystems don't
> need the contiguity (they become less efficient, but they can scatter a
> single 2MB folio to multiple places).
>
> Maybe Chris has a solution to this in the works?

Hi Matthew and Chen Ridong,

Sorry for the late reply. I don't have a working solution yet, just some
ideas.

One of the big challenges is what to do with the swap cache. Currently,
when a folio is added to the swap cache, it is assumed to occupy
contiguous swap entries. Breaking that assumption would add a lot of
complexity. To make things worse, discontiguous swap entries might
belong to different xarrays due to the 64M swap address space sharding.

One idea is that we can have a special kind of swap device that does
swap entry redirecting.

For the swap-out path, let's say the real swapfile A is almost full and
we want to allocate 4 (order-2) swap entries for folio F. If there are
contiguous swap entries in A, the swap allocator just returns entries
[A9..A12], with A9 as the head swap entry. That is the same as the
normal path we have now.

On the other hand, suppose there are no contiguous swap entries in A,
only the non-contiguous entries A1, A3, A5, A7. In that case we instead
allocate R1, R2, R3, R4 from a special redirecting swap device R,
together with an IO redirecting array [R1, A1, A3, A5, A7]. Swap device
R is virtual; there is no real file backing it, so the swap file size
on R can grow or shrink as needed.

In add_to_swap_cache(), we set folio F->swap = R1 and add F into swap
cache S with entries [R1..R4] pointing to folio F. In other words,
S[R1..R4] = F. We also add the lookup xarray L[R1..R4] =
[R1, A1, A3, A5, A7]. For the rest of the code, R1 is passed around as
the contiguous head swap entry for folio F.

swap_writepage_bdev_async() will recognize R as a special device. It
will look up L[R1] to get [R1, A1, A3, A5, A7] and use that entry list
to build the bio with 4 bio vecs instead of 1, filling [A1, A3, A5, A7]
into the bio vecs. That is the swap write path.

For swap-in, the page fault handler gets a fault at address X and finds
a pte containing swap entry R3. It looks up the swap cache at S[R3] and
gets nothing: folio F is not in the swap cache. It recognizes that R is
a remapping device, so the swap core looks up L[R3] =
[R1, A1, A3, A5, A7]. If we want to swap in an order-2 folio, we then
construct the swap_read_folio_bdev_async() IO with [A1, A3, A5, A7]. If
we just want to swap in a single 4K page, we can use [A5] alone, given
that R3 is at offset 2 from the head entry R1. That is the read path.
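To make the bookkeeping a bit more concrete, here is a rough, untested
sketch of what the redirect record and the lookup tree L might look
like. All of the names below (struct swap_redirect, swap_redirect_tree,
swap_redirect_add) are made up for illustration; only swp_entry_t,
swp_offset() and the xarray API are existing kernel interfaces, and the
locking and GFP details are hand-waved:

/* Untested sketch -- the identifiers here are hypothetical. */
#include <linux/xarray.h>
#include <linux/slab.h>
#include <linux/overflow.h>
#include <linux/swap.h>
#include <linux/swapops.h>

/*
 * One record per folio that could not get contiguous slots: the head
 * entry R1 on the virtual device R plus the scattered backing slots
 * [A1, A3, A5, A7] on the real swapfile A.
 */
struct swap_redirect {
        swp_entry_t head;               /* R1 */
        unsigned int nr;                /* number of backing slots */
        swp_entry_t backing[];          /* A1, A3, A5, A7, ... */
};

/* L: indexed by the offsets of the virtual entries R1..R4 within R. */
static DEFINE_XARRAY(swap_redirect_tree);

/* Record that virtual entries [R1 .. R1+nr) redirect to @backing[]. */
static int swap_redirect_add(swp_entry_t r1, const swp_entry_t *backing,
                             unsigned int nr)
{
        struct swap_redirect *rd;
        unsigned int i;

        rd = kmalloc(struct_size(rd, backing, nr), GFP_KERNEL);
        if (!rd)
                return -ENOMEM;

        rd->head = r1;
        rd->nr = nr;
        for (i = 0; i < nr; i++)
                rd->backing[i] = backing[i];

        /* Every Rn maps to the same record, so a fault on R3 finds it too. */
        for (i = 0; i < nr; i++) {
                void *old = xa_store(&swap_redirect_tree,
                                     swp_offset(r1) + i, rd, GFP_KERNEL);
                if (xa_is_err(old)) {
                        /* error unwind of earlier slots omitted */
                        kfree(rd);
                        return xa_err(old);
                }
        }
        return 0;
}

swap_writepage_bdev_async() (or a new helper next to it) would consult
the same record at write time. One detail I'm glossing over: a bio
describes a single contiguous range on the backing device, so scattered
slots like [A1, A3, A5, A7] would presumably end up as one bio per
discontiguous run (or a chain of bios sharing a completion) rather than
literally one bio carrying four vecs.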
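On the swap-in side, translating the faulting entry back is just offset
arithmetic against the same record. Again a rough sketch with an
invented name (swap_redirect_lookup); the actual read would still go
through something like swap_read_folio_bdev_async() against the real
device A:

/* Untested sketch, continuing from swap_redirect_add() above. */

/*
 * Translate a faulting virtual entry (e.g. R3) into the real backing
 * slot on swapfile A (A5 in the running example), so the caller can
 * issue the read against the real device. The caller has already
 * checked that swp_type(entry) is the redirect device R.
 */
static swp_entry_t swap_redirect_lookup(swp_entry_t entry)
{
        struct swap_redirect *rd;
        unsigned long idx;

        rd = xa_load(&swap_redirect_tree, swp_offset(entry));
        if (!rd)
                return entry;           /* not redirected, use as-is */

        idx = swp_offset(entry) - swp_offset(rd->head);  /* R3 - R1 = 2 */
        if (WARN_ON_ONCE(idx >= rd->nr))
                return entry;

        return rd->backing[idx];        /* A5 */
}

A larger-order swap-in would walk backing[0..nr) the same way, reading
each discontiguous slot (or merging adjacent ones) into the right
subpages of the folio.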
For simplicity, a lot of detail has been omitted from this description.
Also, on the implementation side there are a lot of optimizations we
might be able to do, e.g. using a direct pointer lookup for R1 instead
of an xarray, or using a struct to hold R1 and [A1, A3, A5, A7], etc.

This approach avoids a lot of the complexity of breaking the contiguity
assumption for swap cache entries, at the cost of the additional swap
cache address space R. The lookup mapping L[R1..R4] =
[R1, A1, A3, A5, A7] is the minimal data structure needed to track the
IO remapping; I think that is unavoidable.

Please let me know if you see any problems with the above approach. As
always, feedback is welcome.

Thanks

Chris