On Fri, May 31, 2024 at 10:37 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Chris Li <chrisl@xxxxxxxxxx> writes:
>
> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >>
> >> Chris Li <chrisl@xxxxxxxxxx> writes:
> >>
> >> > Hi Ying,
> >> >
> >> > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >> >>
> >> >> Chris Li <chrisl@xxxxxxxxxx> writes:
> >> >>
> >> >> > I am spinning a new version for this series to address two issues
> >> >> > found in this series:
> >> >> >
> >> >> > 1) Oppo discovered a bug in the following line:
> >> >> >    + ci = si->cluster_info + tmp;
> >> >> >    Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> >> >> >    That is a serious bug but trivial to fix.
> >> >> >
> >> >> > 2) order 0 allocation currently blindly scans swap_map disregarding
> >> >> >    the cluster->order.
> >> >>
> >> >> IIUC, now, we only scan swap_map[] only if
> >> >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]).
> >> >> That is, if you doesn't run low swap free space, you will not do that.
> >> >
> >> > You can still swap space in order 0 clusters while order 4 runs out of
> >> > free_cluster or nonfull_clusters[order]. For Android that is a common case.
> >>
> >> When we fail to allocate order 4, we will fallback to order 0. Still
> >> don't need to scan swap_map[]. But after looking at your below reply, I
> >> realized that the swap space is almost full at most times in your cases.
> >> Then, it's possible that we run into scanning swap_map[].
> >> list_empty(&si->free_clusters) &&
> >> list_empty(&si->nonfull_clusters[order]) will become true, if we put too
> >> many clusters in si->percpu_cluster. So, if we want to avoid to scan
> >> swap_map[], we can stop add clusters in si->percpu_cluster when swap
> >> space runs low. And maybe take clusters out of si->percpu_cluster
> >> sometimes.
> >
> > One idea after reading your reply. If we run out of the
> > nonfull_cluster[order], we should be able to use other cpu's
> > si->percpu_cluster[] as well. That is a very small win for Android,
>
> This will be useful in general. The number CPU may be large, and
> multiple orders may be used.
>
> > because android does not have too many cpu. We are talking about a
> > handful of clusters, which might not justify the code complexity. It
> > does not change the behavior that order 0 can pollut higher order.
>
> I have a feeling that you don't really know why swap_map[] is scanned.
> I suggest you to do more test and tracing to find out the reason. I
> suspect that there are some non-full cluster collection issues.
>
> >> Another issue is nonfull_cluster[order1] cannot be used for
> >> nonfull_cluster[order2]. In definition, we should not fail order 0
> >> allocation, we need to steal nonfull_cluster[order>0] for order 0
> >> allocation. This can avoid to scan swap_map[] too. This may be not
> >> perfect, but it is the simplest first step implementation. You can
> >> optimize based on it further.
> >
> > Yes, that is listed as the limitation of this cluster order approach.
> > Initially we need to support one order well first. We might choose
> > what order that is, 16K or 64K folio. 4K pages are too small, 2M pages
> > are too big. The sweet spot might be some there in between. If we can
> > support one order well, we can demonstrate the value of the mTHP. We
> > can worry about other mix orders later.
> >
> > Do you have any suggestions for how to prevent the order 0 polluting
> > the higher order clusters? If we allow that to happen, then it defeats
> > the goal of being able to allocate higher order swap entries. The
> > tricky question is we don't know how much swap space we should reserve
> > for each order. We can always break higher order clusters to lower
> > order, but can't do the reserves. The current patch series lets the
> > actual usage determine the percentage of the cluster for each order.
> > However that seems not enough for the test case Barry has. When the
> > app gets OOM kill that is where a large swing of order 0 swap will
> > show up and not enough higher order usage for the brief moment. The
> > order 0 swap entry will pollute the high order cluster. We are
> > currently debating a "knob" to be able to reserve a certain % of swap
> > space for a certain order. Those reservations will be guaranteed and
> > order 0 swap entry can't pollute them even when it runs out of swap
> > space. That can make the mTHP at least usable for the Android case.
>
> IMO, the bottom line is that order-0 allocation is the first class
> citizen, we must keep it optimized. And, OOM with free swap space isn't
> acceptable. Please consider the policy we used for page allocation.
>
> > Do you see another way to protect the high order cluster polluted by
> > lower order one?
>
> If we use high-order page allocation as reference, we need something
> like compaction to guarantee high-order allocation finally. But we are
> too far from that.
>
> For specific configuration, I believe that we can get reasonable
> high-order swap entry allocation success rate for specific use cases.
> For example, if we only do limited maximum number order-0 swap entries
> allocation, can we keep high-order clusters?

Doesn't limiting order-0 allocation break the bottom line that order-0
allocation is the first-class citizen and should not fail if there is
space? Just my two cents...

I had a try locally based on Chris's work, allowing order 0 to use
nonfull_clusters as Ying has suggested: start with the lowest order and
increase the order until nonfull_clusters[order] is not empty. That way
higher orders are better protected, because unless we run out of both
free_clusters and nonfull_clusters, the direct swap_map[] scan won't
happen.

More concretely, I applied the following changes, which didn't change
the code much:

- In scan_swap_map_try_ssd_cluster(), check nonfull_clusters first, then
  free_clusters, then discard_clusters.
- For order 0, also walk nonfull_clusters[i] for every order
  (i = 0 .. SWAP_NR_ORDERS - 1) before scan_swap_map_try_ssd_cluster()
  returns false. A rough sketch of this fallback order follows below.

A quick test, still using the memtier benchmark, but with the swap
device size decreased from 10G to 8G for higher pressure.
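To make the search order above concrete, here is a self-contained toy
sketch in plain C (for illustration only -- it is not the actual
swapfile.c change; SWAP_NR_ORDERS and the field names just mirror the
series, plain counters stand in for the real cluster lists, and the
discard_clusters step is elided):

/*
 * Toy model of the cluster search order described above.
 */
#include <stdbool.h>
#include <stdio.h>

#define SWAP_NR_ORDERS 5        /* assumption: orders 0..4 in this toy model */

struct toy_swap_info {
        int nonfull_clusters[SWAP_NR_ORDERS];   /* partially used clusters per order */
        int free_clusters;                      /* completely free clusters */
};

/*
 * Returns true if a cluster could be picked for @order without falling
 * back to the slow swap_map[] scan.
 */
static bool toy_try_ssd_cluster(struct toy_swap_info *si, int order)
{
        /* 1) Prefer a non-full cluster already assigned to this order. */
        if (si->nonfull_clusters[order] > 0) {
                si->nonfull_clusters[order]--;
                return true;
        }

        /* 2) Otherwise take a free cluster and assign it to this order. */
        if (si->free_clusters > 0) {
                si->free_clusters--;
                si->nonfull_clusters[order]++;
                return true;
        }

        /*
         * 3) Order 0 only: steal a non-full cluster from another order,
         *    starting from the lowest order so higher orders are touched
         *    last, instead of falling back to scanning swap_map[].
         */
        if (order == 0) {
                for (int i = 1; i < SWAP_NR_ORDERS; i++) {
                        if (si->nonfull_clusters[i] > 0) {
                                si->nonfull_clusters[i]--;
                                si->nonfull_clusters[0]++;
                                return true;
                        }
                }
        }

        return false;   /* caller would fall back to the swap_map[] scan */
}

int main(void)
{
        /* Nearly full device: no free clusters, two non-full order-4 clusters. */
        struct toy_swap_info si = { .nonfull_clusters = { 0, 0, 0, 0, 2 } };

        printf("order 4: %s\n", toy_try_ssd_cluster(&si, 4) ? "ok" : "scan");
        printf("order 2: %s\n", toy_try_ssd_cluster(&si, 2) ? "ok" : "scan");
        printf("order 0: %s\n", toy_try_ssd_cluster(&si, 0) ? "ok" : "scan");
        return 0;
}

The real code of course works on struct swap_cluster_info lists under
the swap device lock, but the fallback order (per-order nonfull list ->
free list -> steal for order 0 -> swap_map[] scan) is the idea being
tested.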
Before:
hugepages-32kB/stats/swpout:34013
hugepages-32kB/stats/swpout_fallback:266
hugepages-512kB/stats/swpout:0
hugepages-512kB/stats/swpout_fallback:77
hugepages-2048kB/stats/swpout:0
hugepages-2048kB/stats/swpout_fallback:1
hugepages-1024kB/stats/swpout:0
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:35088
hugepages-64kB/stats/swpout_fallback:66
hugepages-16kB/stats/swpout:31848
hugepages-16kB/stats/swpout_fallback:402
hugepages-256kB/stats/swpout:390
hugepages-256kB/stats/swpout_fallback:7244
hugepages-128kB/stats/swpout:28573
hugepages-128kB/stats/swpout_fallback:474

After:
hugepages-32kB/stats/swpout:31448
hugepages-32kB/stats/swpout_fallback:3354
hugepages-512kB/stats/swpout:30
hugepages-512kB/stats/swpout_fallback:33
hugepages-2048kB/stats/swpout:2
hugepages-2048kB/stats/swpout_fallback:0
hugepages-1024kB/stats/swpout:0
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:31255
hugepages-64kB/stats/swpout_fallback:3112
hugepages-16kB/stats/swpout:29931
hugepages-16kB/stats/swpout_fallback:3397
hugepages-256kB/stats/swpout:5223
hugepages-256kB/stats/swpout_fallback:2351
hugepages-128kB/stats/swpout:25600
hugepages-128kB/stats/swpout_fallback:2194

The high order (256kB) swapout rate is significantly higher, and 512kB
swapout is now possible, which indicates that higher orders are better
protected; lower orders are sacrificed a bit, but that seems worth it.
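As a rough quantification of the numbers above: the 256kB swpout success
rate goes from about 390 / (390 + 7244) ~= 5% to 5223 / (5223 + 2351)
~= 69%, and 512kB goes from 0 / 77 to 30 / (30 + 33) ~= 48%, while the
16kB-64kB fallback rates rise from roughly 1% to roughly 10%.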