On Fri, Aug 19, 2016 at 03:44:26PM +0900, Minchan Kim wrote:
> On Thu, Aug 18, 2016 at 08:44:13PM -0700, Huang, Ying wrote:
> > Minchan Kim <minchan@xxxxxxxxxx> writes:
> >
> > > Hi Huang,
> > >
> > > On Thu, Aug 18, 2016 at 10:19:32AM -0700, Huang, Ying wrote:
> > >> Minchan Kim <minchan@xxxxxxxxxx> writes:
> > >>
> > >> > Hi Tim,
> > >> >
> > >> > On Wed, Aug 17, 2016 at 10:24:56AM -0700, Tim Chen wrote:
> > >> >> On Wed, 2016-08-17 at 14:07 +0900, Minchan Kim wrote:
> > >> >> > On Tue, Aug 16, 2016 at 07:06:00PM -0700, Huang, Ying wrote:
> > >> >> > >
> > >> >> > > >
> > >> >> > > > I think Tim and I discussed that a few weeks ago.
> > >> >> > > I work closely with Tim on swap optimization. This patchset is part
> > >> >> > > of our swap optimization plan.
> > >> >> > >
> > >> >> > > > Please search below topics.
> > >> >> > > >
> > >> >> > > > [1] mm: Batch page reclamation under shink_page_list
> > >> >> > > > [2] mm: Cleanup - Reorganize the shrink_page_list code into smaller functions
> > >> >> > > >
> > >> >> > > > It's different from yours, which focuses on THP swapping, while the
> > >> >> > > > suggestion is more general; if we can do it, it's worth trying, I think.
> > >> >> > > I think the general optimization above will benefit both normal pages
> > >> >> > > and THP, at least for now. And I think there is no hard conflict
> > >> >> > > between those two patchsets.
> > >> >> > If we could do the general optimization, I guess THP swap without
> > >> >> > splitting would be more straightforward.
> > >> >> >
> > >> >> > If we can reclaim a batch of pages all at once, it helps because we can
> > >> >> > do scan_swap_map(si, SWAP_HAS_CACHE, nr_pages). The nr_pages could be
> > >> >> > greater or less than 512 pages. With that, scan_swap_map can effectively
> > >> >> > search for empty swap slots in the swap map or the free cluster list.
> > >> >> > Then, the part needed from your patchset is just to delay splitting of
> > >> >> > the THP.
> > >> >> >
> > >> >> > >
> > >> >> > > The THP swap has more opportunity to be optimized, because we can batch
> > >> >> > > 512 operations together more easily. For full THP swap support, unmapping
> > >> >> > > a THP could be more efficient with only one swap count operation instead
> > >> >> > > of 512, and so could many other operations, such as add/remove from swap
> > >> >> > > cache with a multi-order radix tree, etc. And it will help with memory
> > >> >> > > fragmentation. The THP can be kept across swapping out/in, with no need
> > >> >> > > to rebuild it via khugepaged.
> > >> >> > It seems you increased the cluster size to 512 and search for an empty
> > >> >> > cluster for a THP swap. With that approach, I have a concern that once
> > >> >> > clusters become fragmented, THP swap support gives no benefit at all.
> > >> >> >
> > >> >> > Why do we need an empty cluster for swapping out 512 pages?
> > >> >> > IOW, the case below could work for the goal.
> > >> >> >
> > >> >> > A : Allocated slot
> > >> >> > F : Free slot
> > >> >> >
> > >> >> > cluster A    cluster B
> > >> >> > AAAAFFFF  -  FFFFAAAA
> > >> >> >
> > >> >> > That's one of the reasons I suggested doing the batch reclaim work first
> > >> >> > and supporting THP swap based on it. With that, scan_swap_map can be aware
> > >> >> > of nr_pages and select the right clusters.
> > >> >> >
> > >> >> > With that approach, justification of THP swap support would be easier,
> > >> >> > too. IOW, I'm not sure how valuable THP swap support alone is in real
> > >> >> > workloads.
> > >> >> >
> > >> >> > Anyways, that's just my two cents.
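
To make the point above concrete, here is a toy userspace sketch of an
nr_pages-aware allocator working on the AAAAFFFF / FFFFAAAA layout from the
example. All names (toy_scan_swap_map, TOY_NR_SLOTS) are invented for
illustration; this is not the kernel's scan_swap_map, just a model of the
idea that a batch request does not need a fully empty cluster.

#include <stdio.h>

#define TOY_NR_SLOTS	16	/* two "clusters" of 8 slots each */

static unsigned char swap_map[TOY_NR_SLOTS] = {
	1, 1, 1, 1, 0, 0, 0, 0,	/* cluster A: AAAAFFFF */
	0, 0, 0, 0, 1, 1, 1, 1,	/* cluster B: FFFFAAAA */
};

/*
 * Hand back up to nr_pages free slots in a single pass over the swap map,
 * taking whatever is free right now instead of insisting on a whole empty
 * cluster.  Returns the number of slots actually allocated.
 */
static int toy_scan_swap_map(int nr_pages, int *slots)
{
	int got = 0;

	for (int i = 0; i < TOY_NR_SLOTS && got < nr_pages; i++) {
		if (swap_map[i] == 0) {
			swap_map[i] = 1;	/* mark the slot as allocated */
			slots[got++] = i;
		}
	}
	return got;
}

int main(void)
{
	int slots[TOY_NR_SLOTS];
	int got = toy_scan_swap_map(8, slots);

	printf("allocated %d slots:", got);
	for (int i = 0; i < got; i++)
		printf(" %d", slots[i]);
	printf("\n");	/* slots 4..11: the free halves of clusters A and B */
	return 0;
}

In this toy run the 8 requested slots come out of the free halves of clusters
A and B, which happen to be adjacent across the cluster boundary, so no fully
empty cluster is needed.
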
> > >> >>
> > >> >> Minchan,
> > >> >>
> > >> >> Scanning for contiguous slots that span clusters may take quite a
> > >> >> long time under fragmentation, and may eventually fail. In that case
> > >> >> the additional scan time overhead may go to waste and defeat the
> > >> >> purpose of fast swapping of a large page.
> > >> >>
> > >> >> The empty cluster lookup, on the other hand, is very fast.
> > >> >> We treat the empty-cluster-available case as an opportunity for
> > >> >> fast-path swap out of a large page. Otherwise, we'll revert to the
> > >> >> current slow-path behavior of breaking it into normal pages, so there's
> > >> >> no regression, and we may get a speedup. We can be considerably faster
> > >> >> when a lot of large pages are used.
> > >> >
> > >> > I didn't mean we should search the swap map first without peeking at a
> > >> > free cluster; what I wanted was that we might abstract it into
> > >> > scan_swap_map.
> > >> >
> > >> > For example, if nr_pages is greater than the size of a cluster, we can
> > >> > get an empty cluster first and take the remaining nr_pages - sizeof(cluster)
> > >> > from another free cluster or by scanning the current CPU's per-cpu
> > >> > cluster. If we hit a used slot during scanning, we can simply bail out.
> > >> > Then, although we fail to get *all* contiguous slots, we get a certain
> > >> > number of contiguous slots, so it would be a benefit from the sequential
> > >> > write and lock batching point of view, at the cost of a little scanning.
> > >> > And it's not specific to the THP algorithm.
> > >>
> > >> Firstly, if my understanding is correct, to batch swapping out normal
> > >> pages, the swap slots need not be continuous. But for THP swap support,
> > >> we need continuous swap slots. So I think the requirements are quite
> > >> different between them.
> > >
> > > Hmm, I don't understand.
> > >
> > > Let's think about it from the swap slot management layer's point of view.
> > > It doesn't need to care whether a batch request is caused by a THP page
> > > or by multiple normal pages.
> > >
> > > The matter is just that the VM now asks for multiple swap slots for
> > > several LRU-order pages, so swap slot management tries to allocate
> > > several slots under one lock. Sure, it would be great if the slots were
> > > fully consecutive, because ideally that means a fast, big sequential
> > > write as well as readahead. However, it would still be better even if we
> > > didn't get consecutive slots, because we get multiple slots all at once
> > > by batching.
> > >
> > > It's not a THP-specific requirement, I think.
> > > Currently, SWAP_CLUSTER_MAX might be too small to get a benefit from
> > > normal page batching, but it could be changed later once we implement
> > > the batching logic nicely.
> >
> > Whether the slots must be consecutive or not may greatly influence the
> > performance of the swap slot allocation function. For example, suppose
> > there are some non-consecutive free swap slots at the beginning of the
> > swap space, and some consecutive free swap slots at the end. If
> > consecutive swap slots are needed, the function may need to scan from
> > the beginning to the end. If non-consecutive swap slots are acceptable,
> > it can just return the swap slots at the beginning of the swap space.
>
> Don't get me wrong. I never said consecutive swap slot allocation is
> unimportant, nor that we should fully scan swap_map to search for
> consecutive swap slots.
>
> For both multiple normal page swap and THP swap, consecutive swap slot
> allocation is important, so it's the same requirement, and I want to
> abstract it regardless of THP swap.
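
As a rough model of the allocation order described above (grab a whole empty
cluster when one is available, otherwise scan and bail out at the first used
slot so the caller still gets a shorter contiguous run), here is a toy
userspace sketch. Every name in it is invented for illustration; it is not
the kernel code.

#include <stdio.h>

#define NR_SLOTS	32
#define CLUSTER_SZ	8
#define NR_CLUSTERS	(NR_SLOTS / CLUSTER_SZ)

static unsigned char swap_map[NR_SLOTS] = {
	1, 1, 0, 0, 0, 0, 1, 0,	/* cluster 0: partly used */
	0, 0, 0, 0, 0, 0, 0, 0,	/* cluster 1: empty       */
	0, 0, 0, 1, 1, 0, 0, 0,	/* cluster 2: partly used */
	1, 1, 1, 1, 1, 1, 1, 1,	/* cluster 3: full        */
};

static int cluster_is_empty(int c)
{
	for (int i = 0; i < CLUSTER_SZ; i++)
		if (swap_map[c * CLUSTER_SZ + i])
			return 0;
	return 1;
}

/* Take a contiguous run of free slots starting at 'base', bailing out at
 * the first used slot.  Returns the run length actually allocated. */
static int take_run(int base, int want)
{
	int got = 0;

	while (got < want && base + got < NR_SLOTS && !swap_map[base + got])
		swap_map[base + got++] = 1;
	return got;
}

/* Prefer a whole empty cluster, then fall back to a partial run. */
static int toy_alloc(int nr_pages, int *start)
{
	for (int c = 0; c < NR_CLUSTERS; c++) {
		if (nr_pages >= CLUSTER_SZ && cluster_is_empty(c)) {
			*start = c * CLUSTER_SZ;
			return take_run(*start, nr_pages);
		}
	}
	/* No empty cluster: scan for any free slot and take what we can. */
	for (int i = 0; i < NR_SLOTS; i++) {
		if (!swap_map[i]) {
			*start = i;
			return take_run(i, nr_pages);
		}
	}
	return 0;
}

int main(void)
{
	int start, got;

	got = toy_alloc(8, &start);	/* takes the empty cluster 1 */
	printf("first request:  %d slots at %d\n", got, start);

	/* No empty cluster is left now: the allocator bails out at the first
	 * used slot and returns a shorter contiguous run instead of nothing. */
	got = toy_alloc(8, &start);
	printf("second request: %d slots at %d\n", got, start);
	return 0;
}
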
> >
> > >> And with the current design of the swap space management, it is quite
> > >> hard to implement allocating nr_pages continuous free swap slots. To
> > >> reduce contention on sis->lock, even when scanning for one free swap
> > >> slot, sis->lock is unlocked during scanning. When we scan for nr_pages
> > >> free swap slots and there are no nr_pages continuous free swap slots,
> > >> we need to scan from sis->lowest_bit to sis->highest_bit and record the
> > >> largest run of continuous free swap slots. But when we lock sis->lock
> > >> again to check, some swap slot inside the largest run we found may have
> > >> been allocated by another process. So we may end up with a much smaller
> > >> number of swap slots, or we need to start over again. So I think the
> > >> simpler solution is to
> > >>
> > >> - When a whole cluster is requested (for the THP), try to allocate a
> > >>   free cluster. Give up if there are no free clusters.
> > >
> > > One thing I'm afraid of is that it would consume free clusters very fast
> > > if the pages adjacent to a faulted one don't have the same hotness/
> > > lifetime. Once that happens, we can't get the benefit any more.
> > > IMO, it's too conservative and might be worse from the fragmentation
> > > point of view.
> >
> > It is possible. But I think we should start from the simple solution
> > first, instead of jumping to the perfect solution directly, especially
> > when the simple solution is a subset of the perfect solution.
> > Do you agree?
>
> If the simple solution works well and it's hard to prove it's worse than
> as-is, I agree. But my concern is that it would consume free clusters so
> fast that it could badly affect other workloads.
>
> >
> > There are some other difficulties if we don't use a swap cluster to hold
> > the swapped-out THP for full THP swap support (without splitting).
> >
> > The THP could be mapped by both PMDs and PTEs. After the THP is swapped
> > out, there may be swap entries in both PMDs and PTEs too. If a non-head
> > PTE is accessed, how do we know where the first swap slot for the THP is,
> > so that we can swap in the whole THP?
>
> You mean you want to swap in 2M of pages all at once? Hmm, I'm not sure
> it's a good idea. We don't have any evidence that the 512 pages have the
> same temporal locality. They were just in LRU order due to the split
> implementation, not temporal locality. The one thing we can bet on is that
> no process sharing the THP touched any subpage of the 512 pages, so it's
> really *cold*. For such a cold 512-page swap-in, I am really not sure.
>
> > We can have a flag in cluster_info->flag to mark whether the swap
> > cluster is backing a THP. So swap-in readahead can avoid reading ahead
> > into the THP, or it can read ahead the whole THP instead of just several
> > sub-pages of it.
> >
> > And if we use one swap cluster for each THP, we can use cluster_info->data
> > to hold the compound map number. That is very convenient.
>
> Huang,
>
> If you think my points are enough valid, just continue your work If you don't think

Oops, typo.
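
For reference, the cluster_info->flag / cluster_info->data idea quoted above
might look roughly like the userspace sketch below. The field widths, the
flag value, and all names here are guesses for illustration only, not the
kernel's actual swap_cluster_info definition.

#include <stdio.h>

#define TOY_CLUSTER_HUGE	0x4	/* made-up flag: cluster backs a THP */

struct toy_cluster_info {
	unsigned int data  : 24;	/* e.g. compound map count for a THP cluster */
	unsigned int flags : 8;
};

static int cluster_is_huge(struct toy_cluster_info *ci)
{
	return !!(ci->flags & TOY_CLUSTER_HUGE);
}

int main(void)
{
	struct toy_cluster_info ci = { 0, 0 };

	/* Swapping out a THP into this cluster: mark it and record the count. */
	ci.flags |= TOY_CLUSTER_HUGE;
	ci.data = 1;	/* hypothetical compound map count */

	/* Swap-in readahead could check the flag to either skip the THP's
	 * cluster or read the whole THP back in one go. */
	printf("huge=%d, data=%u\n", cluster_is_huge(&ci), ci.data);
	return 0;
}
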