On Tue, Jun 02, 2020 at 11:34:17AM +0900, Joonsoo Kim wrote: > 2020년 6월 2일 (화) 오전 12:56, Johannes Weiner <hannes@xxxxxxxxxxx>님이 작성: > > On Mon, Jun 01, 2020 at 03:14:24PM +0900, Joonsoo Kim wrote: > > > But, I still think that modified refault activation equation isn't > > > safe. The next > > > problem I found is related to the scan ratio limit patch ("limit the range of > > > LRU type balancing") on this series. See the below example. > > > > > > anon: Hot (X M) > > > file: Hot (200 M) / dummy (200 M) > > > P: 1200 M (3 parts, each one 400 M, P1, P2, P3) > > > Access Pattern: A -> F(H) -> P1 -> A -> F(H) -> P2 -> ... -> > > > > > > Without this patch, A and F(H) are kept on the memory and look like > > > it's correct. > > > > > > With this patch and below fix, refault equation for Pn would be: > > > > > > Refault dist of Pn = 1200 (from file non-resident) + 1200 * anon scan > > > ratio (from anon non-resident) > > > anon + active file = X + 200 > > > 1200 + 1200 * anon scan ratio (0.5 ~ 2) < X + 200 > > > > That doesn't look quite right to me. The anon part of the refault > > distance is driven by X, so the left-hand of this formula contains X > > as well. > > > > 1000 file (1200M reuse distance, 200M in-core size) + F(H) reactivations + X * scan ratio < X + 1000 > > As I said before, there is no X on left-hand of this formula. To > access all Pn and > re-access P1, we need 1200M file list scan and reclaim. More scan isn't needed. > With your patch "limit the range of LRU type balancing", scan ratio > between file/anon > list is limited to 0.5 ~ 2.0, so, maximum anon scan would be 1200 M * > 2.0, that is, > 2400 M and not bounded by X. That means that file list cannot be > stable with some X. Oh, no X on the left because you're talking about the number of pages scanned until the first refaults, which is fixed - so why are we still interpreting the refault distance against a variable anon size X? Well, that's misleading. We compare against anon because part of the cache is already encoded in the refault distance. What we're really checking is access distance against total amount of available RAM. Consider this. We want to activate pages where access_distance <= RAM and our measure of access distance is: access_distance = refault_distance + inactive_file So the comparison becomes: refault_distance + inactive_file < RAM which we simplify to: refault_distance < active_file + anon There is a certain threshold for X simply because there is a certain threshold for RAM beyond which we need to start activating. X cannot be arbitrary, it must be X + cache filling up memory - after all we have page reclaim evicting pages. Again, this isn't new. In the current code, we activate when: refault_distance < active_file which is access_distance <= RAM - anon You can see, whether things are stable or not always depends on the existing workingset size. It's just a proxy for how much total RAM we have potentially available to the refaulting page. > If my lastly found example is a correct example (your confirm is required), > it is also related to the correctness issue since cold pages causes > eviction of the hot pages repeatedly. I think your example is correct, but it benefits from the VM arbitrarily making an assumption that has a 50/50 shot of being true. You and I know which pages are hot and which are cold because you designed the example. All the VM sees is this: - We have an established workingset that has previously shown an access distance <= RAM and therefor was activated. - We now have another set that also appears to have an access distance <= RAM. The only way to know for sure, however, is sample the established workingset and compare the relative access frequencies. Currently, we just assume the incoming pages are colder. Clearly that's beneficial when it's true. Clearly that can be totally wrong. We must allow a fair comparison between these two sets. For cache, that's already the case - that's why I brought up the cache-only example: if refault distances are 50M and you have 60M of active cache, we activate all refaults and force an even competition between the established workingset and the new pages. Whether we can protect active file when anon needs to shrink first and can't (the activate/iocost split) that's a different question. But I'm no longer so sure after looking into it further. First, we would need two different refault distances: either we consider anon age and need to compare to active_file + anon, or we don't and compare to active_file only. We cannot mix willy nilly, because the metrics wouldn't be comparable. We don't have the space to store two different eviction timestamps, nor could we afford to cut the precision in half. Second, the additional page flag required to implement it. Third, it's somewhat moot because we still have the same behavior when active_file would need to shrink and can't. There can't be a stable state as long as refault distances <= active_file. > In this case, they (without patch, with patch) all have some correctness > issue so we need to judge which one is better in terms of overall impact. > I don't have strong opinion about it so it's up to you to decide the way to go. If my patch was simply changing the default assumption on which pages are hot and which are cold, I would agree with you - the pros would be equal to the cons, one way wouldn't be more correct than the other. But that isn't what my patch is doing. What it does is get rid of the assumption, to actually sample and compare the access frequencies when there isn't enough data to make an informed decision. That's a net improvement.