On Wed, 21 Oct 2015, Vlastimil Babka wrote: > On 10/05/2015 05:01 AM, Hugh Dickins wrote: > > On Fri, 2 Oct 2015, Vlastimil Babka wrote: > > > > > Currently, /proc/pid/smaps will always show "Swap: 0 kB" for shmem-backed > > > mappings, even if the mapped portion does contain pages that were swapped > > > out. > > > This is because unlike private anonymous mappings, shmem does not change > > > pte > > > to swap entry, but pte_none when swapping the page out. In the smaps page > > > walk, such page thus looks like it was never faulted in. > > > > > > This patch changes smaps_pte_entry() to determine the swap status for > > > such > > > pte_none entries for shmem mappings, similarly to how mincore_page() does > > > it. > > > Swapped out pages are thus accounted for. > > > > > > The accounting is arguably still not as precise as for private anonymous > > > mappings, since now we will count also pages that the process in question > > > never > > > accessed, but only another process populated them and then let them > > > become > > > swapped out. I believe it is still less confusing and subtle than not > > > showing > > > any swap usage by shmem mappings at all. Also, swapped out pages only > > > becomee a > > > performance issue for future accesses, and we cannot predict those for > > > neither > > > kind of mapping. > > > > > > Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx> > > > Acked-by: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> > > > > Neither Ack nor Nack from me. > > > > I don't want to stand in the way of this patch, if you and others > > believe that it will help to diagnose problems in the field better > > than what's shown at present; but to me it looks dangerously like > > replacing no information by wrong information. > > > > As you acknowledge in the commit message, if a file of 100 pages > > were copied to tmpfs, and 100 tasks map its full extent, but they > > all mess around with the first 50 pages and take no interest in > > the last 50, then it's quite likely that that last 50 will get > > swapped out; then with your patch, 100 tasks are each shown as > > using 50 pages of swap, when none of them are actually using any. > > Yeah, but isn't it the same with private memory which was swapped out at some > point and we don't know if it will be touched or not? The > difference is in private case we know the process touched it at least > once, but that can also mean nothing for the future (or maybe it just > mmapped with MAP_POPULATE and didn't care about half of it). I see that as quite different myself; but agree that neither way predicts the future. Now, if you can make a patch to predict the future... FWIW, I do seem to be looking at it more from a point of view of how much swap the process is using, whereas you're looking at it more from a point of view of what delays would be incurred in accessing. > > That's basically what I was trying to say in the changelog. I interpret > the Swap: value as the amount of swap-in potential, if the process was > going to access it, which is what the particular customer also expects (see > below). In that case showing zero is IMHO wrong and inconsistent with the > anonymous private mappings. Yes, your changelog is honest about the difference, I don't dispute that. As I said, neither Ack nor Nack from me: I just don't feel in a position to judge whether changing the output of smaps to please this customer is likely to displease another customer or not. > > > It is rather as if we didn't bother to record Rss, and just put > > Size in there instead: you are (for understandable reasons) treating > > the virtual address space as if every page of it had been touched. > > > > But I accept that there may well be a class of processes and problems > > which would be better served by this fiction than the present: I expect > > you have much more experience of helping out in such situations than I. > > Well, the customers driving this change would in the best case want to > see the shmem swap accounted continuously and e.g. see it immediately in the > top output. Fixing (IMHO) the smaps output is the next best thing. The use > case here is a application that really doesn't like page faults, and has > background thread that checks and prefaults such areas when they are expected > to be used soon. So they would like to identify these areas. And I guess I won't be able to sell mlock(2) to you :) Still neither Ack nor Nack from me: while your number is more information (or misinformation) than always 0, it's still not clear to me that it will give them what they need. ... > > > > And for private mappings of tmpfs files? I expected it to show an > > inderminate mixture of the two, but it looks like you treat the private > > mapping just like a shared one, and take no notice of the COWed pages > > out on swap which would have been reported before. Oh, no, I think > > I misread, and you add the two together? I agree that's the easiest > > thing to do, and therefore perhaps the best; but it doesn't fill me > > with conviction that it's the right thing to do. > > Thanks for pointing this out, I totally missed this possibility! Well > the current patch is certainly not the right thing to do, as it can > over-account. The most correct solution would have to be implemented into the > page walk and only check into shmem radix tree for individual pages that were > not COWed. Michal Hocko suggested I try that, and although it does add some > overhead (the complexity is n*log(n) AFAICT), it's not that bad from > preliminary checks. Another advantage is that no new shmem code is needed, as > we can use the generic find_get_entry(). Unless we want to really limit the > extra complexity only to the special private mapping case with non-zero swap > usage of the shmem object etc... I'll repost the series with that approach. > > Other non-perfect solutions that come to mind: > > 1) For private mappings, count only the swapents. "Swap:" is no longer > showing full swap-in potential though. > 2) For private mappings, do not count swapents. Ditto. > 3) Provide two separate counters. The user won't know how much they > overlap, though. > > From these I would be inclined towards 3) as being more universal, although > then it's no longer a simple "we're fixing a Swap: 0 value which is wrong", > but closer to original Jerome's versions, which IIRC introduced several > shmem-specific counters. > > Well at least now I do understand why you don't particularly like this > approach... Have you considered extending mincore(2) for them? It was always intended that more info could be added into its byte array later - the man page I'm looking at says "The settings of the other bits [than the least significant] in each byte are undefined; these bits are reserved for possible later use." That way your customers could get a precise picture of the status of each page: without ambiguity as to whether it's anon, shmem, file, anon swap, shmem swap, whatever; without ambiguity as to where 40kB of 80kB lies in the region, the unused half or the vital half etc. Or forget passing back the info: just offer an madvise(,, MADV_POPULATE)? Hugh -- To unsubscribe from this list: send the line "unsubscribe linux-api" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html