Summary of LSF-MM Volatile Ranges Discussion

John Stultz <john.stultz@xxxxxxxxxx> · Mon, 22 Apr 2013 20:11:39 -0700

Just wanted to send out this quick summary of the Volatile Ranges 
discussion at LSF-MM.

Again, this is my recollection and perspective of the discussion, and 
while I'm trying to also provide Minchan's perspective on some of the 
problems as best I can, there likely may be details that were 
misunderstood, or mis-remembered. So if I've gotten anything wrong, 
please step in and reply to correct me. :)

Prior to the discussion, I sent out some background and discussion plans 
which you can read here:
http://permalink.gmane.org/gmane.linux.kernel.mm/98676

First of all, we quickly reviewed the generalized use cases and proposed 
interfaces:

1) madvise style interface:
	mvrange(start_addr, length, mode, flags, &purged)

2) fadvise/fallocate style interface:
	fvrange(fd, start_off, length, mode, flags, &purged)

Also noting (per the background summary) the desired semantics for 
volatile ranges on files is that the volatility is shared (just like the 
data is), thus we need to store that volatility off of the 
address_space. Thus only one process needs to mark the open file pages 
as volatile for them to be purged.

Whereas with anonymous memory, we really want to store the volatility
off of the mm_struct (in some way), and purge only if all the processes
that map a page consider it volatile.
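
As a rough illustration of the two purge rules above, here is a toy
Python model. It is not kernel code, and every function and parameter
name in it is made up for the example.

```python
# Toy model of the purge rules; all names are hypothetical.

def file_page_purgeable(volatile_in_address_space):
    # File volatility is shared state kept on the address_space, so a
    # single mark by any one process is enough to make the page purgeable.
    return volatile_in_address_space


def anon_page_purgeable(mappers, volatile_by_mm):
    # Anonymous volatility is kept per mm_struct. The page is purgeable
    # only if every mm that maps it currently considers it volatile.
    return bool(mappers) and all(volatile_by_mm.get(mm, False) for mm in mappers)
```

With two mms mapping an anonymous page, a mark from only one of them
leaves the page unpurgeable, while a file page needs just one mark.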

I tried to quickly describe the issue that, as performance is a concern,
we want the action of marking and unmarking of volatile ranges to be as
fast as possible. This is of particular concern to Minchan and his 
ebizzy test case, as taking the mmap_sem hurts performance too much.

However, this strong performance concern causes some complexity in the
madvise style interface, since a volatile range could cross both
anonymous and file pages.

Particularly the question of "What happens if a user calls mvrange()
over MAP_SHARED file pages?". I think we should push that volatility
down into the file volatility, but to do this we have to walk the vmas 
and take the mmap_sem, which hurts Minchan's use case too drastically.

Minchan had earlier proposed having a VOLATILE_ANON | VOLATILE_FILE | 
VOLATILE_BOTH mode flag, where we'd skip traversing the vmas in the 
VOLATILE_ANON case, just adding the range to the process. Whereas for
VOLATILE_FILE or VOLATILE_BOTH we'd do the traversing.

However, there is still the problem of the case where someone marks 
VOLATILE_ANON on mapped file pages. In this case, I'd expect we'd report 
an error, however, in order to detect the error case, we'd have to still 
traverse the vmas (otherwise we can't know if the range covers files or 
not), which again would be too costly. And to me, Minchan's suggestion
of not providing an error in this case seemed a bit too unintuitive for
a public interface.

The morning of the discussion, I realized that instead of thinking of
volatility only in terms of anonymous and file pages, we could think of
volatility as shared or private, much as file mappings are.

This would allow for the same functional behavior of Minchan's 
VOLATILE_ANON vs VOLATILE_FILE modes, but instead we'd have 
VOLATILE_PRIVATE and VOLATILE_SHARED. And only in the VOLATILE_SHARED 
case would we need to traverse the VMAs in order to make sure that any 
file backed pages had the volatility added to their address_space. And 
private volatility on files would then not be considered an error mode, 
so we could avoid having to do the scan to validate the input.
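
A minimal sketch of that split, as a toy Python model rather than
kernel code. The classes, fields and the vma_walks counter are all
hypothetical, and the handling of anonymous vmas in the shared case is
an assumption made only to keep the example complete.

```python
from dataclasses import dataclass, field

VOLATILE_PRIVATE = "private"
VOLATILE_SHARED = "shared"


@dataclass
class AddressSpace:
    ranges: list = field(default_factory=list)   # shared volatile file ranges


@dataclass
class Vma:
    start: int
    end: int                  # exclusive
    mapping: object = None    # None for anonymous memory
    file_offset: int = 0      # file offset that corresponds to 'start'


@dataclass
class Mm:
    vmas: list = field(default_factory=list)
    private_ranges: list = field(default_factory=list)
    vma_walks: int = 0        # counts the expensive traversals


def mvrange(mm, start, length, mode):
    end = start + length
    if mode == VOLATILE_PRIVATE:
        # Fast path: record the range on the mm. No vma walk and no
        # validation, since private volatility on files is not an error.
        mm.private_ranges.append((start, end))
        return
    # Shared path: walk the vmas (where the kernel would take mmap_sem).
    mm.vma_walks += 1
    for vma in mm.vmas:
        lo, hi = max(start, vma.start), min(end, vma.end)
        if lo >= hi:
            continue
        if vma.mapping is None:
            # Assumption: anonymous parts stay on the mm.
            mm.private_ranges.append((lo, hi))
        else:
            off = vma.file_offset + (lo - vma.start)
            vma.mapping.ranges.append((off, off + (hi - lo)))
```

Only the VOLATILE_SHARED call bumps vma_walks, which is the cost the
private mode is meant to avoid.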

Minchan seemed to be in agreement with this concept. Though when I asked 
for reactions from the folks in the room, it seemed to be mostly tepid 
agreement mixed maybe with a bit of confusion.

One issue raised was the concern that keeping the private/anonymous
volatility state separately from the VMAs might cause cases where things
got "out-of-sync". For instance, if a range is marked volatile, then say
some pages are unmapped or a hole is punched in that range and other 
pages are mapped in, what are the semantics of the resulting volatility? 
Is the volatility inherited to future ranges? The example was given of 
mlock, where a range can be locked, but should any new pages be mapped 
into that range, the new pages are not locked. In other words, only the 
pages mapped at that time are affected by the call to mlock.

Stumped by this, I agreed that was a fair critique we hadn't considered,
and that in the current implementation any new mappings in an existing
volatile range would be considered volatile, and that is inconsistent
with existing precedent.

It was pointed out that we could also make sure that on any unmapping or
new mapping we clear the private/anonymous volatility, and that might
keep things in sync while still allowing for the fast non-vma-traversing
calls to mark and unmark volatile ranges. But we'll have to look into
that.
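
One way to picture that suggestion is a helper that drops volatility
from whatever part of the tracked ranges a map or unmap touches. This
is only a sketch in Python over a plain list of (start, end) tuples;
the function name and the list representation are assumptions, not
what the patches use.

```python
def clear_volatility(ranges, start, end):
    """Return 'ranges' with [start, end) removed, splitting where needed.

    Hypothetical hook to run on every unmap or new mapping so that the
    separately tracked volatile ranges cannot outlive the pages they
    were set on (matching the mlock precedent).
    """
    kept = []
    for lo, hi in ranges:
        if hi <= start or lo >= end:
            kept.append((lo, hi))       # no overlap, keep untouched
            continue
        if lo < start:
            kept.append((lo, start))    # piece left of the changed area
        if hi > end:
            kept.append((end, hi))      # piece right of the changed area
    return kept
```

Punching a hole in the middle of one volatile range leaves two smaller
volatile ranges, and anything mapped into the hole later starts out
non-volatile.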

It was also noted that vmas are specifically designed to manage ranges 
of memory, so it seemed maybe a bit duplicative to have a separate tree 
tracking volatile ranges. And again we discussed the performance impact 
of taking the mmap_sem and traversing the vmas, and how avoiding that is 
particularly important to Minchan's use case.

I also noted that one difficulty with the earlier approach that did use
vmas was that for volatile ranges on files (ie: shared volatile
mappings), there is no similar shared vma type structure for files.
Thus it's nice to be able to use the same volatile root structure to
store volatile ranges on both the private per-process (well,
per-mm_struct) and shared per-inode/address_space basis. Otherwise the
code paths for anonymous and file volatility have to be significantly 
different, which would make it more complex to understand and maintain.

At this point, it was asked if the shared-volatility semantics on the
shared mapped file are actually desired, and whether we could instead
keep file volatility in the vmas, purging only if every process that
maps that file agrees that the page is volatile.

The problem with this, as I see it, is that it is inconsistent with the
semantics of shared mapped files. If a file is mapped by multiple
processes, and zeros are written to that file by one process, all the
processes will see this change and they need to coordinate access if
such a change would be problematic. In the case of volatility, when we
purge pages, the kernel is in effect doing this on behalf of the process
that marked the range volatile. It just is a delayed action and can be
canceled (by the process that marked it volatile, or by any other
process with that range mapped).  I reiterated the example of a large
circular buffer in a shared file, which is initialized as entirely
volatile. Then a producer process would mark a region after the head as
non-volatile, then fill it with data. And a consumer process then
consumes data from the tail, and marks those consumed ranges as volatile.
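
The circular buffer example can be modeled in a few lines of Python.
This is a toy with one slot standing in for a range of pages and a
single shared volatility array standing in for state on the
address_space; the class and method names are hypothetical.

```python
class SharedRing:
    """Toy model of a circular buffer in a shared file."""

    def __init__(self, nslots):
        self.nslots = nslots
        self.data = [None] * nslots
        self.volatile = [True] * nslots   # initialized as entirely volatile
        self.head = self.tail = self.count = 0

    def produce(self, item):
        if self.count == self.nslots:
            return False                  # buffer full
        slot = self.head
        self.volatile[slot] = False       # mark non-volatile first...
        self.data[slot] = item            # ...then fill it with data
        self.head = (slot + 1) % self.nslots
        self.count += 1
        return True

    def consume(self):
        if self.count == 0:
            return None                   # buffer empty
        slot = self.tail
        item = self.data[slot]
        self.volatile[slot] = True        # consumed slot is volatile again
        self.tail = (slot + 1) % self.nslots
        self.count -= 1
        return item

    def purge(self):
        """Model memory pressure: drop the contents of volatile slots."""
        purged = 0
        for i in range(self.nslots):
            if self.volatile[i] and self.data[i] is not None:
                self.data[i] = None
                purged += 1
        return purged
```

Because the volatility is shared, a slot becomes purgeable as soon as
the consumer marks it, without the producer having to do anything.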

It was pointed out that the same could maybe be done by both processes
marking the entire range, except what is between the current head and
tail, as volatile each iteration. So while pages wouldn't be truly
volatile right after they were consumed, eventually the producer would 
run (well, hopefully) and update its view of volatility so that it 
agreed with the consumer with respect to those pages.
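
A small Python sketch of that alternative, assuming each process keeps
its own set of volatile slots and a slot is purgeable only when all
views agree. The helper names and the slot-based representation are
made up for illustration.

```python
def volatile_view(nslots, tail, count):
    """Slots one process marks volatile: everything outside the live
    region of 'count' slots starting at 'tail' (wrapping around)."""
    live = {(tail + i) % nslots for i in range(count)}
    return set(range(nslots)) - live


def purgeable(views):
    """With per-process volatility, only slots every process agrees on."""
    return set.intersection(*views)


# Both processes start with slots 0..2 live, so only slot 3 is volatile.
producer = volatile_view(4, tail=0, count=3)
consumer = volatile_view(4, tail=0, count=3)

# The consumer takes slot 0 and refreshes its view; the producer has not
# run yet, so slot 0 is still not purgeable.
consumer = volatile_view(4, tail=1, count=2)
stale = purgeable([producer, consumer])

# Once the producer runs and refreshes its view, slot 0 becomes purgeable.
producer = volatile_view(4, tail=1, count=2)
fresh = purgeable([producer, consumer])
```

The gap between 'stale' and 'fresh' is the window in which consumed
pages are not yet truly volatile.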

I noted that first of all, the shared volatility is needed to match the 
Android ashmem semantics. So there's at least an existing user. And that 
while this method pointed out could be used, I still felt it is fairly 
awkward, and again inconsistent with how shared mapped files normally 
behave. After all, applications could "share" file data by coordinating
such that they all write the same data to their own private mappings,
but that loses much of the usefulness of shared mappings (to be fair, I
didn't have such a sharp example at the time of the discussion, but it's
the same point I rambled around). Thus I feel having shared volatility
for file pages is similarly useful.

It was also asked what the volatility semantics would be for non-mapped
files, given the fvrange() interface could be used there. In that case,
I don't have a strong opinion. If mvrange() can create shared volatile
ranges on mmapped files, I'm fine leaving fvrange() out. There may be an
in-kernel equivalent of fvrange() to make it easier to support Android's
ashmem, but volatility on non-mmapped files doesn't seem like it would
be too useful to me. But I'd probably want to go with what would be 
least surprising to users.

It was hard to gauge the overall reaction in the room at this point. 
There was some assorted nodding by various folks who seemed to be 
following along and positive about the basic approach. There was also
some less positive, confused squinting that had me worried.

With time running low, Minchan reminded me that the shrinker was on the 
to-be-discussed list. Basically earlier versions of my patch used a 
shrinker to trigger range purging, and this was critiqued because 
shrinkers were numa-unaware, and might cause bad behavior where we might 
purge lots of ranges on a node that isn't under any memory pressure if 
one node is under pressure.  However, using normal LRU page eviction 
doesn't work for volatile ranges, as with swapless systems, we don't LRU 
age/evict anonymous memory.
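
The numa critique can be pictured with a toy Python model. It is not
how the shrinker is implemented; the per-node page counts, the function
names and the "walk nodes in order" policy are all assumptions chosen
to show the failure mode.

```python
def purge_numa_unaware(volatile_pages_by_node, nr_to_purge):
    """Purge from one global pool, ignoring which node is under pressure."""
    purged = {}
    for node, pages in volatile_pages_by_node.items():
        take = min(pages, nr_to_purge)
        if take:
            purged[node] = take
            nr_to_purge -= take
    return purged


def purge_numa_aware(volatile_pages_by_node, pressured_node, nr_to_purge):
    """Purge only from the node that is actually under pressure."""
    take = min(volatile_pages_by_node.get(pressured_node, 0), nr_to_purge)
    return {pressured_node: take} if take else {}
```

With 100 volatile pages on node 0 and 50 on node 1, pressure on node 1
makes the unaware version purge from node 0, which frees nothing where
the memory is needed.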

Minchan's patch currently does two approaches, where it can use the 
normal LRU eviction to trigger purging, but it also uses a shrinker to 
force anonymous pages onto a page list which can then be evicted in 
vmscan. This allows purging of anonymous pages when swapless, but also 
allows the normal eviction process to work.

This brought up lots of discussion around what the ideal method would
be. Because the marking and unmarking of pages as volatile has to be
done quickly, we cannot iterate over pages at mark/unmark time creating
a new list. Aging and evicting all anonymous memory on swapless
systems also seems wasteful.

Ideally, I think we'd purge pages from volatile ranges in the global LRU 
eviction order. This would hopefully avoid purging data when we see lots 
of single-use streaming data.

Minchan however seems to feel volatile data should be purged earlier
than other pages, since they're a source of easily free-able memory
(I've also argued for this in the past, but have since changed my mind).
So he'd like a way to purge pages earlier, and unfortunately the
shrinker runs later than he'd like.

It was noted that there are now patches to make the shrinkers numa
aware, so the older complaints might be solvable. But still the issue of
shrinkers having their own eviction logic separate from the global LRU
is less than ideal to me.

It was past time, and there didn't seem to be much consensus or 
resolution on this issue, so we had to leave it there. That said, the 
volatile purging logic is up to the kernel, and can be tweaked as needed 
in the future, whereas the basic interface semantics were more
important to hash out, and I think I got mostly nodding on the majority 
of the interface issues.

Hopefully with the next patch iteration, we'll have things cleaned up a 
bit more and better unified between Minchan's and my approaches so
further details can be concretely worked out on the list. It was also 
requested that a manpage document be provided with the next patch set, 
which I'll make a point to provide.

Thanks so much to Minchan, Kosaki-san, Hugh, Michel, Johannes, Greg, 
Michal, Glauber, and everyone else for providing an active discussion 
and great feedback despite my likely over-caffeinated verbal wanderings.

Thanks again,
-john
