Re: Summary of LSF-MM Volatile Ranges Discussion

Minchan Kim <minchan@xxxxxxxxxx> · Wed, 24 Apr 2013 17:14:13 +0900

Hello Dmitry,

On Tue, Apr 23, 2013 at 10:51:10AM +0400, Dmitry Vyukov wrote:
> On Tue, Apr 23, 2013 at 7:11 AM, John Stultz <john.stultz@xxxxxxxxxx> wrote:
> > Just wanted to send out this quick summary of the Volatile Ranges discussion
> > at LSF-MM.
> >
> > Again, this is my recollection and perspective of the discussion, and while
> > I'm trying to also provide Minchan's perspective on some of the problems as
> > best I can, there likely may be details that were misunderstood, or
> > mis-remembered. So if I've gotten anything wrong, please step in and reply
> > to correct me. :)
> >
> >
> > Prior to the discussion, I sent out some background and discussion plans
> > which you can read here:
> > http://permalink.gmane.org/gmane.linux.kernel.mm/98676
> >
> >
> > First of all, we quickly reviewed the generalized use cases and proposed
> > interfaces:
> >
> > 1) madvise style interface:
> >         mvrange(start_addr, length, mode, flags, &purged)
> >
> > 2) fadvise/fallocate style interface:
> >         fvrange(fd, start_off, length, mode, flags, &purged)
> >
> >
> > Also noting (per the background summary) the desired semantics for volatile
> > ranges on files is that the volatility is shared (just like the data is),
> > thus we need to store that volatility off of the address_space. Thus only
> > one process needs to mark the open file pages as volatile for them to be
> > purged.
> >
> > Where as with anonymous memory, we really want to store the volatility off
> > of the mm_struct (in some way), and only if all the processes that map a
> > page consider it volatile, do purging.
> >
> > I tried to quickly describe the issue that as performance is a concern, we
> > want the action of marking and umarking of volatile ranges to be as fast as
> > possible. This is of particular concern to Minchan and his ebizzy test case,
> > as taking the mmap_sem hurts performance too much.
> >
> > However, this strong performance concern causes some complexity in the
> > madvise style interface, as since a volatile range could cross both
> > anonymous and file pages.
> >
> > Particularly the question of "What happens if a user calls mvrange() over
> > MMAP_SHARED file pages?". I think we should push that volatility down into
> > the file volatility, but to do this we have to walk the vmas and take the
> > mmap_sem, which hurts Minchan's use case too drastically.
> >
> > Minchan had earlier proposed having a VOLATILE_ANON | VOLATILE_FILE |
> > VOLATILE_BOTH mode flag, where we'd skip traversing the vmas in the
> > VOLATILE_ANON case, just adding the range to the process. Where as
> > VOLATILE_FILE or VOLATILE_BOTH we'd do the traversing.
> >
> > However, there is still the problem of the case where someone marks
> > VOLATILE_ANON on mapped file pages. In this case, I'd expect we'd report an
> > error, however, in order to detect the error case, we'd have to still
> > traverse the vmas (otherwise we can't know if the range covers files or
> > not), which again would be too costly. And to me, Minchan's suggestion of
> > not providing an error on this case, seemed a bit too unintuitive for a
> > public interface.
> >
> > The morning of the discussion, I realized we could instead of thinking of
> > volatility only on anonymous and file pages, we could instead think of
> > volatility as shared or private, much as file mappings are.
> >
> > This would allow for the same functional behavior of Minchan's VOLATILE_ANON
> > vs VOLATILE_FILE modes, but instead we'd have VOLATILE_PRIVATE and
> > VOLATILE_SHARED. And only in the VOLATILE_SHARED case would we need to
> > traverse the VMAs in order to make sure that any file backed pages had the
> > volatility added to their address_space. And private volatility on files
> > would then not be considered an error mode, so we could avoid having to do
> > the scan to validate the input.
> >
> > Minchan seemed to be in agreement with this concept. Though when I asked for
> > reactions from the folks in the room, it seemed to be mostly tepid agreement
> > mixed maybe with a bit of confusion.
> >
> > One issue raised was the concern that by keeping the private/anonymous
> > volatility state separately from the VMAs might cause cases where things got
> > "out-of-sync". For instance, if a range is marked volatile, then say some
> > pages are unmapped or a hole is punched in that range and other pages are
> > mapped in, what are the semantics of the resulting volatility? Is the
> > volatility inherited to future ranges? The example was given of mlock, where
> > a range can be locked, but should any new pages be mapped into that range,
> > the new pages are not locked. In other words, only the pages mapped at that
> > time are affected by the call to mlock.
> >
> > Stumped by this, I agreed that was a fair critique we hadn't considered, and
> > that the in current implementation any new mappings in an existing volatile
> > range would be considered volatile, and that is inconsistent with existing
> > precedent.
> >
> > It was pointed out that we could also make sure that on any unmapping or new
> > mapping that we clear the private/anonymous volatility, and that might keep
> > things in sync. and still allowing for the fast non-vma traversing calls to
> > mark and unmark voltile ranges. But we'll have to look into that.
> >
> > It was also noted that vmas are specifically designed to manage ranges of
> > memory, so it seemed maybe a bit duplicative to have a separate tree
> > tracking volatile ranges. And again we discussed the performance impact of
> > taking the mmap_sem and traversing the vmas, and how avoiding that is
> > particularly important to Minchan's use case.
> >
> > I also noted that one difficulty with the earlier approach that did use vmas
> > was that for volatile ranges on files (ie: shared volatile mappings), there
> > are no similar shared vma type structure for files. Thus its nice to be able
> > to use the same volatile root structure to store volatile ranges on both the
> > private per-process(well, per-mm_struct) and shared per-inode/address_space
> > basis. Otherwise the code paths for anonymous and file volatility have to be
> > significantly different, which would make it more complex to understand and
> > maintain.
> >
> > At this point, it was asked if the shared-volatility semantics on the shared
> > mapped file is actually desired. And if instead we could keep file
> > volatility in the vmas, only purging should every process that maps that
> > file agree that the page is volatile.
> >
> > The problem with this, as I see it is that it is inconsistent with the
> > semantics of shared mapped files. If a file is mapped by multiple processes,
> > and zeros are written to that file by one processes, all the processes will
> > see this change and they need to coordinate access if such a change would be
> > problematic. In the case of volatility, when we purge pages, the kernel is
> > in-effect doing this on-behalf of the process that marked the range
> > volatile. It just is a delayed action and can be canceled (by the process
> > that marks it volatile, or by any other process with that range mapped).  I
> > re-iterated the example of a large circular buffer in a shared file, which
> > is initialized as entirely volatile. Then a producer process would mark a
> > region after the head as non-volatile, then fill it with data. And a
> > consumer process, then consumes data from the tail, and mark those consumed
> > ranges as volatile.
> >
> > It was pointed out that the same could maybe be done by both processes
> > marking the entire range, except what is between the current head and tail
> > as volatile each iteration. So while pages wouldn't be truly volatile right
> > after they were consumed, eventually the producer would run (well,
> > hopefully) and update its view of volatility so that it agreed with the
> > consumer with respect to those pages.
> >
> > I noted that first of all, the shared volatility is needed to match the
> > Android ashmem semantics. So there's at least an existing user. And that
> > while this method pointed out could be used, I still felt it is fairly
> > awkward, and again inconsistent with how shared mapped files normally
> > behave. After all, applications could "share" file data by coordinating such
> > that they all writing the same data to their own private mapping, but that
> > loses much of the usefulness of shared mappings (to be fair, I didn't have
> > such a sharp example at the time of the discussion, but its the same point I
> > rambled around). Thus I feel having shared volatility for file pages is
> > similarly useful.
> >
> > It was also asked about the volatility semantics would be for non-mapped
> > files, given the fvrange() interface could be used there. In that case, I
> > don't have a strong opinion. If mvrange can create shared volatile ranges on
> > mmaped files, I'm fine leaving fvrange() out. There may be an in-kerenl
> > equivalent of fvrange() to make it easier to support Android's ashmem, but
> > volatility on non-mmapped files doesn't seem like it would be too useful to
> > me. But I'd probably want to go with what would be least surprising to
> > users.
> >
> > It was hard to gauge the overall reaction in the room at this point. There
> > was some assorted nodding by various folks who seemed to be following along
> > and positive of the basic approach. There were also some less positive
> > confused squinting that had me worried.
> >
> > With time running low, Minchan reminded me that the shrinker was on the
> > to-be-discussed list. Basically earlier versions of my patch used a shrinker
> > to trigger range purging, and this was critiqued because shrinkers were
> > numa-unaware, and might cause bad behavior where we might purge lots of
> > ranges on a node that isn't under any memory pressure if one node is under
> > pressure.  However, using normal LRU page eviction doesn't work for volatile
> > ranges, as with swapless systems, we don't LRU age/evict anonymous memory.
> >
> > Minchan's patch currently does two approaches, where it can use the normal
> > LRU eviction to trigger purging, but it also uses a shrinker to force
> > anonymous pages onto a page list which can then be evicted in vmscan. This
> > allows purging of anonymous pages when swapless, but also allows the normal
> > eviction process to work.
> >
> > This brought up lots of discussion around what the ideal method would be.
> > Since because the marking and unmarking of pages as volatile has to be done
> > quickly, so we cannot iterate over pages at mark/unmark time creating a new
> > list. Aging and evicting all anonymous memory on swapless systems also seems
> > wasteful.
> >
> > Ideally, I think we'd purge pages from volatile ranges in the global LRU
> > eviction order. This would hopefully avoid purging data when we see lots of
> > single-use streaming data.
> >
> > Minchan however seems to feel volatile data should be purged earlier then
> > other pages, since they're a source of easily free-able memory (I've also
> > argued for this in the past, but have since changed my mind). So he'd like a
> > way to pruge pages earlier, and unfortunately the shrinker runs later then
> > he'd like.
> >
> > It was noted that there are now patches to make the shrinkers numa aware, so
> > the older complains might be solvable. But still the issue of shrinkers
> > having their own eviction logic separate from the global LRU is less then
> > ideal to me.
> >
> > It was past time, and there didn't seem to be much consensus or resolution
> > on this issue, so we had to leave it there. That said, the volatile purging
> > logic is up to the kernel, and can be tweaked as needed in the future, where
> > as the basic interface semantics were more important to hash out, and I
> > think I got mostly nodding on the majority of the interface issues.
> >
> > Hopefully with the next patch iteration, we'll have things cleaned up a bit
> > more and better unified between Minchn's and my approaches so further
> > details can be concretely worked out on the list. It was also requested that
> > a manpage document be provided with the next patch set, which I'll make a
> > point to provide.
> >
> > Thanks so much to Minchan, Kosaki-san, Hugh, Michel, Johannes, Greg, Michal,
> > Glauber, and everyone else for providing an active discussion and great
> > feedback despite my likely over-caffeinated verbal wanderings.
> 
> 
> Hi,
> 
> Just want to make sure our case does not fall out of the discussion:
> https://code.google.com/p/thread-sanitizer/wiki/VolatileRanges
> 
> While reading your email, I remembered that we actually have some
> pages mapped from a file inside the range. So it's like 70TB of ANON
> mapping + few pages in the middle mapped from FILE. The file is mapped
> with MAP_PRIVATE + PROT_READ, it's read-only and not shared.
> But we want to mark the volatile range only once on startup, so
> performance is not a serious concern (while the function in executed
> in say no more than 10ms).
> If the mixed ANON+FILE ranges becomes a serious problem, we are ready
> to remove FILE mappings, because it's only an optimization. I.e. we
> can make it pure ANON mapping.

As I mentioned by private mail, there are no issue to support your requirement.
What we need is just voice of customer and you are giving the voice now. :)
So no problem, IMO.

> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>