Hello Dmitry,

On Tue, Apr 23, 2013 at 10:51:10AM +0400, Dmitry Vyukov wrote:
> On Tue, Apr 23, 2013 at 7:11 AM, John Stultz <john.stultz@xxxxxxxxxx> wrote:
> > Just wanted to send out this quick summary of the Volatile Ranges discussion
> > at LSF-MM.
> >
> > Again, this is my recollection and perspective of the discussion, and while
> > I'm trying to also provide Minchan's perspective on some of the problems as
> > best I can, there may well be details that were misunderstood or
> > mis-remembered. So if I've gotten anything wrong, please step in and reply
> > to correct me. :)
> >
> >
> > Prior to the discussion, I sent out some background and discussion plans
> > which you can read here:
> > http://permalink.gmane.org/gmane.linux.kernel.mm/98676
> >
> >
> > First of all, we quickly reviewed the generalized use cases and proposed
> > interfaces:
> >
> > 1) madvise style interface:
> >         mvrange(start_addr, length, mode, flags, &purged)
> >
> > 2) fadvise/fallocate style interface:
> >         fvrange(fd, start_off, length, mode, flags, &purged)
> >
> >
> > Also noting (per the background summary) the desired semantics for volatile
> > ranges on files is that the volatility is shared (just like the data is),
> > thus we need to store that volatility off of the address_space. Thus only
> > one process needs to mark the open file pages as volatile for them to be
> > purged.
> >
> > Whereas with anonymous memory, we really want to store the volatility off
> > of the mm_struct (in some way), and only do purging if all the processes
> > that map a page consider it volatile.
> >
> > I tried to quickly describe the issue that, as performance is a concern, we
> > want the action of marking and unmarking volatile ranges to be as fast as
> > possible. This is of particular concern to Minchan and his ebizzy test case,
> > as taking the mmap_sem hurts performance too much.
> >
> > However, this strong performance concern causes some complexity in the
> > madvise style interface, since a volatile range could cross both
> > anonymous and file pages.
> >
> > Particularly the question of "What happens if a user calls mvrange() over
> > MAP_SHARED file pages?". I think we should push that volatility down into
> > the file volatility, but to do this we have to walk the vmas and take the
> > mmap_sem, which hurts Minchan's use case too drastically.
> >
> > Minchan had earlier proposed having a VOLATILE_ANON | VOLATILE_FILE |
> > VOLATILE_BOTH mode flag, where we'd skip traversing the vmas in the
> > VOLATILE_ANON case, just adding the range to the process. Whereas for
> > VOLATILE_FILE or VOLATILE_BOTH we'd do the traversing.
> >
> > However, there is still the problem of the case where someone marks
> > VOLATILE_ANON on mapped file pages. In this case, I'd expect we'd report an
> > error. However, in order to detect the error case, we'd still have to
> > traverse the vmas (otherwise we can't know if the range covers files or
> > not), which again would be too costly. And to me, Minchan's suggestion of
> > not providing an error in this case seemed a bit too unintuitive for a
> > public interface.
> >
> > The morning of the discussion, I realized that instead of thinking of
> > volatility only in terms of anonymous and file pages, we could think of
> > volatility as shared or private, much as file mappings are.
> >
> > This would allow for the same functional behavior as Minchan's VOLATILE_ANON
> > vs VOLATILE_FILE modes, but instead we'd have VOLATILE_PRIVATE and
> > VOLATILE_SHARED. And only in the VOLATILE_SHARED case would we need to
> > traverse the VMAs in order to make sure that any file backed pages had the
> > volatility added to their address_space. And private volatility on files
> > would then not be considered an error mode, so we could avoid having to do
> > the scan to validate the input.
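
To make the private/shared split above concrete, here is a toy user-space model of it. This is only a sketch and not code from either patch set: mark_volatile(), struct mm, struct vma, the vma_walks counter and the shared_volatile flag are all hypothetical names, and the flag on the vma merely stands in for state that would really live on the file's address_space. It also assumes the range is recorded on the per-mm list in both modes, and it flags a whole overlapping vma rather than just the overlapping pages.

```c
/* Toy model only; every name here is hypothetical, not kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum vol_mode { VOLATILE_PRIVATE, VOLATILE_SHARED };

struct vma {
    size_t start, end;      /* covers [start, end) */
    bool file_backed;
    bool shared_volatile;   /* stand-in for state kept on the address_space */
};

struct range { size_t start, end; };

#define MAX_RANGES 16

struct mm {
    struct vma *vmas;
    size_t nr_vmas;
    struct range private_ranges[MAX_RANGES];  /* per-mm volatile ranges */
    size_t nr_private;
    size_t vma_walks;       /* how many times we paid for a vma traversal */
};

/* Returns 0 on success, -1 if the per-mm range table is full. */
int mark_volatile(struct mm *mm, size_t start, size_t len, enum vol_mode mode)
{
    size_t end = start + len;

    assert(mm != NULL);
    if (mm->nr_private == MAX_RANGES)
        return -1;

    if (mode == VOLATILE_SHARED) {
        /* Slow path: only here do we walk the vmas (mmap_sem in the kernel). */
        mm->vma_walks++;
        for (size_t i = 0; i < mm->nr_vmas; i++) {
            struct vma *v = &mm->vmas[i];
            if (v->file_backed && v->start < end && start < v->end)
                v->shared_volatile = true;
        }
    }

    /* Fast path for both modes: just remember the range on the mm.
     * VOLATILE_PRIVATE over file pages is accepted, so no scan is needed
     * to validate the input. */
    mm->private_ranges[mm->nr_private++] = (struct range){ start, end };
    return 0;
}
```

The point of the model is that a VOLATILE_PRIVATE call never touches the vma list, whatever the range happens to cover, while a VOLATILE_SHARED call pays for exactly one traversal.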
> >
> > Minchan seemed to be in agreement with this concept. Though when I asked for
> > reactions from the folks in the room, it seemed to be mostly tepid agreement
> > mixed maybe with a bit of confusion.
> >
> > One issue raised was the concern that keeping the private/anonymous
> > volatility state separately from the VMAs might cause cases where things got
> > "out-of-sync". For instance, if a range is marked volatile, then say some
> > pages are unmapped or a hole is punched in that range and other pages are
> > mapped in, what are the semantics of the resulting volatility? Is the
> > volatility inherited by future ranges? The example was given of mlock, where
> > a range can be locked, but should any new pages be mapped into that range,
> > the new pages are not locked. In other words, only the pages mapped at that
> > time are affected by the call to mlock.
> >
> > Stumped by this, I agreed that was a fair critique we hadn't considered, and
> > that in the current implementation any new mappings in an existing volatile
> > range would be considered volatile, and that is inconsistent with existing
> > precedent.
> >
> > It was pointed out that we could also make sure that on any unmapping or new
> > mapping we clear the private/anonymous volatility, and that might keep
> > things in sync while still allowing for the fast non-vma-traversing calls to
> > mark and unmark volatile ranges. But we'll have to look into that.
> >
> > It was also noted that vmas are specifically designed to manage ranges of
> > memory, so it seemed maybe a bit duplicative to have a separate tree
> > tracking volatile ranges. And again we discussed the performance impact of
> > taking the mmap_sem and traversing the vmas, and how avoiding that is
> > particularly important to Minchan's use case.
> >
> > I also noted that one difficulty with the earlier approach that did use vmas
> > was that for volatile ranges on files (ie: shared volatile mappings), there
> > is no similar shared vma-type structure for files. Thus it's nice to be able
> > to use the same volatile root structure to store volatile ranges on both the
> > private per-process (well, per-mm_struct) and shared per-inode/address_space
> > basis. Otherwise the code paths for anonymous and file volatility have to be
> > significantly different, which would make it more complex to understand and
> > maintain.
> >
> > At this point, it was asked if the shared-volatility semantics on the shared
> > mapped file is actually desired. And if instead we could keep file
> > volatility in the vmas, only purging should every process that maps that
> > file agree that the page is volatile.
> >
> > The problem with this, as I see it, is that it is inconsistent with the
> > semantics of shared mapped files. If a file is mapped by multiple processes,
> > and zeros are written to that file by one process, all the processes will
> > see this change and they need to coordinate access if such a change would be
> > problematic. In the case of volatility, when we purge pages, the kernel is
> > in effect doing this on behalf of the process that marked the range
> > volatile. It just is a delayed action and can be canceled (by the process
> > that marks it volatile, or by any other process with that range mapped). I
> > re-iterated the example of a large circular buffer in a shared file, which
> > is initialized as entirely volatile. Then a producer process would mark a
> > region after the head as non-volatile, then fill it with data. And a
> > consumer process then consumes data from the tail, and marks those consumed
> > ranges as volatile.
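
The circular-buffer example above can be modelled in a few lines of user-space C. This is a sketch under stated assumptions, not the proposed interface: since mvrange()/fvrange() are only proposals, a per-slot is_volatile flag stands in for "mark volatile / mark non-volatile", and ring_purge() stands in for the kernel purging volatile pages under memory pressure. All names (struct ring, ring_produce(), ring_consume(), ring_purge()) are hypothetical.

```c
/* Toy model of shared volatility on a ring buffer; all names hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SLOTS 8
#define SLOT_SIZE 16

struct ring {
    char data[SLOTS][SLOT_SIZE];
    bool is_volatile[SLOTS];  /* one shared flag per slot, seen by all mappers */
    size_t head;              /* next slot the producer fills */
    size_t tail;              /* next slot the consumer drains */
    size_t count;             /* slots currently holding unread data */
};

/* The whole buffer starts out volatile, as in the example. */
void ring_init(struct ring *r)
{
    assert(r != NULL);
    memset(r, 0, sizeof(*r));
    for (size_t i = 0; i < SLOTS; i++)
        r->is_volatile[i] = true;
}

/* Producer: mark the slot after the head non-volatile FIRST, then fill it. */
bool ring_produce(struct ring *r, const char *msg)
{
    if (r->count == SLOTS)
        return false;
    r->is_volatile[r->head] = false;
    strncpy(r->data[r->head], msg, SLOT_SIZE - 1);
    r->data[r->head][SLOT_SIZE - 1] = '\0';
    r->head = (r->head + 1) % SLOTS;
    r->count++;
    return true;
}

/* Consumer: copy the data out of the tail slot, then mark it volatile. */
bool ring_consume(struct ring *r, char *out)
{
    if (r->count == 0)
        return false;
    memcpy(out, r->data[r->tail], SLOT_SIZE);
    r->is_volatile[r->tail] = true;
    r->tail = (r->tail + 1) % SLOTS;
    r->count--;
    return true;
}

/* Stand-in for the kernel: zero every volatile slot, return how many. */
size_t ring_purge(struct ring *r)
{
    size_t purged = 0;
    for (size_t i = 0; i < SLOTS; i++) {
        if (r->is_volatile[i]) {
            memset(r->data[i], 0, SLOT_SIZE);
            purged++;
        }
    }
    return purged;
}
```

Because the flag is shared, the consumer alone can hand a slot back for purging and the producer alone can reclaim it; neither has to wait for the other to update its own view of the range.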
> >
> > It was pointed out that the same could maybe be done by both processes
> > marking the entire range, except what is between the current head and tail,
> > as volatile each iteration. So while pages wouldn't be truly volatile right
> > after they were consumed, eventually the producer would run (well,
> > hopefully) and update its view of volatility so that it agreed with the
> > consumer with respect to those pages.
> >
> > I noted that first of all, the shared volatility is needed to match the
> > Android ashmem semantics. So there's at least an existing user. And that
> > while the method pointed out could be used, I still felt it is fairly
> > awkward, and again inconsistent with how shared mapped files normally
> > behave. After all, applications could "share" file data by coordinating such
> > that they all write the same data to their own private mapping, but that
> > loses much of the usefulness of shared mappings (to be fair, I didn't have
> > such a sharp example at the time of the discussion, but it's the same point I
> > rambled around). Thus I feel having shared volatility for file pages is
> > similarly useful.
> >
> > It was also asked what the volatility semantics would be for non-mapped
> > files, given the fvrange() interface could be used there. In that case, I
> > don't have a strong opinion. If mvrange can create shared volatile ranges on
> > mmapped files, I'm fine leaving fvrange() out. There may be an in-kernel
> > equivalent of fvrange() to make it easier to support Android's ashmem, but
> > volatility on non-mmapped files doesn't seem like it would be too useful to
> > me. But I'd probably want to go with what would be least surprising to
> > users.
> >
> > It was hard to gauge the overall reaction in the room at this point. There
> > was some assorted nodding by various folks who seemed to be following along
> > and positive about the basic approach. There was also some less positive,
> > confused squinting that had me worried.
> >
> > With time running low, Minchan reminded me that the shrinker was on the
> > to-be-discussed list. Basically earlier versions of my patch used a shrinker
> > to trigger range purging, and this was critiqued because shrinkers were
> > numa-unaware, and might cause bad behavior where we might purge lots of
> > ranges on a node that isn't under any memory pressure if one node is under
> > pressure. However, using normal LRU page eviction doesn't work for volatile
> > ranges, as with swapless systems, we don't LRU age/evict anonymous memory.
> >
> > Minchan's patch currently takes two approaches, where it can use the normal
> > LRU eviction to trigger purging, but it also uses a shrinker to force
> > anonymous pages onto a page list which can then be evicted in vmscan. This
> > allows purging of anonymous pages when swapless, but also allows the normal
> > eviction process to work.
> >
> > This brought up lots of discussion around what the ideal method would be.
> > Because the marking and unmarking of pages as volatile has to be done
> > quickly, we cannot iterate over pages at mark/unmark time creating a new
> > list. Aging and evicting all anonymous memory on swapless systems also seems
> > wasteful.
> >
> > Ideally, I think we'd purge pages from volatile ranges in the global LRU
> > eviction order. This would hopefully avoid purging data when we see lots of
> > single-use streaming data.
> >
> > Minchan however seems to feel volatile data should be purged earlier than
> > other pages, since they're a source of easily free-able memory (I've also
> > argued for this in the past, but have since changed my mind). So he'd like a
> > way to purge pages earlier, and unfortunately the shrinker runs later than
> > he'd like.
> >
> > It was noted that there are now patches to make the shrinkers numa aware, so
> > the older complaints might be solvable.
> > But still the issue of shrinkers having their own eviction logic separate
> > from the global LRU is less than ideal to me.
> >
> > It was past time, and there didn't seem to be much consensus or resolution
> > on this issue, so we had to leave it there. That said, the volatile purging
> > logic is up to the kernel, and can be tweaked as needed in the future,
> > whereas the basic interface semantics were more important to hash out, and I
> > think I got mostly nodding on the majority of the interface issues.
> >
> > Hopefully with the next patch iteration, we'll have things cleaned up a bit
> > more and better unified between Minchan's and my approaches so further
> > details can be concretely worked out on the list. It was also requested that
> > a manpage document be provided with the next patch set, which I'll make a
> > point to provide.
> >
> > Thanks so much to Minchan, Kosaki-san, Hugh, Michel, Johannes, Greg, Michal,
> > Glauber, and everyone else for providing an active discussion and great
> > feedback despite my likely over-caffeinated verbal wanderings.
>
> Hi,
>
> Just want to make sure our case does not fall out of the discussion:
> https://code.google.com/p/thread-sanitizer/wiki/VolatileRanges
>
> While reading your email, I remembered that we actually have some
> pages mapped from a file inside the range. So it's like 70TB of ANON
> mapping + a few pages in the middle mapped from FILE. The file is mapped
> with MAP_PRIVATE + PROT_READ, it's read-only and not shared.
> But we want to mark the volatile range only once on startup, so
> performance is not a serious concern (as long as the function executes
> in, say, no more than 10ms).
> If the mixed ANON+FILE ranges become a serious problem, we are ready
> to remove the FILE mappings, because they're only an optimization. I.e. we
> can make it a pure ANON mapping.

As I mentioned by private mail, there is no issue with supporting your requirement.
What we need is just the voice of the customer, and you are giving that voice now. :)
So no problem, IMO.

>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxx. For more info on Linux MM,
> see: http://www.linux-mm.org/ .

-- 
Kind regards,
Minchan Kim