[PATCH 0/3] Volatile Ranges (v7) & Lots of words

John Stultz <john.stultz@xxxxxxxxxx> · Fri, 28 Sep 2012 23:16:30 -0400

After Kernel Summit and Plumbers, I wanted to consider all the various
side-discussions and try to summarize my current thoughts here along
with sending out my current implementation for review.

Also: I'm going on four weeks of paternity leave in the very near
(but non-deterministic) future. So while I hope I still have time
for some discussion, I may have to deal with fussier complaints
then yours. :)  In any case, you'll have more time to chew on
the idea and come up with amazing suggestions. :)

General Interface semantics:
----------------------------------------------

The high level interface I've been pushing has so far stayed fairly
consistent:

Application marks a range of data as volatile. Volatile data may
be purged at any time. Accessing volatile data is undefined, so
applications should not do so. If the application wants to access
data in a volatile range, it should mark it as non-volatile. If any
of the pages in the range being marked non-volatile had been purged,
the kernel will return an error, notifying the application that the
data was lost.

But one interesting new tweak on this design, suggested by the Taras
Glek and others at Mozilla, is as follows:

Instead of leaving volatile data access as being undefined , when
accessing volatile data, either the data expected will be returned
if it has not been purged, or the application will get a SIGBUS when
it accesses volatile data that has been purged.

Everything else remains the same (error on marking non-volatile
if data was purged, etc). This model allows applications to avoid
having to unmark volatile data when it wants to access it, then
immediately re-mark it as volatile when its done. It is in effect
"lazy" with its marking, allowing the kernel to hit it with a signal
when it gets unlucky and touches purged data. From the signal handler,
the application can note the address it faulted on, unmark the range,
and regenerate the needed data before returning to execution.

Since this approach avoids the more explicit unmark/access/mark
pattern, it avoids the extra overhead required to ensure data is
non-volatile before being accessed.

However, If applications don't want to deal with handling the
sigbus, they can use the more straightforward (but more costly)
unmark/access/mark pattern in the same way as my earlier proposals.

This allows folks to balance the cost vs complexity in their
application appropriately.

So that's a general overview of how the idea I'm proposing could
be used.

Specific Interface semantics:
---------------------------------------------

Here are some of the open question about how the user interface
should look:

fadvise vs fallocate:

	So while originally I used fadvise, currently my
	implementation uses fallocate(fd, FALLOC_FL_MARK_VOLATILE,
	start, len) to mark a range as volatile and fallocate(fd,
	FALLOC_FL_UNMARK_VOLATILE, start, len) to unmark ranges.

	During kernel summit, the question was brought up if fallocate
	was really the right interface to be using, and if fadvise
	would be better. To me fadvise makes a little more sense,
	but earlier it was pointed out that marking data ranges as
	volatile could also be seen as a type of cancellable and lazy
	hole-punching, so from that perspective fallocate might make
	more sense.  This is still an open question and I'd appreciate
	further input here.

tmpfs vs non-shmem filesystems:
	Android's ashmem primarily provides a way to get unlinked
	tmpfs fds that can be shared between applications. Its
	just an additional feature that those pages can "unpinned"
	or marked volatile in my terminology. Thus in implementing
	volatile ranges, I've focused on getting it to work on tmpfs
	file descriptors.  However, there has been some interest in
	using volatile ranges with more traditional filesystems. The
	semantics for how volatile range purging would work on a
	real filesystem are not well established, and I can't say I
	understand the utility quite yet, but there may be a case for
	having data that you know won't be committed to disk until it
	is marked as non-volatile.  However, returning an EINVAL on
	non-tmpfs filesystems until such a use is established should
	be fine.

fd based interfaces vs madvise:
	In talking with Taras Glek, he pointed out that for his
	needs, the fd based interface is a little annoying, as it
	requires having to get access to tmpfs file and mmap it in,
	then instead of just referencing a pointer to the data he
	wants to mark volatile, he has to calculate the offset from
	start of the mmap and pass those file offsets to the interface.
	Instead he mentioned that using something like madvise would be
	much nicer, since they could just pass a pointer to the object
	in memory they want to make volatile and avoid the extra work.

	I'm not opposed to adding an madvise interface for this as
	well, but since we have a existing use case with Android's
	ashmem, I want to make sure we support this existing behavior.
	Specifically as with ashmem  applications can be sharing
	these tmpfs fds, and so file-relative volatile ranges make
	more sense if you need to coordinate what data is volatile
	between two applications.

	Also, while I agree that having an madvise interface for
	volatile ranges would be nice, it does open up some more
	complex implementation issues, since with files, there is a
	fixed relationship between pages and the files' address_space
	mapping, where you can't have pages shared between different
	mappings. This makes it easy to hang the volatile-range tree
	off of the mapping (well, indirectly via a hash table). With
	general anonymous memory, pages can be shared between multiple
	processes, and as far as I understand, don't have any grouping
	structure we could use to determine if the page is in a
	volatile range or not. We would also need to determine more
	complex questions like: What are the semantics of volatility
	with copy-on-write pages?  I'm hoping to investigate this
	idea more deeply soon so I can be sure whatever is pushed has
	a clear plan of how to address this idea. Further thoughts
	here would be appreciated.

It would really be great to get any thoughts on these issues, as they
are higher-priority to me then diving into the details of how we
implement this internally, which can shift over time.

Implementation Considerations:
---------------------------------------------

How best to manage volatile ranges internally in the kernel is still
an open question.

With this patch set, I'm really wanting to provide a proof of concept
of the general interface semantics above. This allows applications to
play with the idea and validate that it would work for them. Allowing
further discussion to continue on how to best implement or best allow
the implementation to evolve in the kernel.

Even so, I'm very interested in any discussion about how to manage
volatile ranges optimally.

Before describing the different management approaches I've tried,
there are some abstract properties and considerations that need to
be kept in mind:

* Range-granular Purging:
	Since volatile ranges can be reclaimed at any time, the
	question of how the kernel should reclaim volatile data
	needs to be addressed.	When a large data range  is marked
	as volatile, if any single page in that range is purged,
	the application will get an error when it marks the range
	as non-volatile.  Thus when any single page in a range
	is purged, the "value" of the entire range is destroyed.
	Because of this property, it makes most sense to purge the
	entire range together.

* Coalescing of adjacent volatile ranges:
	With volatile ranges, any overlapping ranges are always
	coalesced. However, there is an open question of what to
	do with neighboring ranges. With Android's approach, any
	neighboring ranges were coalesced into one range.  I've since
	tweaked this so that adjacent ranges are coalesced only if
	both have not yet been purged (or both are already purged).
	This avoids throwing away fine data just because its next
	to data that has already been tossed.  Not coalescing
	non-overlapping ranges is also an option I've considered,
	as it better follows the applications wishes, since as
	the application is providing these non-overlapping ranges
	separately, we should probably also purge them separately.
	The one complication here is that for userlands-sake, we
	manage volatile ranges at a byte level. So if an application
	marks one an a half pages of data as volatile, we only purge
	pages that are entirely volatile. This avoids accidentally
	purging non-volatile data on the rest of the page.  However,
	if an array of sub-page sized data is marked volatile one by
	one, coalescing the ranges allows us to purge a page that
	consists entirely of multiple volatile ranges.	So for now
	I'm still coalescing assuming the neighbors are both unpurged,
	but this behavior is open to being tweaked.

* Purging order between volatile ranges:
	Again, since it makes sense to purge all the complete
	pages in a range at the same time, we need to consider the
	subtle difference between the least-recently-used pages vs
	least-recently-used ranges. A single range could contain very
	frequently accessed data, as well as rarely accessed data.
	One must also consider that the act of marking a range as
	volatile may not actually touch the underlying pages. Thus
	purging ranges based on a least-recently-used page may also
	result in purging the most-recently used page.

	Android addressed the purging order question by purging ranges
	in the order they were marked volatile. Thus the oldest
	volatile range is the first range to be purged. This works
	well in the Android  model, as applications aren't supposed
	to access volatile data, so the least-recently-marked-volatile
	order maps well to the least-recently-used-range.

	However, this assumption doesn't hold with the lazy SIGBUS
	notification method, as pages in a volatile range may continue
	to be accessed after the range is marked volatile.  So the
	question as to what is the best order of purging volatile
	ranges is definitely open.

	Abstractly the ideal solution might be to evaluate the
	most-recently used page in each range, and to purge the range
	with the oldest recently-used-page, but I suspect this is
	not something that could be calculated efficiently.

	Additionally, in my conversations with Taras, he pointed out
	that if we are using a one-application-at-a-time UI model,
	it would be ideal to discourage purging volatile data used by
	the current application, instead prioritizing volatile ranges
	from applications that aren't active. However, I'm not sure
	what mechanism could be used to prioritize range purging in
	this fashion, especially considering volatile ranges can be
	on data that is shared between applications.

* Volatile range purging order relative to non-volatile pages:
	Initially I had proposed that since applications had offered
	data up as unused, volatile ranges should be purged before we
	try to free any other pages in the system.  At Plumbers, Andrea
	pointed out that this doesn't make much sense, as there may be
	inactive file pages from some streaming file data which are not
	going to be used any time soon, and would be a better candidate
	to free then an application's volatile pages. This sounded
	quite reasonable, so its likely we need to balance volatile
	purging with freeing other pages in the system. However, I do
	think it is advantageous to purge volatile pages before we
	free any dirty pages that must be laundered, as part of the
	goal of volatile pages is to avoid extra io. Although from
	my reading of shrink_page_list in vmscan.c I'm not sure I see
	if/how we prioritize freeing clean pages prior to dirty ones.

So with that background covered, on to discussing actual
implementations.

Implementation Details:
---------------------------------------------

There is two rough approaches that I have tried so far

1) Managing volatile range objects, in a tree or list, which are then
purged using a shrinker

2) Page based management, where pages marked volatile are moved to
a new LRU list and are purged from there.

1) This patchset is of the the shrinker-based approach. In many ways it
is simpler, but it does have a few drawbacks.  Basically when marking a
range as volatile, we create a range object, and add it to an rbtree.
This allows us to be able to quickly find ranges, given an address in
the file.  We also add each range object to the tail of a  filesystem
global linked list, which acts as an LRU allowing us to quickly find
the least recently created volatile range. We then use a shrinker
callback to trigger purging, where we'll select the range on the head
of the LRU list, purge the data, mark the range object as purged,
and remove it from the lru list.

This allows fairly efficient behavior, as marking and unmarking
a range are both O(logn) operation with respect to the number of
ranges, to insert and remove from the tree.  Purging the range is
also O(1) to select the range, and we purge the entire range in
least-recently-marked-volatile order.

The drawbacks with this approach is that it uses a shrinker, thus it is
numa un-aware. We track the virtual address of the pages in the file,
so we don't have a sense of what physical pages we're using, nor on
which node those pages may be on. So its possible on a multi-node
system that when one node was under pressure, we'd purge volatile
ranges that are all on a different node, in effect throwing data away
without helping anything. This is clearly non-ideal for numa systems.

One idea I discussed with Michel Lespinasse is that this might be
something we could improve by providing the shrinker some node context,
then keep track in the range  what node their first page is on. That
way we would be sure to at least free up one page on the node under
pressure when purging that range.

2) The second approach, which was more page based, was also tried. In
this case when we marked a range as volatile, the pages in that range
were moved to a new  lru list LRU _VOLATILE in vmscan.c.  This provided
a page lru list that could be used to free pages before looking at
the LRU_INACTIVE_FILE/ANONYMOUS lists.

This integrates the feature deeper in the mm code, which is nice,
especially as we have an LRU_VOLATILE list for each numa node. Thus
under pressure we won't purge ranges that are entirely on a different
node, as is possible with the other approach.

However, this approach is more costly.	When marking a range
as volatile, we have to migrate every page in that range to the
LRU_VOLATILE list, and similarly on unmarking we have to move each
page back. This ends up being O(n) with respect to the number of
pages in the range we're marking or unmarking. Similarly when purging,
we let the scanning code select a page off the lru, then we have to
map it back to the volatile range so we can purge the entire range,
making it a more expensive O(logn),  with respect to the number of
ranges, operation.

This is a particular concern as applications that want to mark and
unmark data as volatile with fine granularity will likely be calling
these operations frequently, adding quite a bit of overhead. This
makes it less likely that applications will choose to volunteer data
as volatile to the system.

However, with the new lazy SIGBUS notification, applications using
the SIGBUS method would avoid having to mark and unmark data when
accessing it, so this overhead may be less of a concern. However, for
cases where applications don't want to deal with the SIGBUS and would
rather have the more deterministic behavior of the unmark/access/mark
pattern, the performance is a concern.

Additionally, there may be ways to defer and batch the page migration
so that applications don't suffer the extra cost, but this solution
may be limited or could  cause some strange behavior, as we can't
defer the unmark method, as we don't want pages to be purged after
the application thinks they were unmarked.

Whew, that was long...

Anyway, if you got this far and are still interested, I'd be greatly
appreciate  hearing of any other suggested implementations, or ways
around the drawbacks of the already tried approaches.

thanks
-john

For this v7 patchset revision the changes are as follows:
* Dropped the LRU_VOLATILE approach for now so we can focus on
  getting the general interface semantics agreed upon
* Converted to using byte ranges rather then page ranges to make
  userland's life easier.	
* Add SIGBUS on purged page access behavior, allowing for access
  of volatile data without having to unmark it.

Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Android Kernel Team <kernel-team@xxxxxxxxxxx>
Cc: Robert Love <rlove@xxxxxxxxxx>
Cc: Mel Gorman <mel@xxxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Dave Hansen <dave@xxxxxxxxxxxxxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
Cc: Dmitry Adamushko <dmitry.adamushko@xxxxxxxxx>
Cc: Dave Chinner <david@xxxxxxxxxxxxx>
Cc: Neil Brown <neilb@xxxxxxx>
Cc: Andrea Righi <andrea@xxxxxxxxxxxxxxx>
Cc: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxxxxxxx>
Cc: Mike Hommey <mh@xxxxxxxxxxxx>
Cc: Taras Glek <tglek@xxxxxxxxxxx>
Cc: Jan Kara <jack@xxxxxxx>
Cc: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxx>
Cc: Michel Lespinasse <walken@xxxxxxxxxx>
Cc: Minchan Kim <minchan@xxxxxxxxxx>
Cc: linux-mm@xxxxxxxxx <linux-mm@xxxxxxxxx>

John Stultz (3):
  [RFC] Add volatile range management code
  [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers
  [RFC] ashmem: Convert ashmem to use volatile ranges

 drivers/staging/android/ashmem.c |  335 +---------------------
 fs/open.c                        |    3 +-
 include/linux/falloc.h           |    7 +-
 include/linux/volatile.h         |   46 +++
 mm/Makefile                      |    2 +-
 mm/shmem.c                       |  120 ++++++++
 mm/volatile.c                    |  580 ++++++++++++++++++++++++++++++++++++++
 7 files changed, 763 insertions(+), 330 deletions(-)
 create mode 100644 include/linux/volatile.h
 create mode 100644 mm/volatile.c

-- 
1.7.9.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>