[PATCH 0/8] Volatile Ranges (v8?)

Hey everyone.

I know it's been quite a while. But Minchan and I have been doing a
fair amount of discussing offlist since lsf-mm, trying to come
to agreement on the semantics for the volatile ranges interface,
and after circling around each other's arguments for a while (he'd
suggest an idea, I'd disagree, then I'd come around to agree just as
he would begin to disagree :) I think things have started to converge
pretty nicely, at least as far as the interface goes.

Some of the more interesting and challenging ideas we've explored
recently have been given up for now, mostly so we can get some core
agreed functionality moving upstream. We may still want to revisit
those ideas before the final push, but for now, we're focusing on
the parts we agree on that we think have a chance at eventually
being merged.

If you've read some of my earlier summaries, you'll likely find
this patchset much simplified:
* We only have one interface: vrange(address, len, mode, *purged),
  which is used in a manner similar to madvise on both file and
  anonymous pages.
* We no longer have a concept of anon-only or private-volatility.
  Despite the potential performance gains that Minchan liked in
  avoiding the mmap_sem, the semantics were often confusing when using
  private volatility on non-anonymous pages.
* We no longer have behavior flags. Potential extensions can still be
  done via introducing new mode flags.

The patch set has also been heavily reworked and reordered to make
more iterative sense and hopefully to be easier to review.

Patches 1-5 are what we're wanting the most feedback on, since this
is the area dealing with the userland interface and the semantics of
how volatile ranges behave.

Patches 6-8 provide the back-end purging logic, which is likely
to change, and is provided only so folks can start playing around
with a functional patch series. It currently has some limitations:
for example, it doesn't purge anonymous pages on swap-free systems.
Additionally, the newly integrated file page purging logic likely has
issues still to be resolved.

Overall, we still have the following TODOs with the patchset:
* Come to consensus on the best way to avoid inheriting mm_struct
  volatility when the underlying vmas change. (see patch 4 in this
  series)
* Ensure we zap underlying file pages (a la truncate_inode_pages_range)
  when we purge file pages - this makes purging similar to file hole
  punching and ensures we don't find stale data later. (patch 7)
* Avoid lockdep warnings caused by allocations made while holding vroot
  lock triggering reclaim which could try to purge volatile ranges,
  grabbing the same vroot lock.  Minchan added a GFP_NO_VRANGE flag,
  but we've not hooked that up into the reclaim logic to avoid purging.
* Re-integrate Minchan's logic to purge anonymous pages on swap-free
  systems (dropped for this release to keep things simpler for review)


Any feedback and review would be greatly appreciated!

thanks!
-john


Volatile Ranges
===============
Volatile ranges provide a way for userland applications to give
hints to the kernel about memory that is not immediately in use and
can be regenerated if needed.

After marking a range as volatile, if the kernel experiences memory
pressure, the kernel can then purge those pages, freeing up additional
space.  Userland can also tell the kernel it wants to use that memory
again, by marking the range non-volatile, after which the kernel will
not purge that memory.

If the kernel has already purged the memory when userland requests
it be made non-volatile, the kernel will return a warning value to
notify userland that the data was lost and must be regenerated.
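
The mark / purge / unmark lifecycle described so far can be summarized
as a small state machine. The following is only a toy userland model of
those semantics, not kernel code; the names (range_state, model_range,
model_mark_volatile, model_memory_pressure, model_mark_nonvolatile) are
made up for illustration and do not appear in the patches.

```c
#include <stdbool.h>

/* Toy model of one volatile range; illustrative only, not kernel code. */
enum range_state { RANGE_NONVOLATILE, RANGE_VOLATILE, RANGE_PURGED };

struct model_range {
	enum range_state state;
};

/* Userland marks the range volatile: it becomes eligible for purging. */
void model_mark_volatile(struct model_range *r)
{
	if (r->state == RANGE_NONVOLATILE)
		r->state = RANGE_VOLATILE;
}

/*
 * Memory pressure: only a volatile range may be purged.
 * Returns true if the range was purged by this call.
 */
bool model_memory_pressure(struct model_range *r)
{
	if (r->state != RANGE_VOLATILE)
		return false;
	r->state = RANGE_PURGED;
	return true;
}

/*
 * Userland marks the range non-volatile again. The return value plays
 * the role of the "purged" result: 1 if the data was lost and must be
 * regenerated, 0 if it is still intact.
 */
int model_mark_nonvolatile(struct model_range *r)
{
	int purged = (r->state == RANGE_PURGED);

	r->state = RANGE_NONVOLATILE;
	return purged;
}
```

In this model a non-volatile range is never purged, and the purged
result is only reported at the moment the range is made non-volatile.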

If userland accesses memory marked volatile that has not been purged,
it will get the values it expects.

However, if userland touches volatile memory that has been purged, the
kernel will send it a SIGBUS.  This makes it possible for userland to
handle the SIGBUS, marking the memory as non-volatile and re-generating
it as needed before continuing.
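
One way userland might structure the SIGBUS recovery is with
sigsetjmp()/siglongjmp(). This is a rough sketch under assumptions, not
something taken from the patches: guarded_access(),
install_sigbus_handler() and the touch_*() helpers are hypothetical
names, and touch_purged() merely raises SIGBUS to stand in for touching
a purged page, so the example runs on a kernel without vrange.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf purge_jmp;

/* SIGBUS handler: unwind back to the guarded access site. */
static void on_sigbus(int sig)
{
	(void)sig;
	siglongjmp(purge_jmp, 1);
}

/* Install the handler; returns 0 on success, -1 on failure. */
int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigbus;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}

/*
 * Run access(arg). Returns 0 if it completed, or 1 if it was cut short
 * by SIGBUS. sigsetjmp() is called with savemask=1 so the signal mask
 * is restored when the handler jumps back here.
 */
int guarded_access(void (*access)(void *), void *arg)
{
	if (sigsetjmp(purge_jmp, 1) != 0)
		return 1;
	access(arg);
	return 0;
}

/* Stand-in for touching a purged volatile page (assumption). */
void touch_purged(void *arg)
{
	(void)arg;
	raise(SIGBUS);
}

/* Stand-in for touching a page that is still resident. */
void touch_resident(void *arg)
{
	*(volatile int *)arg += 1;
}
```

When guarded_access() returns 1, a real application would mark the
range non-volatile, regenerate the data, and retry the access. A single
static jump buffer only suits a single-threaded program; threaded code
would need per-thread state.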

In some ways, the kernel's purging of memory can be considered
similar to a delayed MADV_DONTNEED or FALLOC_FL_PUNCH_HOLE
operation that can be canceled. As with MADV_DONTNEED or
FALLOC_FL_PUNCH_HOLE, operations done on file data that is mmapped
shared will be seen by other processes that have that file mapped. So
if an application marks shared mmapped file data as volatile, that
volatility state is also shared across other tasks. This allows tasks
to coordinate: one task can mark shared file data as volatile, and a
second task can unmark it if necessary. If the kernel purges
volatile file data that was marked by one task, all tasks sharing
that data will see the data as purged, and will have to mark it as
non-volatile before accessing it or will have to handle the SIGBUS.

All volatility on files is cleared when the last fd handle is closed.


Interface:
The vrange syscall is defined as follows:

int vrange(unsigned long address, size_t length, int mode, int* purged)

address:	Starting address in the process where memory will be
		marked. This must be page-aligned.

length:		Length of the range to be marked. This must be a
		multiple of the page size.

mode:
 VRANGE_VOLATILE:	Marks the specified range as volatile, and
			able to be purged.
 VRANGE_NONVOLATILE:	Marks the specified range as non-volatile. If
			any data in that range was volatile and has
			been purged, 1 will be returned in the purged
			pointer.

purged:		Pointer to an integer that will be set to 1 if any data
		in the range being marked non-volatile has been purged
		and is lost. If it is zero, then no data in the
		specified range has been lost.

Return values:
		Returns the number of bytes marked or unmarked. Similar
		to write(), it may return fewer bytes than specified
		if it ran into a problem.

		If an error (negative value) is returned, no changes
		were made.
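
Since the call may mark fewer bytes than requested, callers will likely
want a retry loop much like the usual one around write(). Below is a
minimal sketch under assumptions: vrange_fn, vrange_full(),
fake_vrange() and FAKE_PAGE are hypothetical names, and the numeric
values given to the mode flags are placeholders (the real ones come from
the patched mman-common.h). The call is passed in as a function pointer
so the loop can be exercised with a stand-in on a kernel without the
syscall.

```c
#include <errno.h>
#include <stddef.h>

/* Placeholder values only; the real ones are defined by the patches. */
#ifndef VRANGE_VOLATILE
#define VRANGE_VOLATILE		0
#define VRANGE_NONVOLATILE	1
#endif

/* Same shape as the proposed vrange() call. */
typedef long (*vrange_fn)(unsigned long address, size_t length,
			  int mode, int *purged);

/*
 * Keep calling until the whole range is marked, an error occurs, or no
 * progress is made. Returns the bytes marked, or the negative error if
 * nothing was marked. *purged is set to 1 if any call reported a purge.
 */
long vrange_full(vrange_fn call, unsigned long address, size_t length,
		 int mode, int *purged)
{
	size_t done = 0;
	int any_purged = 0;

	while (done < length) {
		int p = 0;
		long n = call(address + done, length - done, mode, &p);

		if (n < 0) {
			if (done == 0)
				return n;	/* nothing changed at all */
			break;			/* report partial progress */
		}
		if (n == 0)
			break;			/* no progress; don't spin */
		any_purged |= p;
		done += (size_t)n;
	}
	if (purged)
		*purged = any_purged;
	return (long)done;
}

#define FAKE_PAGE 4096UL

/*
 * Stand-in for the syscall (assumption): marks at most one page per
 * call and pretends the page at address 2 * FAKE_PAGE was purged.
 */
long fake_vrange(unsigned long address, size_t length, int mode,
		 int *purged)
{
	if (address % FAKE_PAGE || length % FAKE_PAGE)
		return -EINVAL;
	if (length == 0)
		return 0;
	*purged = (mode == VRANGE_NONVOLATILE && address == 2 * FAKE_PAGE);
	return (long)FAKE_PAGE;
}
```

OR-ing the per-call purged results together is a choice made for this
sketch; it reflects the "any data in the range" wording of the purged
argument across the retried calls.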

Errors:
	EINVAL:
		* address is not page-aligned, or is invalid.
		* length is not a multiple of the page size.
		* length is negative.
	ENOMEM:
		* Not enough memory.
	EFAULT:
		* purged pointer is invalid.
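
Given the EINVAL conditions above, an application holding an arbitrary
buffer would need to find the whole pages inside it before calling
vrange. Here is one possible helper, offered as a sketch:
page_aligned_subrange() is a hypothetical name, and the page size is
passed in explicitly (in practice it would likely come from
sysconf(_SC_PAGESIZE)). It rounds inward rather than outward, so that
bytes outside the buffer are never made volatile.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Compute the largest page-aligned range fully contained in
 * [start, start + len). Returns 0 and fills the outputs on success, or
 * -1 if the buffer contains no whole page or the arguments are invalid.
 */
int page_aligned_subrange(uintptr_t start, size_t len, size_t page_size,
			  uintptr_t *aligned_start, size_t *aligned_len)
{
	uintptr_t mask, end, first, last;

	/* The page size must be a non-zero power of two. */
	if (page_size == 0 || (page_size & (page_size - 1)) != 0)
		return -1;
	/* Reject ranges whose end would wrap around. */
	if (len > UINTPTR_MAX - start)
		return -1;
	/* Reject starts that cannot be rounded up without wrapping. */
	if (start > UINTPTR_MAX - (page_size - 1))
		return -1;

	mask = ~(uintptr_t)(page_size - 1);
	end = start + len;
	first = (start + page_size - 1) & mask;	/* round start up */
	last = end & mask;			/* round end down */

	if (first >= last)
		return -1;			/* no whole page inside */

	*aligned_start = first;
	*aligned_len = (size_t)(last - first);
	return 0;
}
```

For example, with 4096-byte pages a 12288-byte buffer starting at
address 4097 contains two whole pages, starting at address 8192.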




Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Android Kernel Team <kernel-team@xxxxxxxxxxx>
Cc: Robert Love <rlove@xxxxxxxxxx>
Cc: Mel Gorman <mel@xxxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Dave Hansen <dave@xxxxxxxxxxxxxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
Cc: Dmitry Adamushko <dmitry.adamushko@xxxxxxxxx>
Cc: Dave Chinner <david@xxxxxxxxxxxxx>
Cc: Neil Brown <neilb@xxxxxxx>
Cc: Andrea Righi <andrea@xxxxxxxxxxxxxxx>
Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Cc: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxxxxxxx>
Cc: Mike Hommey <mh@xxxxxxxxxxxx>
Cc: Taras Glek <tglek@xxxxxxxxxxx>
Cc: Dhaval Giani <dgiani@xxxxxxxxxxx>
Cc: Jan Kara <jack@xxxxxxx>
Cc: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxx>
Cc: Michel Lespinasse <walken@xxxxxxxxxx>
Cc: Minchan Kim <minchan@xxxxxxxxxx>
Cc: linux-mm@xxxxxxxxx <linux-mm@xxxxxxxxx>


John Stultz (2):
  vrange: Add vrange support for file address_spaces
  vrange: Clear volatility on new mmaps

Minchan Kim (6):
  vrange: Add basic data structure and functions
  vrange: Add vrange support to mm_structs
  vrange: Add new vrange(2) system call
  vrange: Add GFP_NO_VRANGE allocation flag
  vrange: Add method to purge volatile ranges
  vrange: Send SIGBUS when user try to access purged page

 arch/x86/include/asm/pgtable_types.h   |   2 +
 arch/x86/syscalls/syscall_64.tbl       |   1 +
 fs/file_table.c                        |   5 +
 fs/inode.c                             |   2 +
 include/asm-generic/pgtable.h          |  11 +
 include/linux/fs.h                     |   2 +
 include/linux/gfp.h                    |   7 +-
 include/linux/mm_types.h               |   5 +
 include/linux/rmap.h                   |  12 +-
 include/linux/swap.h                   |   1 +
 include/linux/vrange.h                 |  60 +++
 include/linux/vrange_types.h           |  19 +
 include/uapi/asm-generic/mman-common.h |   3 +
 init/main.c                            |   2 +
 kernel/fork.c                          |   6 +
 lib/Makefile                           |   2 +-
 mm/Makefile                            |   2 +-
 mm/ksm.c                               |   2 +-
 mm/memory.c                            |  23 +-
 mm/mmap.c                              |   5 +
 mm/rmap.c                              |  30 +-
 mm/swapfile.c                          |  36 ++
 mm/vmscan.c                            |  16 +-
 mm/vrange.c                            | 731 +++++++++++++++++++++++++++++++++
 24 files changed, 963 insertions(+), 22 deletions(-)
 create mode 100644 include/linux/vrange.h
 create mode 100644 include/linux/vrange_types.h
 create mode 100644 mm/vrange.c

-- 
1.8.1.2
