[LSF/MM TOPIC] lru_lock scalability

I'd like to propose a discussion of lru_lock scalability on the mm track.  Since this is similar to Laurent Dufour's mmap_sem topic, it might make sense to discuss these around the same time.

On large systems, lru_lock is one of the hottest locks in the kernel, showing up on many memory-intensive benchmarks such as decision support.  It also inhibits scalability in many of the mm paths that could be parallelized, such as freeing pages during exit/munmap and inode eviction.

I'd like to discuss the following two ways of solving this problem, as well as any other approaches or ideas people have.


1) LRU batches
--------------

This method, developed with Steve Sistare, is described in this RFC series:

  https://lkml.org/lkml/2018/1/31/659

It introduces an array of locks, with each lock protecting certain batches of LRU pages.

       *ooooooooooo**ooooooooooo**ooooooooooo**oooo ...
       |           ||           ||           ||
        \ batch 1 /  \ batch 2 /  \ batch 3 /

In this ASCII depiction of an LRU, a page is represented with either '*' or 'o'.  An asterisk indicates a sentinel page, which is a page at the edge of a batch.  An 'o' indicates a non-sentinel page.

To remove a non-sentinel LRU page, only one lock from the array is required.  This allows multiple threads to remove pages from different batches simultaneously.  A sentinel page requires lru_lock in addition to a lock from the array.

Full performance numbers appear in the last patch in the linked series above, but this prototype allows a microbenchmark to do up to 28% more page faults per second with 16 or more concurrent processes.
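
To make the batch locking rule concrete, here is a rough sketch of a removal under this scheme.  This is not code from the RFC series; the lock array and the helpers page_lru_batch() and page_is_lru_sentinel() are invented for illustration.

/*
 * Illustrative sketch only, not taken from the RFC series.  page_lru_batch()
 * and page_is_lru_sentinel() are hypothetical helpers returning the batch a
 * page belongs to and whether it is a sentinel at a batch edge.
 */
#define NUM_LRU_BATCH_LOCKS 32

static spinlock_t lru_batch_locks[NUM_LRU_BATCH_LOCKS];

static void lru_remove_page_batched(struct pglist_data *pgdat,
                                    struct page *page)
{
        spinlock_t *batch_lock = &lru_batch_locks[page_lru_batch(page)];
        bool sentinel = page_is_lru_sentinel(page);

        /* Sentinel pages sit at a batch edge and also need lru_lock. */
        if (sentinel)
                spin_lock(&pgdat->lru_lock);
        spin_lock(batch_lock);

        list_del(&page->lru);

        spin_unlock(batch_lock);
        if (sentinel)
                spin_unlock(&pgdat->lru_lock);
}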


2) Locking individual pages
---------------------------

This work, developed in collaboration with Yossi Lev and Dave Dice, locks individual pages on the LRU for removal.  It converts lru_lock from a spinlock to a rw_lock, using the read mode for concurrent removals and the write mode for other operations, i.e. those that need exclusive access to the LRU.

We hope to have a prototype and performance numbers closer to the time of the summit.

Here's a more detailed description of this approach from Yossi Lev, with code snippets to support the ideas.

/**
* The proposed algorithm for concurrent removal of pages from an LRU list
* relies on the ability to lock individual pages during removals.  In the
* implementation below, we support that by storing the NULL value in the next
* field of the page's lru field.  We can do that because the next field value
* is not needed once a page is locked. If for some reason we need to maintain
* the value of the next pointer during page removal, we could use its LSB for
* the locking purpose, or any other bit in the page that we designate for this
* purpose.
*/

#define PAGE_LOCKED_VAL    NULL

#define IS_LOCKED(lru)     (lru.next == PAGE_LOCKED_VAL)


/**
* We change the type of lru_lock from a regular spin lock (spinlock_t) to a read-write
* spin lock (rwlock_t).  Locking in read mode allows some operations to run
* concurrently with each other, as long as functions that require exclusive
* access do not hold the lock in write mode.
*/
typedef struct pglist_data {
   // ...

   /* Write-intensive fields used by page reclaim */
   ZONE_PADDING(_pad1_)
   rwlock_t    lru_lock;

   // ...
} pg_data_t;

static inline rwlock_t *zone_lru_lock(struct zone *zone) {
	return &zone->zone_pgdat->lru_lock;
}
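
Call sites that need exclusive access to the LRU keep taking the lock, now in write mode.  A minimal sketch (the wrapper name is made up, and real call sites also deal with irq state and memcg lruvec lookup) might look like:

/* Illustrative only: exclusive (write) acquisition for operations that must
 * not run concurrently with removals. */
static void lru_add_page_exclusive(struct page *page, struct lruvec *lruvec,
                                   enum lru_list lru)
{
        struct zone *zone = page_zone(page);

        write_lock_irq(zone_lru_lock(zone));
        add_page_to_lru_list(page, lruvec, lru);
        write_unlock_irq(zone_lru_lock(zone));
}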


/**
* A concurrent variant of del_page_from_lru_list
*
* Unlike the regular del_page_from_lru_list function, which must be called while
* the lru_lock is held exclusively, here we are allowed to hold the lock in read
* mode, allowing concurrent runs of the del_page_from_lru_list_concurrent function.
*
* The implementation assumes that the lru_lock is held in either read or write
* mode on function entry.
*/
void del_page_from_lru_list_concurrent(struct page *page,
                                      struct lruvec *lruvec,
                                      enum lru_list lru) {
   /**
    * A removal of a page from the lru list only requires changing the prev and
    * next pointers in its neighboring pages.  Thus, the only case where we
    * need to synchronize concurrent removal of pages is when the pages are
    * adjacent on the LRU.
    *
    * The key idea of the algorithm is to deal with this case by locking
    * individual pages, and requiring a thread that removes a page to acquire
    * the locks on both the page that it removes, and the predecessor of that
    * page. Specifically, the removal of page P1 with a predecessor P0 in the
    * LRU list requires locking both P1 and P0, and keeps P1 locked until P1's
    * removal is completed.  Only then is the lock on P0 released, allowing it
    * to be removed as well.
    *
    * Note that while P1 is removed by thread T, T can manipulate the lru.prev
    * pointer of P1's successor page, even though T does not hold the lock on
    * that page.  This is safe because the only thread other than T that may be
    * accessing the lru.prev pointer of P1's successor page is a thread that
    * tries to remove that page, and that thread cannot proceed with the
    * removal (and update the lru.prev of the successor page) as long as P1 is
    * the predecessor, and is locked by T.
    *
    * Since we expect the case of concurrent removal of adjacent pages to be
    * rare, locking of adjacent pages is expected to succeed without waiting in
    * the common case; thus we expect very little or no contention during
    * parallel removal of pages from the list.
    *
    * Implementation note: encoding whether a page is locked in the list_head
    * structure itself simplifies the function code a bit,
    * as it does not need to deal differently with the corner case where we
    * remove the page at the front of the list (i.e. the most recently used
    * page); this is because the pointers to the head and tail of a list also
    * reside in a field of type list_head (stored in lruvec), so the
    * implementation treats this field as if it is the lru field of a page, and
    * locks it as the predecessor of the head page when it is removed.
    */

   /**
    * Step 1: lock our page (i.e. the page we need to remove); we only fail to
    * do so if our successor is in the process of being removed, in which case
    * we wait for its removal to finish, which will unlock our page.
    */
   struct list_head *successor_p = READ_ONCE(page->lru.next);
   while (successor_p == PAGE_LOCKED_VAL ||
          cmpxchg(&page->lru.next,
                  successor_p, PAGE_LOCKED_VAL) != successor_p) {

       /* our successor is being removed, wait for it to finish */
       cpu_relax();
       successor_p = READ_ONCE(page->lru.next);
   }

   /**
    * Step 2: our page is locked, successor_p holds the pointer to our
    * successor, which cannot be removed while our page is locked (as we are
    * the predecessor of that page).
    * Try locking our predecessor. Locking will only fail if our predecessor is
    * being removed (or was already removed), in which case we wait for its
    * removal to complete, and continue by trying to lock our new predecessor.
    *
    * Notes:
    *
    * 1. We check that the predecessor is not locked by verifying that its next
    * pointer still points to our page's node (i.e., our lru field).  If a
    * thread is in the process of removing our predecessor, this check keeps
    * failing until we have a new predecessor that is not being removed.
    * We therefore need to keep reading the value stored in our lru.prev field
    * every time an attempt to lock the predecessor fails.
    *
    * 2. The thread that removes our predecessor can update our lru.prev field
    * even though it doesn't hold the lock on our page.  The reason is that the
    * lru.prev pointer of a page is only needed when that page is removed from
    * the list, so our lru.prev is only used by a thread that removes our page,
    * and that thread cannot proceed with the removal as long as the predecessor
    * is locked; it simply keeps re-reading lru.prev until it finds an unlocked
    * predecessor.
    *
    * 3. This is the part of the code that would have to handle the corner case
    * of removing the head page, which has no predecessor page to lock, if
    * locking a page were not a simple operation on a list_head field.  We can
    * avoid dealing with this corner case because a) we lock a page simply by
    * manipulating the next pointer in a field of type list_head, and b) the
    * lru.prev field of the head page points to a list_head instance (the list
    * head stored in lruvec) that is not part of a page, but still has a next
    * pointer that points to the head page, so we can lock and unlock it just as
    * if it were the lru field of a predecessor page.
    *
    */
   struct list_head *predecessor_p = READ_ONCE(page->lru.prev);
   while (predecessor_p->next != &page->lru ||
          cmpxchg(&predecessor_p->next,
                  &page->lru, PAGE_LOCKED_VAL) != &page->lru) {
       /**
        * Our predecessor is being removed; wait till we have a new unlocked
        * predecessor
        */
       cpu_relax();
       predecessor_p = READ_ONCE(page->lru.prev);
   }

   /**
    * Step 3: we now hold the lock on both our page (the one we need to remove)
    * and our predecessor page:  link together our successor and predecessor
    * pages by updating their prev and next pointers, respectively.  At that
    * point our node is removed, and we can safely release the lock on it by
    * updating its next pointer.
    *
    * Critically, we have to update the successor prev pointer _before_
    * updating the predecessor next pointer. The reason is that updating the
    * next pointer of a page unlocks that page, and unlocking our predecessor
    * page will allow it to be removed.  Thus, we have to make sure that both
    * pages are properly linked together before releasing the lock on our
    * predecessor.
    *
    * We guarantee that by setting the successor prev pointer first, which will
    * now point to our predecessor while it is still locked.  Thus, neither of
    * these pages can yet be removed. Only then do we set the predecessor next
    * pointer, which also relinquishes our acquisition of its lock.
    *
    * For similar reasons, we only release the lock on our own node after the
    * successor points to its new predecessor.  (With the way the code is
    * written above, this is not as critical because the successor will still
    * fail to remove itself as long as our next pointer does not point to it,
    * so having any value other than the pointer to the successor in our next
    * field is safe.  That said, there is no reason to touch our next (or prev)
    * pointers before we're done with the page removal.)
    */
   WRITE_ONCE(successor_p->prev, predecessor_p);
   WRITE_ONCE(predecessor_p->next, successor_p);

   /**
    * Cleanup (also releases the lock on our page).
    */
   page->lru.next = LIST_POISON1;
   page->lru.prev = LIST_POISON2;
}
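
A hypothetical caller (the wrapper below is invented for illustration) holds the lock in read mode around the removal, so that removals running on different CPUs can proceed in parallel:

/* Illustrative only: shared (read) acquisition around a concurrent removal. */
static void lru_del_page_concurrent(struct page *page, enum lru_list lru)
{
        struct zone *zone = page_zone(page);
        struct lruvec *lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);

        read_lock_irq(zone_lru_lock(zone));
        del_page_from_lru_list_concurrent(page, lruvec, lru);
        read_unlock_irq(zone_lru_lock(zone));
}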

/*
    ------------------
    Further Discussion
    ------------------

    As described above, the key observation behind the algorithm is that simple
    lru page removal operations can proceed in parallel with each other with
    very little per-page synchronization using the compare-and-swap (cmpxchg)
    primitives.  This can only be done safely, though, as long as other
    operations do not access the lru list at the same time.
    To enable concurrent removals only when it is safe to do so, we replace the
    spin lock that is used to protect the lru list (aka lru_lock) with a
    read-write lock (rwlock_t), which can be acquired in either exclusive mode
    (write acquisition) or shared mode (read acquisition).  Using the rwlock
    allows us to run many page removal operations in parallel, as long as no
    operation that requires exclusive access is running.

    Below we discuss a few key properties of the algorithm.

    Safety and synchronization overview
    ===================================

    1. Safe concurrent page removal: because page removals from the list do not
    need to traverse the list, but only manipulate the prev and next pointers of
    their neighboring pages, most removals can be done in parallel without
    synchronizing with each other; however, if two pages that are adjacent to
    each other in the list are being removed concurrently, some care should be
    taken to ensure that we correctly handle the data race over the pages'
    lru.next and lru.prev fields.

    The high level idea of synchronizing page removal is simple: we allow
    individual pages to be locked, and in order to remove a page P1 with a
    predecessor P0 and successor P2 (i.e., P0 <-> P1 <-> P2), the thread
    removing P1 has to lock both P1 and P0 (i.e., the page to be removed, and
    its predecessor).  Note that even though a thread T that removes P1 is not
    holding the lock of P2, P2 cannot be removed while T is removing P1, as P1
    is the predecessor of P2 in the list, and it is locked by T while it is
    being removed.  Thus, once T is holding the locks on both P1 and P0, it can
    manipulate the next and prev pointers between all 3 nodes.  The details of
    how the locking protocol and the pointer manipulations are handled are
    described in the code documentation above.

    2. Synchronization between page removals and other operations on the list:
    as mentioned above, the use of a read-write lock guarantees that only
    removal operations are running concurrently together, and the list is not
    being accessed by any other operation during that time.  We note, though,
    that a thread is allowed to acquire the read-write lock in exclusive mode
    (that is, acquire it for writing), which guarantees that it has exclusive
    access to the list. This can be used not only as a contention control
    mechanism (for extreme cases where most concurrent removal operations are
    for adjacent nodes), but also as a barrier to guarantee that all removal
    operations have completed.  This can be useful, for example, if we want to
    reclaim the memory used by previously removed pages and store new data in it
    that cannot be safely observed by a thread in the middle of a removal
    operation.  A simple write acquisition of the new lru lock guarantees that
    all threads that are in the midst of removing pages complete their
    operations before the lock is acquired in exclusive mode.

    Progress
    ========

    1. Synchronization granularity: the synchronization mechanism described
    above is extremely fine grained --- a removal of a page only requires
    holding the lock on that page and its predecessor.  Thus, in the example of
    a list with pages P0<->P1<->P2 in it, the removal of P1 can only delay the
    removal of P0 and P2, but not the removal of P2's successor or P0's
    predecessor; those can take place concurrently with the removal of P1.

    Because we expect concurrent removal of adjacent pages to be rare, we expect
    to see very little contention under normal circumstances, and for most
    remove operations to be executed in parallel without needing to wait for one
    another.

    2. Deadlocks and livelocks: the algorithm is not susceptible to either
    deadlocks or livelocks.  The former cannot occur because a thread can only
    block the removal of its successor node, and the first (head) page can
    always be removed without waiting, as it does not have a real predecessor
    (see the comments in the code above on handling this case).  As for
    livelocks, a thread that acquires a lock never releases it before finishing
    its removal operation.  Thus, threads never "step back" to yield to other
    threads, and can only make progress towards finishing the removal of their
    page, so we should never be in a livelock scenario.

    3. Removal vs. other operations on the list: operations on the list other
    than page removal take the lru read-write lock in write mode, so they never
    run concurrently with anything else.  However, in theory, a stream of
    removal operations can starve operations that require exclusive access to
    the list; this may happen
    if the removal operations are allowed to keep acquiring and releasing the
    lock in shared mode, keeping the system in a state where there is always at
    least one removal operation in progress.
    It is critical, then, that we use a read-write lock that supports an
    anti-starvation mechanism, and in particular protects the writers from being
    starved by readers.  Almost all read-write locks in use today have such a
    mechanism; a common idiom is that once a thread is waiting to acquire the
    lock in exclusive mode, threads that are trying to acquire
    the lock in shared mode are delayed until the thread that requires exclusive
    access acquires and releases the lock for writing.

    4. Handling extreme contention cases: in extreme cases that incur high wait
    time for locking individual pages, threads that are waiting to remove a page
    can always release their shared ownership of the lru lock, and request an
    exclusive ownership of it instead (that is, acquire the lock for writing).
    A successful write acquisition prevents any other operations (removals or
    others) from running in parallel with that thread, but it may be used as a
    fallback in extreme cases of high contention on a small fragment of the
    list, where a single acquisition of the global list lock may prove more
    beneficial than a series of individual page lock acquisitions.

    While we do not expect this to happen in practice, it is important to keep
    the option in mind and, if necessary, implement a fallback mechanism that
    detects the case where most remove operations are waiting to acquire
    individual page locks and signals them to move to a serial execution mode
    (one removal operation at a time); a rough sketch of such a fallback appears
    after this comment block.

    Integration and compatibility with existing code
    ================================================

    One important property of the new algorithm is that it requires very few
    local changes to existing code.  The new concurrent function for page
    removal is very short, and the only global change is replacing the lru spin
    lock with a read-write lock and changing the acquisitions done for page
    removal to read acquisitions.

    Furthermore, the implementation suggested above does not add any additional
    fields or change the structure of existing data structures, and hence the
    old, serial page removal function is still compatible with it. Thus, if a
    page removal operation needs to acquire the lock exclusively for some
    reason, it can simply call the regular, serial page removal function, and
    avoid the additional synchronization overhead of the new parallel variant.
*/
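
To make the fallback idea from the contention discussion concrete, here is a minimal sketch.  del_page_from_lru_list_concurrent_try() does not exist above; it stands for a hypothetical variant of the concurrent function that bounds its spinning on the per-page locks and returns false (leaving the page's lru links untouched) when it gives up.

/*
 * Hypothetical fallback sketch: try the concurrent removal under the shared
 * lock, and fall back to the existing serial path under the exclusive lock
 * if the per-page locks are too contended.
 */
static void lru_del_page_with_fallback(struct page *page, struct lruvec *lruvec,
                                       enum lru_list lru)
{
        rwlock_t *lock = zone_lru_lock(page_zone(page));

        read_lock_irq(lock);
        if (del_page_from_lru_list_concurrent_try(page, lruvec, lru)) {
                /* Common case: the per-page locks were acquired quickly. */
                read_unlock_irq(lock);
                return;
        }
        read_unlock_irq(lock);

        /*
         * Contended case: take the lock exclusively and use the existing
         * serial removal path, which needs no per-page locking.
         */
        write_lock_irq(lock);
        del_page_from_lru_list(page, lruvec, lru);
        write_unlock_irq(lock);
}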
