Re: [LSF/MM ATTEND] Fadvise Extensions for Directory Level Cache Cleaning and POSIX_FADV_NOREUSE

Li Wang <liwang@xxxxxxxxxxxxxxx> · Fri, 24 Jan 2014 15:24:37 +0800

On 2014/1/23 21:19, Qing Wei wrote:
Hi,

On 01/20/2014 10:56 PM, Li Wang wrote:

    Hello,
    I would appreciate a chance to discuss the fadvise extension topic
    at the upcoming LSF/MM summit. I am also very interested in the
    topics on VFS, MM, SSD optimization as well as ext4, xfs, ceph and
    so on.
    Over the last year, I have been involved in Ceph development. The
    features done or ongoing include punch hole support, inline data
    support, cephfs quota support, cephfs fuse file lock support, etc.,
    as well as some bug fixes and performance evaluations.

    The proposal is below, comments/suggestions are welcome.

    Fadvise Extensions for Directory Level Cache Cleaning and
    POSIX_FADV_NOREUSE

    1 Motivation

    1.1 Directory Level Cache Cleaning

    VFS relies on an LRU-like page cache eviction algorithm to reclaim
    cache space. Since LRU is not aware of application semantics, it may
    evict pages that are about to be referenced, resulting in severe
    performance degradation due to cache thrashing, especially under high
    memory pressure. Applications have the most semantic knowledge, so
    they can do better if they are given a chance. This motivates giving
    applications more ability to manipulate the VFS cache.

    Currently, Linux supports file-system-wide cache cleaning through the
    proc interface 'drop_caches', but it is very coarse-grained and was
    originally proposed for debugging. The other option is file-level
    page cache cleaning through 'fadvise'. However, since there is no way
    of determining whether a path name is in the dentry cache, simply
    calling fadvise(name, DONTNEED) will very likely pollute the cache
    rather than clean it. Even if a cache query API were available, it
    would incur heavy system call overhead, especially in massive
    small-file situations. This motivates extending fadvise() to support
    directory level cache cleaning. The original implementation is
    available at https://lkml.org/lkml/2013/12/30/147, and received some
    constructive comments. We think some design points need to be put
    under discussion, and we summarize them in Section 2.1.
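
A minimal userspace sketch of the file-level approach described above, for
illustration only. The helper name drop_file_cache() is hypothetical; the
only real interfaces used are open(), posix_fadvise() and close(). Note that
the open() call itself forces a path lookup, which is the cache pollution the
paragraph above refers to, and that one open/fadvise/close round trip is
needed per file.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): advise the kernel that the
 * cached pages of one file are no longer needed.
 * Returns 0 on success, or a positive errno-style code on failure.
 */
static int drop_file_cache(const char *path)
{
    int fd = open(path, O_RDONLY);   /* the lookup populates dentry/inode caches */
    if (fd < 0)
        return errno;

    /* offset 0 with len 0 covers the whole file */
    int ret = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    close(fd);
    return ret;                      /* posix_fadvise() returns the error number directly */
}
```

Cleaning a whole directory this way means walking the tree and calling the
helper once per file, which is the overhead the proposal wants to avoid.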

    1.2 POSIX_FADV_NOREUSE

    POSIX_FADV_NOREUSE is useful for backup and data streaming
    applications. There have already been some efforts on a
    POSIX_FADV_NOREUSE implementation; the latest seems to be
    https://lkml.org/lkml/2012/2/11/133. The alternatives are (a) use
    fadvise(DONTNEED) instead; (b) use a container-based approach, such
    as setting memory.file.limit_in_bytes. However, both (a) and (b) have
    limitations. (a) may impolitely destroy other applications' working
    sets, which is not desirable behavior; (b) is kind of rude, and the
    threshold may have to be carefully tuned, otherwise it may cause
    applications to start swapping or even worse. In addition, we are not
    sure whether it shares the same issue as (a). This motivates
    developing a simple yet efficient POSIX_FADV_NOREUSE implementation.
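
For reference, a sketch of how a backup or streaming application would issue
the hint from userspace. The helper name stream_once() is hypothetical. The
hint is purely advisory, so the sketch ignores its return value; whether it
has any effect depends on the kernel implementing it, which is what this
proposal is about.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: read a file once from start to end after
 * declaring that the data will not be reused.
 * Returns the number of bytes read, or -1 on error.
 */
static long long stream_once(const char *path)
{
    char buf[65536];
    long long total = 0;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Advisory only: failure to apply the hint is not an error here. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n < 0) {
            if (errno == EINTR)
                continue;            /* interrupted, retry the read */
            close(fd);
            return -1;
        }
        if (n == 0)
            break;                   /* end of file */
        total += n;
    }

    close(fd);
    return total;
}
```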

    2 Designs to be discussed

    Since both are advisory interfaces, the overall idea behind our
    design is to minimize modifications to the current MM magic and keep
    the implementation as simple as possible.

    2.1 Directory Level Cache Cleaning

    For directory level cache cleaning, fadvise(fd, DONTNEED) will clean
    all the page caches as well as unreferenced dentry caches and inode
    caches inside the directory fd.

    (1) For page cache cleaning, the policy in our original design is to
    collect those inodes not on any LRU list into our private list for
    further cleaning. However, as pointed out by Andrew and Dave, most
    inodes are actually on the LRU list, hence this policy would leave
    many inodes unprocessed. And, since we want to reuse inode->i_lru
    rather than add a new list_head field to the inode, we cannot tell
    whether an inode is on the superblock LRU list or on a private list.
    While fadvise() caller A is trying to collect an inode, another
    fadvise() caller B may already have gathered that inode into its
    private list, so A ends up grabbing the inode from B's list. Worse,
    operations on B's list are not synchronized among multiple fadvise()
    callers. To address this, we have two candidates:

    (a) Introduce a new inode state I_PRIVATE, indicating that the inode
    is on a private list. The flag is set when an inode is collected into
    a private list and cleared after page cache invalidation finishes. A
    fadvise() caller checks the flag before collecting an inode into its
    private list. This avoids the race between one fadvise() caller
    adding a new inode to its list and another caller grabbing an inode
    from that list.

    (b) Introduce a global list as well as a global lock. The inodes to be
    manipulated are always collected into the global list, protected by
    the global lock. Given that cache cleaning is not a frequent
    operation, the performance impact is negligible.

    (2) For dentry cache cleaning, shrink_dcache_parent() meets most of
    our demands, except that it does not take permissions into account:
    the caller should not touch dentries and inodes for which it lacks
    appropriate permission. There are two ways to perform the check:

    (a) Check if the caller has permission on the parent directory, i.e.,
    inode_permission(dentry->d_parent->d_inode, MAY_WRITE | MAY_EXEC)

    (b) Check if the caller has permission on the corresponding inode, i.e.,
    (inode_owner_or_capable(dentry->d_inode) || capable(CAP_SYS_ADMIN))

    (3) For dentry cache cleaning, once dentries are freed there seems to
    be no easy way to walk all inodes inside a specific directory. Our
    idea is, before freeing those unreferenced dentries, to gather the
    inodes they reference into a private list, __iget() the inodes, and
    mark them I_PRIVATE (if the I_PRIVATE scheme is acceptable). We can
    then still find those inodes from the private list to free them.

    (4) For inode cache cleaning, in most situations iput_final() puts
    unreferenced inodes onto the superblock LRU list rather than freeing
    them. There seems to be no handy API for freeing the inodes on our
    private list. The process could be: for each inode on our list, hold
    the inode lock, clear I_PRIVATE, detach it from the list, and
    atomically decrease its reference count. If the reference count
    reaches zero, there are two possible ways:

    (a) Introduce a new inode state I_FORCE_FREE, set it, then pass the
    inode to iput_final(). iput_final() gets a tiny modification to
    recognize the flag, in which case it invokes evict() to free the
    inode rather than adding it to the superblock LRU list.

    (b) Wrap iput_final() into __iput_final(struct inode *inode, bool
    force_free). We call __iput_final(inode, TRUE), and define
    iput_final() as a static inline calling __iput_final(inode, FALSE).
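
To make candidates (1)(a) and (4)(b) concrete, here is a small userspace
model, not kernel code. Everything in it is an assumption made for
illustration: struct model_inode, MODEL_I_PRIVATE, and the model_* functions
only imitate the proposed I_PRIVATE claim and the force_free variant of the
final put. An atomic test-and-set stands in for checking the flag under the
inode lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the proposed I_PRIVATE inode state bit. */
#define MODEL_I_PRIVATE 0x1u

/* Possible outcomes of dropping the private list's reference. */
enum model_put_result { MODEL_STILL_IN_USE, MODEL_ON_LRU, MODEL_EVICTED };

/* Toy inode: only the fields the scheme touches. */
struct model_inode {
    atomic_uint state;   /* models inode->i_state */
    atomic_int  count;   /* models inode->i_count */
};

/*
 * Claim an inode for the caller's private list. Only the caller that
 * flips the flag from clear to set wins; a second caller backs off,
 * which is the race that candidate (1)(a) wants to close.
 */
static bool model_try_collect(struct model_inode *inode)
{
    unsigned int old = atomic_fetch_or(&inode->state, MODEL_I_PRIVATE);
    if (old & MODEL_I_PRIVATE)
        return false;                    /* already on someone's private list */
    atomic_fetch_add(&inode->count, 1);  /* models __iget() */
    return true;
}

/*
 * Release a collected inode. With force_free set, the last reference
 * leads to eviction instead of parking the inode on the LRU list,
 * mirroring __iput_final(inode, TRUE) in candidate (4)(b).
 */
static enum model_put_result model_put(struct model_inode *inode,
                                       bool force_free)
{
    atomic_fetch_and(&inode->state, ~MODEL_I_PRIVATE);
    if (atomic_fetch_sub(&inode->count, 1) != 1)
        return MODEL_STILL_IN_USE;       /* other references remain */
    return force_free ? MODEL_EVICTED : MODEL_ON_LRU;
}
```

The model leaves out the list manipulation and page cache invalidation
between collect and put; it only shows the claim protocol and the two exits
of the final put.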

    2.2 POSIX_FADV_NOREUSE Implementation

    Our key idea is to translate 'The application will access the page
    once' into 'The access leaves no side effect on the page'. In the
    current MM implementation, a normal access has a side effect on the
    page accessed, i.e., it increases the temperature of the page, moving
    it from inactive to active or from unreferenced to referenced. In
    contrast to a normal access, NOREUSE is intended to tell the MM system
    that the access will leave the page as it is. This can be detailed as
    follows:

    (a) If a page is accessed for the first time, after a NOREUSE access
    it is kept inactive and unreferenced, so it will potentially get
    reclaimed soon since it has the lowest temperature, unless a later
    non-NOREUSE access increases its temperature. We do not explicitly
    free the page immediately after access, for three reasons. First, the
    semantics of NOREUSE differ from DONTNEED: NOREUSE does not mean the
    page should be dropped immediately. Second, synchronously freeing the
    page will more or less slow down read performance. Last, a
    near-future reference to the page by other applications will have a
    chance to hit in the cache.

    (b) If a page has been accessed before, in other words, it is active
    or referenced, then it may belong to the working set of other
    applications, and will very likely be accessed again. NOREUSE just
    makes a silent access, without changing any status of the page.
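
Cases (a) and (b) can be pictured with a simplified userspace model of page
temperature. This is a sketch under assumptions, not kernel code: struct
model_page and both functions are hypothetical, and
model_mark_page_accessed() only imitates the two-step promotion (unreferenced
to referenced, then inactive to active) that the text attributes to a normal
access.

```c
#include <stdbool.h>

/* Toy page: just the two bits that encode its temperature. */
struct model_page {
    bool referenced;
    bool active;
};

/*
 * Simplified model of a normal access:
 *   inactive, unreferenced -> inactive, referenced
 *   inactive, referenced   -> active, unreferenced
 *   active, unreferenced   -> active, referenced
 */
static void model_mark_page_accessed(struct model_page *p)
{
    if (!p->active && p->referenced) {
        p->active = true;
        p->referenced = false;
    } else if (!p->referenced) {
        p->referenced = true;
    }
}

/*
 * A read through a file handle. With noreuse set, the access is
 * "silent": a cold page stays cold (case a) and a warm page keeps
 * whatever state other users gave it (case b).
 */
static void model_read_access(struct model_page *p, bool noreuse)
{
    if (!noreuse)
        model_mark_page_accessed(p);
}
```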

    Another assumption is that file-wide NOREUSE is enough to capture most
    of the usages; the fine granularity of interval-level NOREUSE is not
    desirable given its rare use and its implementation complexity. This
    results in the following simple NOREUSE implementation:

    (1) Introduce a new fmode FMODE_NOREUSE, set it on when calling
    fadvise(NOREUSE)

So when will this flag be cleared? Do you need to clear it while setting
FMODE_RANDOM, FMODE_NORMAL, FMODE_SEQ etc., like
https://lkml.org/lkml/2012/2/11/133 does?

That is open for discussion. FMODE_RANDOM, FMODE_NORMAL, FMODE_SEQ and
WILLNEED are all supposed to guide readahead, something that happens
before the read. NOREUSE is supposed to suggest something after the read,
so they do not seem to contradict each other. For example,
FMODE_SEQ | FMODE_NOREUSE could give a better indication of the behavior
of rsync. DONTNEED is done synchronously, so it does not seem to
contradict NOREUSE either.

    (2) do_generic_file_read():
    From:
    if (prev_index != index || offset != prev_offset)
        mark_page_accessed(page);
    To:
    if ((prev_index != index || offset != prev_offset) &&
        !(filp->f_mode & FMODE_NOREUSE))
        mark_page_accessed(page);

    In total, the change is no more than ten lines of code.

    Cheers,
    Li Wang
