Re: [LSF/MM ATTEND] Fadvise Extensions for Directory Level Cache Cleaning and POSIX_FADV_NOREUSE

Qing Wei <weiqing369@xxxxxxxxx> · Thu, 23 Jan 2014 21:19:54 +0800

Hi,

On 01/20/2014 10:56 PM, Li Wang wrote:

Hello,

I would appreciate a chance to discuss the fadvise extension topic at
the upcoming LSF/MM summit. I am also very interested in the topics on
VFS, MM, and SSD optimization, as well as ext4, xfs, ceph and so on.

Over the last year I have been involved in Ceph development. The
features done or ongoing include punch hole support, inline data
support, cephfs quota support, and cephfs fuse file lock support, as
well as some bug fixes and performance evaluations.

The proposal is below; comments and suggestions are welcome.

Fadvise Extensions for Directory Level Cache Cleaning and
POSIX_FADV_NOREUSE

1 Motivation

1.1 Directory Level Cache Cleaning

VFS relies on an LRU-like page cache eviction algorithm to reclaim
cache space. Since LRU is not aware of application semantics, it may
evict pages that are about to be referenced, resulting in severe
performance degradation due to cache thrashing, especially under high
memory pressure. Applications have the most semantic knowledge, so they
can do better if they are given the chance. This motivates giving
applications more ability to manipulate the VFS cache.

Currently, Linux supports file-system-wide cache cleaning through the
proc interface 'drop_caches', but it is very coarse-grained and was
originally proposed for debugging. The other option is file-level page
cache cleaning through 'fadvise'. However, since there is no way of
determining whether a path name is in the dentry cache, simply calling
fadvise(name, DONTNEED) will very likely pollute the cache rather than
clean it, because opening a file just to advise on it pulls its dentry
and inode into the cache. Even if a cache query API were available, it
would incur heavy system call overhead, especially in workloads with
massive numbers of small files. This motivates extending fadvise() to
support directory level cache cleaning. The original implementation is
available at https://lkml.org/lkml/2013/12/30/147 and has received some
constructive comments. We think some design points need to be put under
discussion, and we summarize them in Section 2.1.
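For reference, the existing file-level route looks roughly like the
sketch below in user space. It is a minimal illustration, not part of
the proposal: the helper name drop_file_cache() is made up here, and
the only real interface used is posix_fadvise() with
POSIX_FADV_DONTNEED. Note that the open() is what drags the dentry and
inode into the cache before any page can be dropped.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to drop the cached pages of one
 * file, addressed by path. Returns 0 on success, -1 if the file cannot
 * be opened, or the positive error number from posix_fadvise().
 */
int drop_file_cache(const char *path)
{
    /* The path lookup itself populates the dentry and inode caches. */
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* offset 0 with len 0 covers the file from start to end. */
    int err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    close(fd);
    return err;
}
```

Cleaning a directory tree this way needs one open/fadvise/close round
trip per file, which is the per-file system call cost mentioned above.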

1.2 POSIX_FADV_NOREUSE

POSIX_FADV_NOREUSE is useful for backup and data streaming
applications. There have already been some efforts to implement
POSIX_FADV_NOREUSE; the latest seems to be
https://lkml.org/lkml/2012/2/11/133. The alternatives are (a) use
fadvise(DONTNEED) instead; (b) use a container-based approach, such as
setting memory.file.limit_in_bytes. However, both (a) and (b) have
limitations. (a) may impolitely destroy other applications' working
sets, which is not desirable; (b) is heavy-handed, and the threshold
may have to be carefully tuned, otherwise it may cause applications to
start swapping or even worse. In addition, we are not sure whether it
shares the same issue as (a). This motivates developing a simple yet
efficient POSIX_FADV_NOREUSE implementation.
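To make alternative (a) concrete, here is a minimal user-space sketch
of a streaming reader that issues DONTNEED on each range right after
consuming it. The function name stream_and_drop() and the 64 KiB chunk
size are assumptions chosen for illustration. The advice applies to the
file's cached pages no matter which process brought them in, which is
how this pattern can hurt another application's working set.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: read a whole file sequentially and, after each
 * chunk, advise the kernel that the range just consumed is no longer
 * needed. Returns the number of bytes read, or -1 on error.
 */
long stream_and_drop(const char *path)
{
    char buf[64 * 1024];
    off_t done = 0;
    ssize_t n;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* Best effort: the advice may be ignored, so the result is unchecked. */
        (void)posix_fadvise(fd, done, n, POSIX_FADV_DONTNEED);
        done += n;
    }

    close(fd);
    return n < 0 ? -1 : (long)done;
}
```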

2 Designs to be discussed

Since both are advisory interfaces, the overall idea behind our design
is to minimize modifications to the current MM magic and keep the
implementation as simple as possible.

2.1 Directory Level Cache Cleaning

For directory level cache cleaning, fadvise(fd, DONTNEED) will clean
all the page caches as well as the unreferenced dentry caches and inode
caches inside the directory fd.

(1) For page cache cleaning, the policy in our original design is to
collect those inodes not on any LRU list into our private list for
further cleaning. However, as pointed out by Andrew and Dave, most
inodes are actually on the LRU list, so this policy will leave many
inodes unprocessed. Also, since we want to reuse inode->i_lru rather
than add a new list_head field to the inode, we cannot determine
whether an inode is on the superblock LRU list or on a private list.
While fadvise() caller A is trying to collect an inode, another
fadvise() caller B may already have gathered the inode into its private
LRU list, so A ends up grabbing the inode from B's list. Worse, the
operations on B's list are not synchronized among multiple fadvise()
callers. To address this, we have two candidates:

(a) Introduce a new inode state I_PRIVATE, indicating that the inode is
on a private list. When an inode is collected into a private list, the
flag is set on it; it is cleared after page cache invalidation
finishes. A fadvise() caller checks the flag before collecting an inode
into its private list. This avoids the race where one fadvise() caller
is adding a new inode to its list while another caller is grabbing an
inode from that list.

(b) Introduce a global list as well as a global lock. The inodes to be
manipulated are always collected into the global list, protected by the
global lock. Given that cache cleaning is not a frequent operation, the
performance impact should be negligible.

(2) For dentry cache cleaning, shrink_dcache_parent() meets most of our
needs, except that it does not take permission into account: the caller
should not touch dentries and inodes for which it does not have
appropriate permission. There are two ways to perform the check:

(a) Check if the caller has permission on the parent directory, i.e.,

    inode_permission(dentry->d_parent->d_inode, MAY_WRITE | MAY_EXEC)

(b) Check if the caller has permission on the corresponding inode, i.e.,

    (inode_owner_or_capable(dentry->d_inode) || capable(CAP_SYS_ADMIN))

(3) For dentry cache cleaning, once dentries are freed there seems to
be no easy way to walk all inodes inside a specific directory. Our idea
is that, before freeing those unreferenced dentries, we gather the
inodes they reference into a private list, __iget() the inodes, and
mark them I_PRIVATE (if the I_PRIVATE scheme is acceptable). From that
list we can still find the inodes and free them later.

(4) For inode cache cleaning, in most situations iput_final() puts
unreferenced inodes onto the superblock LRU list rather than freeing
them. There seems to be no handy API for freeing the inodes on our
private list. The process could be: for each inode on our list, take
the inode lock, clear I_PRIVATE, detach it from the list, and
atomically decrement its reference count. If the reference count
reaches zero, there are two possible ways:

(a) Introduce a new inode state I_FORCE_FREE and set it, then pass the
inode to iput_final(). iput_final() is modified slightly to recognize
the flag and then invoke evict() to free the inode rather than adding
it to the superblock LRU list.

(b) Wrap iput_final() into __iput_final(struct inode *inode, bool
force_free). We call __iput_final(inode, TRUE), and iput_final() is
defined as a static inline wrapper around __iput_final(inode, FALSE).
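The private-list lifecycle described in (1)(a), (3) and (4)(b) can be
modelled in user space as a rough sketch. Everything below is a toy:
struct toy_inode, TOY_I_PRIVATE and the toy_* functions are invented
for illustration and are not kernel code. An atomic test-and-set stands
in for checking the flag under the inode lock, and two booleans stand
in for "parked on the superblock LRU" versus "evicted". Unlike the
kernel version, the two steps in toy_release() are not performed under
a single lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_I_PRIVATE 0x1u  /* hypothetical flag, after the proposed I_PRIVATE */

struct toy_inode {
    atomic_uint state;  /* stands in for inode->i_state */
    atomic_int  count;  /* stands in for inode->i_count */
    bool on_lru;        /* parked on the toy superblock LRU */
    bool evicted;       /* freed by the toy evict() */
};

/* Last-reference handling: evict when forced, otherwise park on the LRU. */
static void toy_iput_final(struct toy_inode *inode, bool force_free)
{
    if (force_free)
        inode->evicted = true;
    else
        inode->on_lru = true;
}

/*
 * Claim an inode for the caller's private list. The atomic test-and-set
 * ensures that only one of several concurrent callers wins the claim.
 */
bool toy_collect(struct toy_inode *inode)
{
    unsigned int old = atomic_fetch_or(&inode->state, TOY_I_PRIVATE);

    if (old & TOY_I_PRIVATE)
        return false;                    /* already on someone's list */
    atomic_fetch_add(&inode->count, 1);  /* plays the role of __iget() */
    return true;
}

/* Release a claimed inode; force-free it if this was the last reference. */
void toy_release(struct toy_inode *inode)
{
    atomic_fetch_and(&inode->state, ~TOY_I_PRIVATE);
    if (atomic_fetch_sub(&inode->count, 1) == 1)
        toy_iput_final(inode, true);
}
```

In this model a second toy_collect() on the same inode fails until
toy_release() has run, and an inode whose only reference was the
private list ends up evicted instead of going back onto the LRU.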

2.2 POSIX_FADV_NOREUSE Implementation

The key idea is to translate 'The application will access the page
once' into 'The access leaves no side effect on the page'. In the
current MM implementation, a normal access has a side effect on the
page accessed, i.e., it raises the temperature of the page, either from
inactive to active or from unreferenced to referenced. In contrast to a
normal access, NOREUSE is intended to tell the MM system that the
access will leave the page as it is. In detail:

(a) If a page is accessed for the first time by a NOREUSE access, it is
kept inactive and unreferenced, so it will potentially get reclaimed
soon since it has the lowest temperature, unless a later non-NOREUSE
access raises its temperature. We do not explicitly free the page
immediately after the access, for three reasons. First, the semantics
of NOREUSE differ from DONTNEED: NOREUSE does not mean the page should
be dropped immediately. Second, synchronously freeing the page would
more or less slow down read performance. Last, a near-future reference
to the page by another application still has a chance to hit in the
cache.

(b) If a page has been accessed before, in other words it is active or
referenced, then it may belong to the working set of other applications
and will very likely be accessed again. NOREUSE just makes a silent
access, without changing any status of the page.

Another assumption is that file-wide NOREUSE is enough to cover most of
the use cases; the fine granularity of interval-level NOREUSE is not
desirable given its rare use and its implementation complexity. This
results in the following simple NOREUSE implementation:

(1) Introduce a new fmode FMODE_NOREUSE, and set it when calling
fadvise(NOREUSE)

So when will this flag be cleared? Do you need to clear it when setting
FMODE_RANDOM, FMODE_NORMAL, FMODE_SEQ etc., as
https://lkml.org/lkml/2012/2/11/13 does?

(2) do_generic_file_read():

From:

    if (prev_index != index || offset != prev_offset)
        mark_page_accessed(page);

To:

    if ((prev_index != index || offset != prev_offset) &&
        !(filp->f_mode & FMODE_NOREUSE))
        mark_page_accessed(page);

In total, the change is no more than ten lines of code.
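The intended effect of the NOREUSE check can be illustrated with a
small user-space model. This is only a sketch under simplifying
assumptions: struct toy_page, TOY_FMODE_NOREUSE and the toy_* functions
are made up for illustration, the promotion ladder is a simplified
rendering of what mark_page_accessed() does (unreferenced to
referenced, then inactive to active), and the prev_index/prev_offset
part of the condition is left out.

```c
#include <stdbool.h>

#define TOY_FMODE_NOREUSE 0x1u  /* hypothetical stand-in for FMODE_NOREUSE */

struct toy_page {
    bool active;      /* on the active list */
    bool referenced;  /* referenced bit */
};

/* Simplified promotion ladder modelled on mark_page_accessed(). */
void toy_mark_page_accessed(struct toy_page *page)
{
    if (!page->active && page->referenced) {
        page->active = true;       /* inactive + referenced -> active */
        page->referenced = false;
    } else if (!page->referenced) {
        page->referenced = true;   /* first touch at this level */
    }
}

/* One read of a page through a file with the given toy f_mode bits. */
void toy_read_page(struct toy_page *page, unsigned int f_mode)
{
    /* A NOREUSE read is silent: the page's temperature is left alone. */
    if (!(f_mode & TOY_FMODE_NOREUSE))
        toy_mark_page_accessed(page);
}
```

In this model a cold page read only through a NOREUSE file stays
inactive and unreferenced (case (a)), while a page already warmed by a
normal reader keeps its state across NOREUSE reads (case (b)).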

Cheers,

Li Wang