Re: CephFS and page cache

Hi,

On 10/19/2015 12:34 PM, John Spray wrote:
On Mon, Oct 19, 2015 at 8:59 AM, Burkhard Linke
<Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
Hi,

On 10/19/2015 05:27 AM, Yan, Zheng wrote:
On Sat, Oct 17, 2015 at 1:42 AM, Burkhard Linke
<Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
Hi,

I've noticed that CephFS (both ceph-fuse and the kernel client in version
4.2.3) removes files from the page cache as soon as they are no longer in
use by any process.

Is this intended behaviour? We use CephFS as a replacement for NFS in our
HPC cluster. It should serve large files which are read by multiple jobs
on
multiple hosts, so keeping them in the page cache over the duration of
several job invocations is crucial.
Yes. The MDS needs resources to track the cached data, and we don't want
the MDS to use too many resources.

Mount options are defaults,noatime,_netdev (plus extra options for the
kernel client). Is there an option to keep data in the page cache, just
like any other filesystem?
So far there is no option to do that. Later, we may add an option to
keep the cached data for a few seconds.

This renders CephFS useless for almost any HPC cluster application. And
keeping data for a few seconds is not a solution in most cases.
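
For illustration, a minimal sketch (not a benchmark) of how this behaviour can be observed: read a large file on CephFS twice from the same client and compare the timings. If the pages are dropped on close, the second pass is not noticeably faster. The buffer size is arbitrary.

#!/usr/bin/env python3
# Rough check of whether file data survives in the page cache after close.
# Purely illustrative; timings depend on hardware, file size and on whether
# ceph-fuse or the kernel client is used.
import sys
import time

def timed_read(path, bufsize=4 * 1024 * 1024):
    start = time.monotonic()
    total = 0
    with open(path, 'rb') as f:          # open, read everything, close
        while True:
            chunk = f.read(bufsize)
            if not chunk:
                break
            total += len(chunk)
    return total, time.monotonic() - start

if __name__ == '__main__':
    path = sys.argv[1]                   # path of a large file on CephFS
    size, cold = timed_read(path)        # first pass: data comes from the OSDs
    size, warm = timed_read(path)        # second pass: only fast if the pages
                                         # survived the close()
    print("%d bytes: first read %.2fs, reread after close %.2fs"
          % (size, cold, warm))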
While I appreciate your frustration, that isn't an accurate statement.
For example, many physics HPC workloads use a network filesystem for
snapshotting their progress, where they dump their computed dataset at
regular intervals.  In these instances, having a copy of the data in
the page cache is rarely, if ever, useful.
I completely agree. HPC workloads differ between fields, and even within a single field the workloads may vary. The examples mentioned in another mail are just that: examples. We also have other applications and other workloads. Traditional HPC clusters used to be isolated with respect to both compute nodes and storage; access was only possible via a head node and maybe an NFS server. In our environment, compute and storage are much more tightly integrated into the users' workflows. I think the traditional model is becoming extinct in our field, given all the new developments of the last 15 years.


Moreover, in the general case of a shared filesystem with many nodes,
it is not to be assumed that the same client will be accessing the
same data repeatedly: there is an implicit hint in the use of a shared
filesystem that applications are likely to want to access that data
from different nodes, rather than the same node repeatedly.  Clearly
that is by no means true in all cases, but I think you may be
overestimating the generality of your own workload (not that we don't
want to make it faster for you).
As mentioned above, CephFS is not restricted to our cluster hosts. It is also available on interactive compute machines and even on desktops. And on these machines users expect data to be present in the cache if they start a computation a second time, e.g. after adjusting some parameters. I don't mind file access being slow on the batch machines, but our users do mind slow access in their day-to-day work.

CephFS supports capabilities to manage access to objects, enforce
consistency of data, etc. IMHO a sane way to handle the page cache is to
use a capability to inform the MDS about cached objects: as long as no
other client claims write access to an object or its metadata, the cached
copy is considered consistent. Upon such a write access the client should
drop the capability (and thus remove the object from the page cache). If
another process tries to access a cached object with an intact 'cache'
capability, it may be promoted to a read/write capability.
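
Roughly what I have in mind, as a sketch in Python (the state names and callbacks below are made up for illustration; they are not the actual CephFS capability protocol):

# Sketch of the proposed 'cache' capability.
NONE, CACHE, RW = "none", "cache", "rw"

class ClientObjectState:
    def __init__(self, name):
        self.name = name
        self.cap = NONE                 # capability currently held by this client

    def on_close(self):
        # Keep a cheap 'cache' capability instead of dropping everything,
        # so the pages may stay in the page cache.
        if self.cap == RW:
            self.cap = CACHE

    def on_revoke(self):
        # Another client claimed write access: drop the capability and
        # (in a real client) invalidate the cached pages.
        self.cap = NONE

    def on_reopen(self):
        # The 'cache' capability is still intact, so the cached pages are
        # known to be consistent and the capability can be promoted.
        if self.cap == CACHE:
            self.cap = RW
            return "served from page cache"
        return "must (re)read from the OSDs"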
This is essentially what we already do, except that we pro-actively
drop the capability when files are closed, rather than keeping it
around on the client in case it's needed again.

Having those caps linger on a client is a tradeoff:
  * While it makes subsequent cached reads from the original client
nice and fast, it adds latency for any other client that wants to open
the file.
I assume the same is also true in the current situation, if the file is already opened by another client.
  * It also adds latency for the original client when it wants to open
many other files, because it will have to wait for the original file's
capabilities to be given up before it has room in its metadata cache
to open other files.
  * It also creates confusion if someone opens a big file, then closes
it, and then wonders why their ceph-fuse process is still sitting on
gigabytes of memory.
I agree on that. ceph-fuse processes already become way too large in my opinion:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  902 root      20   0 3045056 1.680g   4328 S   0.0 21.5 338:23.78 ceph-fuse

(and that's just a web server with some Perl CGI stuff...)
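
For reference, this is roughly how I watch it grow; just a trivial sketch that reads VmRSS from /proc, and the interval is arbitrary:

#!/usr/bin/env python3
# Track the resident set size of a ceph-fuse process over time.
import sys
import time

def vmrss_kb(pid):
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])   # reported in kB
    return 0

if __name__ == "__main__":
    pid = int(sys.argv[1])                    # PID of the ceph-fuse process
    while True:
        print("%s VmRSS %.1f MiB"
              % (time.strftime("%H:%M:%S"), vmrss_kb(pid) / 1024.0))
        time.sleep(60)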

But the data itself should be stored in the page cache (I don't know whether a FUSE process can actually push data into the page cache).

Further, as Zheng pointed out, the design of CephFS requires that
whenever a client has capabilities on a file, that file must also be in
cache on the MDSs.  Because there are many more clients than MDSs, clients
keeping comparatively modest numbers of capabilities can cause a much
more significant increase in the burden on the MDSs.  Even if this stays
within the MDS cache limit, it still has the downside that it prevents
the MDS from caching other metadata that another client might want to
use.
A possible solution is a kind of disconnected cache that does not hold caps on the MDS. If the client has a way to validate a RADOS object in its cache, it could avoid reading it from the OSDs; the MDS would not be involved in the cache management at all.
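
Very roughly, and assuming the python-rados bindings, the validation could look like the sketch below. The pool name and the in-memory cache layout are made up, and since RADOS object mtimes only have second granularity this is just an illustration of the idea, not a correct implementation:

# Sketch of the 'disconnected cache' idea: keep object data locally, keyed
# by the object's mtime, and only read from the OSDs when the object changed.
import rados

local_cache = {}   # object name -> (mtime, data)

def read_object(ioctx, name):
    size, mtime = ioctx.stat(name)        # validation without involving the MDS
    cached = local_cache.get(name)
    if cached and cached[0] == mtime:
        return cached[1]                  # cached copy is still valid
    data = ioctx.read(name, length=size)  # fall back to reading from the OSDs
    local_cache[name] = (mtime, data)
    return data

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("data")        # pool name is just an example
print(len(read_object(ioctx, "someobject")))
ioctx.close()
cluster.shutdown()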

So: the key thing to realise is that caching behaviour is full of
tradeoffs, and this is really something that needs to be tunable, so
that it can be adapted to the differing needs of different workloads.
Having an optional "hold onto caps for N seconds after file close"
sounds like it would be the right tunable for your use case, right?
It would definitely help in many cases, yes. I also agree that completely revamping the CephFS client to make "proper" use of the page cache is a complex task in its own right. Maybe you can put it on the TODO list at a low priority.
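
Something along these lines is what I would imagine for such a tunable; just a sketch, the release/reopen hooks are invented and are not actual ceph-fuse code:

# Sketch of a "hold caps for N seconds after close" mechanism.
import threading

CAP_HOLD_SECONDS = 30.0                   # the hypothetical tunable

class DeferredCapRelease:
    def __init__(self, release_caps):
        self.release_caps = release_caps  # callback that really drops the caps
        self.timers = {}                  # inode -> pending release timer
        self.lock = threading.Lock()

    def on_close(self, inode):
        # Do not drop the caps immediately; schedule the drop so that a
        # quick re-open can still hit the page cache.
        with self.lock:
            timer = threading.Timer(CAP_HOLD_SECONDS, self._expire, args=(inode,))
            self.timers[inode] = timer
            timer.start()

    def on_open(self, inode):
        # Re-opened within the hold period: cancel the pending release.
        with self.lock:
            timer = self.timers.pop(inode, None)
        if timer:
            timer.cancel()
            return True                   # caps (and cached pages) still held
        return False                      # caps already dropped, normal open path

    def _expire(self, inode):
        with self.lock:
            self.timers.pop(inode, None)
        self.release_caps(inode)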

Regards,
Burkhard
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


