Re: CephFS and page cache

On Mon, Oct 19, 2015 at 12:34 PM, John Spray <jspray@xxxxxxxxxx> wrote:
> On Mon, Oct 19, 2015 at 8:59 AM, Burkhard Linke
> <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>> Hi,
>>
>> On 10/19/2015 05:27 AM, Yan, Zheng wrote:
>>>
>>> On Sat, Oct 17, 2015 at 1:42 AM, Burkhard Linke
>>> <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I've noticed that CephFS (both ceph-fuse and the kernel client in
>>>> version 4.2.3) removes files from the page cache as soon as they are
>>>> no longer in use by any process.
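
For anyone who wants to check this behaviour on their own mount, here is
a minimal sketch (plain Linux, nothing CephFS-specific) that mmaps a file
and asks mincore(2) how many of its pages are currently resident in the
page cache; run it against a large file before and after the reading
process exits:

    /* resident.c -- rough check of how much of a file is in the page cache. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0 || st.st_size == 0) {
            fprintf(stderr, "cannot stat file, or file is empty\n");
            return 1;
        }
        long psz = sysconf(_SC_PAGESIZE);
        size_t npages = (st.st_size + psz - 1) / psz;

        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        unsigned char *vec = malloc(npages);
        if (mincore(map, st.st_size, vec) < 0) { perror("mincore"); return 1; }

        size_t resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;   /* bit 0 set => page is resident */
        printf("%zu of %zu pages resident\n", resident, npages);

        munmap(map, st.st_size);
        free(vec);
        close(fd);
        return 0;
    }
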
>>>>
>>>> Is this intended behaviour? We use CephFS as a replacement for NFS
>>>> in our HPC cluster. It should serve large files which are read by
>>>> multiple jobs on multiple hosts, so keeping them in the page cache
>>>> over the duration of several job invocations is crucial.
>>>
>>> Yes. The MDS needs resources to track cached data, and we don't want
>>> the MDS to use too much.
>>>
>>>> Mount options are defaults,noatime,_netdev (+ extra options for the
>>>> kernel client). Is there an option to keep data in the page cache,
>>>> just like any other filesystem?
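
For context, a kernel-client /etc/fstab line for such a mount typically
looks something like the following; the monitor address, mount point and
secret file path are placeholders, not taken from this thread:

    10.0.0.1:6789:/  /mnt/cephfs  ceph  defaults,noatime,_netdev,name=admin,secretfile=/etc/ceph/admin.secret  0  2
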
>>>
>>> So far there is no option to do that. Later, we may add an option to
>>> keep the cached data for a few seconds.
>>
>>
>> This renders CephFS useless for almost any HPC cluster application. And
>> keeping data for a few seconds is not a solution in most cases.
>
> While I appreciate your frustration, that isn't an accurate statement.
> For example, many physics HPC workloads use a network filesystem for
> snapshotting their progress, where they dump their computed dataset at
> regular intervals.  In these instances, having the data linger in the
> page cache is rarely, if ever, useful.
>
> Moreover, in the general case of a shared filesystem with many nodes,
> one should not assume that the same client will access the same data
> repeatedly: the very use of a shared filesystem is an implicit hint
> that applications are likely to want to access that data from
> different nodes, rather than from the same node over and over.  Clearly
> that is by no means true in all cases, but I think you may be
> overestimating the generality of your own workload (not that we don't
> want to make it faster for you).
>

Your assumption doesn't match what I've seen in high energy physics
(HEP). The implicit hint you describe is much more apparent when
clients use object storage APIs like S3 or one of the oodles of
network storage systems we use in HEP. But NFS-like shared filesystems
are different. This is where we'll put applications, libraries,
configurations, configuration _data_ -- all things which indeed _are_
likely to be re-used by the same client many times. Consider these
use cases: a physicist developing an analysis that is linked against
hundreds of headers in CephFS, recompiling many times, while hundreds
of other users do the same with the same headers; or a batch
processing node running the same data analysis code (hundreds or
thousands of libraries in CephFS) on different input files.

Files are re-accessed so often in HEP that we developed a new
immutable-only, cache-forever filesystem for application distribution
(CVMFS). And in places where we use OpenAFS we make use of readonly
replicas to ensure that clients can cache as often as possible.

>> CephFS supports capabilities to manage access to objects, enforce
>> consistency of data, etc. IMHO a sane way to handle the page cache is
>> to use a capability to inform the MDS about cached objects; as long as
>> no other client claims write access to an object or its metadata, the
>> cached copy is considered consistent. Upon write access the client
>> should drop the capability (and thus remove the object from the page
>> cache). If another process tries to access a cached object with an
>> intact 'cache' capability, it may be promoted to a read/write
>> capability.
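
To make those transitions concrete, here is a toy sketch of the proposal
as described above. This is not Ceph code and not CephFS's real
capability model; the state and event names are made up for
illustration:

    /* Toy model of the proposed 'cache' capability -- not Ceph code.
     * A client holds NONE, CACHE (read-only, data may stay in the page
     * cache) or RW; a write elsewhere forces the holder back to NONE. */
    #include <stdio.h>
    #include <string.h>

    enum cap_state { CAP_NONE, CAP_CACHE, CAP_RW };

    static const char *names[] = { "NONE", "CACHE", "RW" };

    /* What a client holding 'held' should do when 'event' happens. */
    static enum cap_state next_state(enum cap_state held, const char *event)
    {
        if (held == CAP_RW && strcmp(event, "close") == 0)
            return CAP_CACHE;   /* keep data cached after the file is closed */
        if (held == CAP_CACHE && strcmp(event, "local_open_rw") == 0)
            return CAP_RW;      /* promote the cache cap to read/write */
        if (strcmp(event, "other_client_write") == 0)
            return CAP_NONE;    /* drop the cap, invalidate the page cache */
        return held;
    }

    int main(void)
    {
        const char *events[] = { "close", "local_open_rw", "close",
                                 "other_client_write" };
        enum cap_state s = CAP_RW;
        for (int i = 0; i < 4; i++) {
            s = next_state(s, events[i]);
            printf("%-19s -> %s\n", events[i], names[s]);
        }
        return 0;
    }
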
>
> This is essentially what we already do, except that we pro-actively
> drop the capability when files are closed, rather than keeping it
> around on the client in case it's needed again.
>
> Having those caps linger on a client is a tradeoff:
>  * While it makes subsequent cached reads from the original client
> nice and fast, it adds latency for any other client that wants to open
> the file.
>  * It also adds latency for the original client when it wants to open
> many other files, because it will have to wait for the original file's
> capabilities to be given up before it has room in its metadata cache
> to open other files.
>  * It also creates confusion if someone opens a big file, then closes
> it, then wonders why their ceph-fuse process is still sitting on gigs
> of memory.
>
> Further, as Zheng pointed out, the design of CephFS requires that
> whenever a client has capabilities on a file, that file must also be
> in cache on the MDSs.  Because there are many more clients than MDSs,
> clients keeping comparatively modest numbers of capabilities can cause
> a much more significant increase in the burden on the MDSs.  Even if
> this is within the MDS cache limit, it still has the downside of
> preventing the MDS from caching other metadata that other clients
> might want to use.
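
(To put illustrative numbers on that: if 1,000 clients each hung on to
caps for 10,000 recently closed files, the MDSs would have to keep
roughly 10,000,000 inodes pinned in cache just for those lingering caps.
The figures are chosen only to show the scale, not taken from this
thread.)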
>
> So: the key thing to realise is that caching behaviour is full of
> tradeoffs, and this is really something that needs to be tunable, so
> that it can be adapted to the differing needs of different workloads.
> Having an optional "hold onto caps for N seconds after file close"
> sounds like it would be the right tunable for your use case, right?
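
If such a knob existed, it might be expressed as something like the
snippet below; the option name is purely hypothetical and no such
setting is being claimed to exist:

    [client]
        # hypothetical: hold client caps (and cached data) for N seconds
        # after the last close, instead of releasing them immediately
        client cap linger seconds = 30
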
>

I think that would help. Caching is pretty essential, so we'd buy more
MDSes and loads of RAM if CephFS became a central part of our
infrastructure.

But looking forward, if CephFS could support the immutable bit --
chattr +i <file> -- then maybe the MDS wouldn't need to track clients
who have such files cached. (Immutable files would be useful for other
reasons too, like archiving!)
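
For reference, on filesystems that do support the flag (e.g. ext4) it is
set and checked like this; the path is just an example:

    chattr +i /srv/sw/lib/libanalysis.so    # mark the file immutable
    lsattr /srv/sw/lib/libanalysis.so       # the 'i' attribute should now appear
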

-- 
Dan, CERN IT


> John
>
>> I haven't dug into the details of either capabilities or the kernel
>> page cache, but the method described above should be very similar to
>> the existing read-only capability. I don't know whether there's a kind
>> of eviction callback in the page cache that CephFS can use to update
>> capabilities if an object is removed from the page cache (e.g. due to
>> memory pressure), but I'm pretty sure that other filesystems like NFS
>> also need to keep track of what's cached.
>>
>> This approach will probably increase resource usage for both the MDS
>> and CephFS clients, but the benefits are obvious. For use cases with
>> limited resources, the MDS may refuse the 'cache' capability to a
>> client to reduce the memory footprint.
>>
>> Just my 2 ct and regards,
>>
>> Burkhard
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


