On Mon, Oct 19, 2015 at 12:52 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> Your assumption doesn't match what I've seen (in high energy physics
> (HEP)). The implicit hint you describe is much more apparent when
> clients use object storage APIs like S3 or one of the oodles of
> network storage systems we use in high energy physics. But NFS-like
> shared filesystems are different. This is where we'll put
> applications, libraries, configurations, configuration _data_ -- all
> things which indeed _are_ likely to be re-used by the same client many
> times. Consider these use-cases: a physicist is developing an analysis
> which is linked against 100's of headers in CephFS, recompiling many
> times, and also 100's of other users doing the same with the same
> headers; or a batch processing node is running the same data analysis
> code (hundreds/thousands of libraries in CephFS) on different input
> files.
>
> Files are re-accessed so often in HEP that we developed a new
> immutable-only, cache-forever filesystem for application distribution
> (CVMFS). And in places where we use OpenAFS we make use of readonly
> replicas to ensure that clients can cache as often as possible.

You've caught me being a bit optimistic :-) In my idealistic version of
reality, people would only use shared filesystems for truly shared data,
rather than things like headers and libraries, but in the real world it
isn't so (you are correct).

It's a difficult situation: for example, I remember a case where someone
wanted to improve small file open performance on Lustre, because they
were putting their python site-packages directory in Lustre, and then
starting thousands of processes that all wanted to scan it at the same
time on startup. The right answer was "please don't do that! Just use a
local copy of site-packages on each node, or put it on a small SSD
filer", but in practice it is much more convenient for people to have a
single global filesystem.

Other examples spring to mind:

 * Home directories where someone's browser cache ends up on a
   triply-redundant distributed filesystem (it's completely disposable
   data! This is so wasteful!)
 * Small file create workloads from compilations (put your .o files in
   /tmp, not on a million dollar storage cluster!)

These are arguably workloads that just shouldn't be on a distributed
filesystem to begin with, but unfortunately developers do not always get
to choose the workloads that people will run :-)

In the future, there could be scope for doing interesting things with
layouts to support some radically different policies, e.g. having "read
replica" directories that do really coarse-grained caching and rely on a
global broadcast to do invalidation. The trouble is that these things
are a lot of work to implement, and they still rely on the user to
remember to set the right flags on the right directories.

It would be pretty interesting though to have e.g. an intern spend some
time coming up with a caching policy that worked super-well for your use
case, so that we had a better idea of how much work it would really be.
A project like this could be something like taking the CVMFS/OpenAFS
behaviours that you like, and building them into CephFS as optional
modes.

John
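
As a purely illustrative aside on "setting the right flags on the right
directories": if such a read-replica policy ever existed, one plausible
shape for it would be a per-directory virtual extended attribute, in the
same style as the real ceph.dir.layout.* and ceph.quota.* xattrs that
CephFS already exposes. The attribute name "ceph.dir.cache_policy", the
"read_replica" value, and the /cephfs/sw/releases path below are all
hypothetical -- nothing like this exists in CephFS today; this is just a
sketch of the interface shape.

    import os

    # Hypothetical: tag a directory so clients would treat its contents as
    # coarse-grained, cache-forever data, invalidated only by a global
    # broadcast. Modelled on the real ceph.dir.layout.* virtual xattrs, but
    # this particular attribute is invented for illustration and would be
    # rejected by a real CephFS mount.

    def mark_read_replica(path: str) -> None:
        os.setxattr(path, b"ceph.dir.cache_policy", b"read_replica")

    def current_cache_policy(path: str) -> str:
        # Raises OSError (ENODATA) if the attribute has never been set.
        return os.getxattr(path, b"ceph.dir.cache_policy").decode()

    # Example usage on a hypothetical software-distribution tree:
    #   mark_read_replica("/cephfs/sw/releases")
    #   print(current_cache_policy("/cephfs/sw/releases"))

Using xattrs would at least fit the existing pattern, since layouts and
quotas (ceph.dir.layout, ceph.quota.max_bytes) are already controlled
that way -- though it still leaves the burden of remembering to set the
flag on the user, as noted above.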