On Mon, Oct 19, 2015 at 12:52 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> Your assumption doesn't match what I've seen (in high energy physics
> (HEP)). The implicit hint you describe is much more apparent when
> clients use object storage APIs like S3 or one of the oodles of
> network storage systems we use in high energy physics. But NFS-like
> shared filesystems are different. This is where we'll put
> applications, libraries, configurations, configuration _data_ -- all
> things which indeed _are_ likely to be re-used by the same client many
> times. Consider these use-cases: a physicist is developing an analysis
> which is linked against 100's of headers in CephFS, recompiling many
> times, and also 100's of other users doing the same with the same
> headers; or a batch processing node is running the same data analysis
> code (hundreds/thousands of libraries in CephFS) on different input
> files.
>
> Files are re-accessed so often in HEP that we developed a new
> immutable-only, cache-forever filesystem for application distribution
> (CVMFS). And in places where we use OpenAFS we make use of readonly
> replicas to ensure that clients can cache as often as possible.

You've caught me being a bit optimistic :-) In my idealistic version of
reality, people would only use shared filesystems for truly shared data,
rather than things like headers and libraries, but in the real world it
isn't so (you are correct).

It's a difficult situation: for example, I remember a case where someone
wanted to improve small file open performance on Lustre, because they
were putting their python site-packages directory in Lustre, and then
starting thousands of processes that all wanted to scan it at the same
time on startup. The right answer was "please don't do that! Just use a
local copy of site-packages on each node, or put it on a small SSD
filer", but in practice it is much more convenient for people to have a
single global filesystem.

Other examples spring to mind:

 * Home directories where someone's browser cache ends up on a
   triply-redundant distributed filesystem (it's completely disposable
   data! This is so wasteful!)
 * Small file create workloads from compilations (put your .o files in
   /tmp, not on a million dollar storage cluster!)

These are arguably workloads that just shouldn't be on a distributed
filesystem to begin with, but unfortunately developers do not always get
to choose the workloads that people will run :-)

In the future, there could be scope for doing interesting things with
layouts to support some radically different policies, e.g. having "read
replica" directories that do really coarse-grained caching and rely on a
global broadcast to do invalidation. The trouble is that these things
are a lot of work to implement, and they still rely on the user to
remember to set the right flags on the right directories.

It would be pretty interesting though to have e.g. an intern spend some
time coming up with a caching policy that worked super-well for your use
case, so that we had a better idea of how much work it would really be.
A project like this could be something like taking the CVMFS/OpenAFS
behaviours that you like, and building them into CephFS as optional
modes.

John
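
As a purely illustrative aside on "setting the right flags on the right
directories": if such a read-replica policy ever existed, one plausible
shape for it would be a per-directory virtual extended attribute, in the
same style as the real ceph.dir.layout.* and ceph.quota.* xattrs that
CephFS already exposes. The attribute name "ceph.dir.cache_policy", the
"read_replica" value, and the /cephfs/sw/releases path below are all
hypothetical -- nothing like this exists in CephFS today; this is just a
sketch of the interface shape.

    import os

    # Hypothetical: tag a directory so clients would treat its contents as
    # coarse-grained, cache-forever data, invalidated only by a global
    # broadcast. Modelled on the real ceph.dir.layout.* virtual xattrs, but
    # this particular attribute is invented for illustration and would be
    # rejected by a real CephFS mount.

    def mark_read_replica(path: str) -> None:
        os.setxattr(path, b"ceph.dir.cache_policy", b"read_replica")

    def current_cache_policy(path: str) -> str:
        # Raises OSError (ENODATA) if the attribute has never been set.
        return os.getxattr(path, b"ceph.dir.cache_policy").decode()

    # Example usage on a hypothetical software-distribution tree:
    #   mark_read_replica("/cephfs/sw/releases")
    #   print(current_cache_policy("/cephfs/sw/releases"))

Using xattrs would at least fit the existing pattern, since layouts and
quotas (ceph.dir.layout, ceph.quota.max_bytes) are already controlled
that way -- though it still leaves the burden of remembering to set the
flag on the user, as noted above.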