On 05/31/2013 11:05 AM, Jay Vyas wrote:
> Is there any value in / way to tell all the gluster nodes to make a file
> highly available, potentially at the cost of consistency (i.e., forget
> about locks for all files named XXXX and cache them on local disk)?
>
> Scenario: Imagine I have a workflow that processes 1 million files, and I
> want to compare all 1 million files to all the words in, say, a set of
> ten files, each of which is 10MB.
>
> It would be easy to cache the ten files (100MB of data) on every local
> gluster node. Or even in memory, for that matter.

I'm assuming that by "gluster node" you're referring to a gluster client.
Given that, fuse already does this kind of read-only caching. The caveat
to be aware of is that the default behavior includes an invalidate-on-open
heuristic. So if your implementation involves repeated open() calls on
your 10x10MB files (e.g., running a script for every source file you're
checking against), you could be repeatedly reading, caching, and flushing
the very data you want to retain. In that case, you might want to try the
--fopen-keep-cache glusterfs (mount) option to bypass said behavior.

I suppose the subsequent question is whether the reads of the 1 million
source files push out that other 100MB, but I _think_ this is something
the VM (virtual memory) subsystem should get right over time, i.e., via
repeated accesses of that 100MB set. That's probably something that
warrants experimentation to verify, though.

Brian

> Admittedly... I'm not an expert on disk caching, so maybe this is
> already done for us using heuristics... and it's just a matter of time
> before FUSE, the underlying filesystem, or the gluster mount figures out
> that a file is important and starts caching it in some magical sort of
> way.
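
P.S. To make the --fopen-keep-cache suggestion concrete, the mount would
look roughly like the following. The server, volume, and mount point names
here are made up for illustration, and whether your mount helper passes
the option through may depend on your version; if it doesn't, you can
invoke the fuse client directly:

  # via the mount helper
  mount -t glusterfs -o fopen-keep-cache server1:/myvol /mnt/gluster

  # or by invoking the glusterfs fuse client directly
  glusterfs --fopen-keep-cache -s server1 --volfile-id myvol /mnt/gluster

Either way, keeping the ten dictionary files open in your comparison
process for its whole run (rather than re-opening them once per source
file) sidesteps the invalidate-on-open behavior entirely, since the
invalidation is only triggered by open().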