Luke McGregor wrote:
We are currently experimenting with running GlusterFS over the nodes in the cluster to produce a single large filesystem. For my Honors research project I've been asked to look into making some improvements to GlusterFS, to try to improve performance by moving files within the GlusterFS volume closer to the node that is accessing them. What I was wondering is basically how hard it would be to write code to modify the metadata so that when a file is accessed, it is moved to the node it was accessed from and its location is updated in the metadata.
So, you want a unify/AFR hybrid translator that keeps track of which nodes use which files most often, and migrates each file to that node? Perhaps a probabilistic local caching approach would do well here. When a node accesses a file, there is a chance that it will replicate the file to local storage. If a node accesses a file repeatedly, the cumulative chance approaches unity. The problem is that you need some way of ensuring that files don't exist on more than XYZ nodes, and that when the local store fills up and you drop the least recently used file from a node, the file being dropped still exists somewhere else.
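To make that concrete, here is a minimal sketch in Python (not translator code) of the probabilistic caching and LRU eviction described above; the fetch_remote and replica_count hooks are hypothetical stand-ins for whatever remote read and cluster metadata lookup you would actually use:

```python
import random
from collections import OrderedDict

class ProbabilisticLocalCache:
    """Sketch: each access has a fixed chance of pulling the file onto
    local storage, so repeatedly-used files almost certainly end up local.
    Eviction is least-recently-used, but an entry is only dropped if at
    least one other node still holds a copy."""

    def __init__(self, capacity, cache_probability=0.2):
        self.capacity = capacity
        self.p = cache_probability
        self.local = OrderedDict()   # filename -> data, least recently used first

    def on_access(self, name, fetch_remote, replica_count):
        if name in self.local:
            self.local.move_to_end(name)       # refresh LRU position
            return self.local[name]

        data = fetch_remote(name)              # served by whichever node holds it
        if random.random() < self.p:           # probabilistic local replication
            self._make_room(replica_count)
            if len(self.local) < self.capacity:
                self.local[name] = data
        return data

    def _make_room(self, replica_count):
        # Evict least-recently-used entries, but never the last copy in the cluster.
        for victim in list(self.local):
            if len(self.local) < self.capacity:
                break
            if replica_count(victim) > 1:
                del self.local[victim]
```

With a per-access probability p, the chance that a file is cached locally after n accesses is 1 - (1 - p)^n, which is what pulls frequently-used files onto the nodes that use them.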
Interesting enough idea, but I'm not sure the book-keeping overhead would be outweighed by the speed benefits, especially on a fast network. You'd also not be able to route requests for a particular file easily, which might end up meaning a broadcast request to all nodes to establish who has the file available.
I suspect that designing an algorithm that does all this with sufficiently little overhead to keep you ahead in performance will be the most difficult part, not writing a GlusterFS plugin. You are almost looking at a variant of a probabilistically cached distributed hash table network, only without using hashes for routing (which makes it more difficult).
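For contrast, this is roughly what hash-based routing buys a conventional DHT: any node can compute a file's owner from the name alone, with no broadcast lookup. A small consistent-hashing sketch (nothing GlusterFS-specific, node names are just illustrative strings); once files migrate to wherever they are used most, this property is lost and you need a lookup or broadcast step instead:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map each file name to an owning node via a hash ring, so the owner
    is computable locally on any node without asking the rest of the cluster."""

    def __init__(self, nodes, vnodes=100):
        # Place several virtual points per node on the ring to spread load.
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, filename):
        # The first ring position clockwise from the file's hash owns the file.
        idx = bisect.bisect(self._keys, self._hash(filename)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node1", "node2", "node3"])
print(ring.owner("/data/results.csv"))   # same answer on every node, no broadcast
```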
I'd _LOVE_ to see this done, though, it sounds like an awesome project. :) Gordan