On Wed, 21 May 2008, Luke McGregor wrote:
I think I misunderstood the way you were proposing to implement a quorum. I
thought there was no state data stored.
What exactly are you referring to as "state data"?
Let me just confirm how you were proposing to implement this.
[numbers added for later reference]
1 - (client) A node requests a write to a specific file and broadcasts this to
the network
2 - (server) The server checks that no other nodes are claiming a lock on that
file and replies accordingly; if the file was lockable, the server locks it
Which server would this be? IIRC, every lock would need to be noted on at
least 50%+1 servers.
3 - (client) The node then waits for 50% + 1 nodes to respond and say that it
can write.
4 - (client) The node writes the file
5 - (client) The node broadcasts a file-completed message.
6 - (server) The server updates its locking database to free that file
Does this look correct?
I think so, but I am not entirely clear WRT whether you are talking about
one type of server or a separate server "class" for lock servers. I don't
believe separate lock/metadata servers were ever proposed.
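Just so we are comparing like with like, here is a very rough sketch (in
Python, not GlusterFS code) of the client-side flow in steps 1-6 as I read
them. The message names and the send/broadcast helpers are made up purely
for illustration:

def quorum_write(nodes, path, data, send, broadcast):
    quorum = len(nodes) // 2 + 1                      # 50% + 1

    # Step 1: broadcast the lock request to the network.
    replies = broadcast(nodes, ("lock-request", path))

    # Step 3: wait until at least 50% + 1 nodes grant the lock.
    granted = [node for node, ok in replies if ok]
    if len(granted) < quorum:
        # Back off so we don't leave half-granted locks lying around.
        broadcast(granted, ("lock-release", path))
        raise IOError("no quorum for lock on %s" % path)

    # Step 4: the node writes the file.
    for node in granted:
        send(node, ("write", path, data))

    # Steps 5/6: broadcast completion so each server frees its lock entry.
    broadcast(nodes, ("write-complete", path))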
I'm not entirely sure if it would be better to store a copy of all the
locks on all the nodes, or whether the locks should only be stored on the
nodes that have a copy of the file. The latter would make more sense, but
it would make the "quorum" consist of only the nodes that have a copy of
the file, not all the nodes in the cluster. The problem with this is that
a list of all the nodes in the cluster is fixed (or can be reasonably
fixed), whereas in a system where files are replicating/migrating/expiring
toward the nodes that use them most heavily, maintaining the list of nodes
that have any one file would become difficult.
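Put another way (hypothetical helpers, just to show the difference):

def cluster_quorum(all_nodes):
    # Cluster membership is fixed (or can reasonably be), so this is trivial.
    return len(all_nodes) // 2 + 1

def file_quorum(replica_nodes):
    # The replica set changes as the file replicates/migrates/expires, so the
    # hard part is maintaining replica_nodes itself, not computing the quorum.
    return len(replica_nodes) // 2 + 1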
If so, I have a few questions.
Is there any information stored by nodes on who is writing the file?
At the moment?
Not sure about unify (presumed to be the node that has the file).
In AFR, the first server in the list is the "lock server".
In the proposed new migrating translator? I'd have thought it would be
handled similarly to unify, only with the lock information replicated to all
the nodes storing the file, along with the file data itself. Details of
who has the current file lock along with meta-information about that lock
could be stored in xattrs (since that is what is already used for version
metadata).
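As a rough illustration of what that might look like on a node holding the
file (the attribute name below is invented for the example - it is not an
existing AFR/GlusterFS key, and I've used the user.* namespace rather than
trusted.* just so it runs without privileges):

import os
import time

LOCK_XATTR = b"user.lock.owner"    # hypothetical key, for illustration only

def record_lock(path, owner, timeout_ms=100):
    # Store "owner:expiry" next to the file data, like the version metadata.
    expiry = time.time() + timeout_ms / 1000.0
    os.setxattr(path, LOCK_XATTR, ("%s:%.6f" % (owner, expiry)).encode())

def current_lock(path):
    try:
        owner, expiry = os.getxattr(path, LOCK_XATTR).decode().rsplit(":", 1)
    except OSError:
        return None                # no lock recorded on this copy
    if time.time() > float(expiry):
        return None                # lock present but past its timeout
    return owner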
(If so, what happens when the lock fails? Won't the above model lock the file
on nodes which have no current lock but not actually hold the lock? I.e. node1
requests but node2 has the lock - won't some servers have granted the lock to
node1 and have that info stored?)
There are several ways this could be dealt with. In theory, all nodes
that have the file should agree on the locks on that file (since that is
replicated and acknowledged by all the nodes with that file). If this
isn't the case, something went wrong. We can wait for the lock to
expire (e.g. 100ms) and see if it gets refreshed. It shouldn't, or it
means that somehow one server is getting updated without the change
propagating to the other servers. The current AFR approach to resolving
this is to clobber the out-of-sync files with new versions.
If this is not the case, what happens if some servers
don't receive the unlock broadcast? Won't they still think that the file is
locked and respond on that basis the next time they are in a quorum?
That's why locks need timeouts and refreshes, to handle a node dying while it
holds a lock, along with acks for lock/unlock approvals during normal
operation.
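A minimal sketch of that lease idea, reusing the 100ms figure from above
(everything else is illustrative rather than how AFR currently does it):

import time

LEASE = 0.1    # 100ms, per the example above

class LockEntry:
    # What each server remembers about a granted lock.
    def __init__(self, owner):
        self.owner = owner
        self.last_refresh = time.time()

    def refresh(self):
        # The holder keeps sending refreshes while it is still working.
        self.last_refresh = time.time()

    def held(self):
        # A dead holder (or a missed unlock broadcast) simply stops
        # refreshing, so the lock drops out after LEASE on its own.
        return time.time() - self.last_refresh <= LEASE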
If we assume the simplest possible solution, where at least one copy of each
file is required, how would you identify a file which can be deleted on the
system without having to broadcast a query on every single file, starting
from the oldest?
You can't. I don't think there is an easy work-around. In the case of a
single node, this shouldn't cause a huge load on the network, but when the
network starts getting full AND there is still a relatively high amount of
migration required for optimization, the load would go up dramatically.
However - migration doesn't have to happen online. If there isn't enough
space to migrate to local disk, you can do a remote open, and sort out the
disk space freeing and possibly migration of the file to the local machine
asynchronously.
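Roughly what I mean, with placeholder helpers rather than real GlusterFS
calls:

import queue

# Queue for the space-freeing / migration work that doesn't have to happen
# online.
background = queue.Queue()

def open_with_fallback(path, local_free, size_needed, open_local, open_remote):
    if local_free >= size_needed:
        return open_local(path)
    # Not enough local disk: serve the request with a remote open now...
    fd = open_remote(path)
    # ...and sort out freeing space and migrating the file asynchronously.
    background.put(("free-and-migrate", path, size_needed))
    return fd

A background worker can then drain that queue whenever the node gets around
to it.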
Obviously I agree that distributed metadata is a really good thing to have
for scalability and reliability. However, I am worried that the whole
broadcasting side of things is going to cause some huge problems in
implementing our migration project.
Unify already does something similar to find which node has the file in
question.
I'm especially worried about how to solve the old-file problem.
Yes, that's not an easy one to solve, but as I said, if something needs
expunging for space reasons, it could be done asynchronously. It should
also, in theory, not become necessary until the network starts to get
relatively full.
I'm also worried that every server in the network is
going to have to hold a fairly sizable set of metadata; this seems to be a
problem in terms of scaling.
The only metadata I can think of is the lock information, which can be
stored in xattrs, the same as versioning for AFR. This hardly amounts to a
lot of data, especially since the metadata would be stored on the same
nodes as the files.
Gordan