Derek Price wrote:
The DLM that GFS uses must already take this into account since it
appears to work just fine, and the GPL'd code for that DLM was
officially added to the Linux kernel with release 2.6.19, according to
Wikipedia. Not sure how portable that would be, but the source is
available...
I'm not sure how portable that would be. A nice thing about GlusterFS is
that the only requirement is FUSE, which means it'll also work on Solaris.
If some HA and fault-tolerant DHT implementation exists that already
handles atomic hash inserts with recognizable failures for keys that
already exist, then perhaps that could take the place of DLM's quorum
model, but I think any algorithm that requires contacting all nodes
will prove to be a bad idea in the end.
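Just to pin down what "recognizable failures" would have to look like: essentially a put-if-absent primitive that lock acquisition could map onto. A rough Python sketch of the shape of such an interface (the dht object and its insert_if_absent/delete calls are hypothetical):

    class KeyExists(Exception):
        """Raised when the key is already present in the DHT."""

    def acquire_lock(dht, path, node_id):
        """Try to take a cluster-wide lock on 'path' by inserting a lock
        key into the DHT; succeeds only if no other node already holds it."""
        try:
            dht.insert_if_absent("lock:" + path, node_id)
            return True
        except KeyExists:
            return False    # someone else holds the lock - a recognizable failure

    def release_lock(dht, path):
        dht.delete("lock:" + path)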
Not all nodes - only the nodes that contain a certain file. A single
ping broadcast to find out who has a copy of the file should prove to
be of insignificant bandwidth overhead compared to actual file
transfers, unless you are dealing with a lot of files that are
significantly smaller than a network packet.
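i.e. something along these lines, as a rough Python sketch (the port number and the WHOHAS/HAVE message format are made up purely for illustration):

    import socket

    DISCOVERY_PORT = 24007   # arbitrary port chosen for the sketch

    def who_has(path, timeout=0.5):
        """Broadcast a single 'who has this file?' query in one UDP
        datagram and collect whatever replies arrive before the timeout."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.settimeout(timeout)
        sock.sendto(("WHOHAS " + path).encode(),
                    ("255.255.255.255", DISCOVERY_PORT))
        replies = []
        try:
            while True:
                data, addr = sock.recvfrom(1500)   # fits in one ethernet frame
                replies.append((addr[0], data.decode()))
        except socket.timeout:
            pass
        return replies   # e.g. [("10.0.0.7", "HAVE <version>"), ...]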
My point was that, as I understood your algorithm, a client would not
know which nodes contained a certain file until all nodes had been
contacted. So, while the actual bandwidth, even to consult thousands of
nodes, might be small relative to file transfer bandwidth, the client
can't assume it has a complete answer until it gets all the replies,
which means waiting for requests to any downed nodes to time out.
I agree that waiting for all nodes could be an issue in case of downed
nodes, and I concur that quorum would be a good work-around.
Broadcasting a single packet (should easily fit into a single 1500 byte
ethernet frame) to all nodes isn't _hugely_ expensive.
Multicast is usually UDP, so there's no TCP timeouts/retries to contend
with. It wouldn't matter if some nodes are down - we can act as soon as
we have answers from (n/2)+1 nodes, assuming in the case of requesting a
file that isn't local, that one of those peers has the file.
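In code terms, the collection loop from the earlier sketch just gains an early exit: stop as soon as (n/2)+1 nodes have answered instead of sitting out the full timeout (cluster_size would come from configuration):

    def collect_quorum(sock, cluster_size, timeout=0.5):
        """Collect WHOHAS replies, returning as soon as a quorum of
        (n/2)+1 nodes has answered; downed nodes then cost us nothing
        beyond their own silence."""
        quorum = cluster_size // 2 + 1
        replies = []
        sock.settimeout(timeout)
        try:
            while len(replies) < quorum:
                data, addr = sock.recvfrom(1500)
                replies.append((addr[0], data.decode()))
        except socket.timeout:
            pass   # fewer than quorum answered before the timeout
        return replies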
Meaning that if you
assume that at least one node will always be down, then the minimum time
to locate a node with the most recent copy of the file (and thus the
minimum time to begin any read) is always at least the timeout attached
to waiting for the ping reply.
There are ways around that. Flag a node as being out of the cluster when
quorum decides it is unresponsive, and fence it.
Having the entire quorum aware of which version of each file is the most
recent and where to find the file avoids this problem, again, until just
less than half the nodes become unreachable.
There should, in theory, be only one version of the file in the entire
cluster. If there isn't, then the AFR auto-heal should be invoked to see
to it that there is only one. The important thing is to know which nodes
have a copy of the file.
I might optimize the expunge algorithm slightly by having nodes with
low loads volunteer to copy files that otherwise couldn't be expunged
from a node. Better yet, perhaps, would be a background process that
runs on lightly loaded nodes and tries to create additional redundant
copies at some configurable tolerance beyond the "minimum # of
copies" threshold.
Not just lightly loaded nodes, but more importantly, nodes with most
free space available. :)
Yes, the algorithm to detect "loading" should probably consider as many
resource constraints as appears practical.
Load in terms of performance is a non-critical optimization. Space
requirements being met is a mandatory requirement. :)
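So the background copy-volunteering loop would be gated on free space first, load second. A rough Python sketch (the cluster.* calls are hypothetical stand-ins for whatever the real peer protocol ends up being):

    import os, time

    REDUNDANCY_TARGET = 3    # configurable: minimum # of copies + tolerance
    SPACE_RESERVE = 0.10     # never volunteer below 10% free space on the brick

    def free_space_fraction(path):
        st = os.statvfs(path)
        return st.f_bavail / float(st.f_blocks)

    def volunteer_loop(cluster, brick_path):
        """Runs on every node; acts only when this node has spare space
        (mandatory) and low load (the nice-to-have optimization)."""
        while True:
            if (free_space_fraction(brick_path) > SPACE_RESERVE
                    and os.getloadavg()[0] < 1.0):
                for path, copies in cluster.under_replicated_files():
                    if copies < REDUNDANCY_TARGET and not cluster.have_local_copy(path):
                        cluster.fetch_copy(path)   # pull one file, then re-check space
                        break
            time.sleep(60)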
For file delta writes, an AFR type mechanism could be used to send
the deltas to all the nodes that have the file. This could all get
quite tricky, because it might require a separate multicast group to
be set up for up to every node combination subset, in order to keep
the network bandwidth down (or you'd just end up broadcasting to all
nodes, which means things wouldn't scale the way switches allow; it'd be
more like using hubs).
This would potentially have the problem that there are only 24 bits
of IP multicast address space, but that should provide enough groups
with sensible redundancy levels to cover all node combinations. This
may or may not be way OTT complicated, though. There is probably a
simpler and more sane solution.
I'm not sure what overhead is involved in creating multicast groups,
but they would only be required for files currently locked for write,
so perhaps creating and discarding the multicast groups could be done
in conjunction with creation and release of write locks.
Sure, these could be dynamic, but setup and teardown might cause
enough overhead that you might as well be broadcasting all the locks
and writes, and just expect the affected nodes to pick those out of
the air and act on them.
It's also possible that you could reduce the complexity of this
problem by simply discarding copies, on write, down to as close to the
minimum # as other nodes will allow. However, I think that
might reduce some of the performance benefits this design otherwise
gives each node.
Also remember that the broadcasts or multicasts would only actually be
useful for locks and file discovery. The actual read file transfer
would be point-to-point and writes would be distributed to only the
subset of nodes that are currently caching the files.
Read would be point-to-point (perhaps multi-point to point for implicit
read striping across all known valid copies?), but it could still be
useful to use multi-cast for write, especially if the redundant copies
were behind a different switch than the node accepting the write. So
multi-cast setup could happen when a server obtained a write lock, and
teardown would be delayed until synchronization of redundant copies had
completed.
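Mechanically, the join/leave maps onto IGMP membership changes on the replica nodes: they join the group when the write lock on their file is granted, the writer sends the deltas to the group address, and membership is dropped once the redundant copies have synced. Rough Python sketch of the membership part (how the group address is chosen is hand-waved here; see below):

    import socket
    import struct

    def join_group(group_addr, port):
        """Called on each replica node when the write lock on its file is
        granted: subscribe to the multicast group carrying the deltas."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        mreq = struct.pack("4s4s", socket.inet_aton(group_addr),
                           socket.inet_aton("0.0.0.0"))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        return sock

    def leave_group(sock, group_addr):
        """Called once the redundant copies are in sync and the lock is
        released: drop membership and close the socket."""
        mreq = struct.pack("4s4s", socket.inet_aton(group_addr),
                           socket.inet_aton("0.0.0.0"))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_DROP_MEMBERSHIP, mreq)
        sock.close()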
Possibly, but if the number of possible node combinations could be
enumerated WRT a given number of nodes and minimum required redundancy,
setting them up statically and using a hash-lookup would probably be
quicker, as it wouldn't require constant setups/teardowns.
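E.g. derive the group address deterministically from the (sorted) set of nodes holding the file, so every node computes the same address with no setup traffic at all. A rough sketch; the 239.0.0.0/8 administratively-scoped range and the SHA-1 truncation are just illustrative choices, and a collision merely means a few uninterested receivers filter out extra packets:

    import hashlib

    def group_for_nodes(node_ids):
        """Map a set of replica nodes to a multicast address in
        239.0.0.0/8 using the low 24 bits of a hash over the sorted
        membership - every node computes the same answer locally."""
        key = ",".join(sorted(node_ids)).encode()
        h = int.from_bytes(hashlib.sha1(key).digest()[:3], "big")  # 24 bits
        return "239.%d.%d.%d" % ((h >> 16) & 0xFF, (h >> 8) & 0xFF, h & 0xFF)

    # e.g. group_for_nodes({"node03", "node07", "node12"}) always yields
    # the same 239.x.y.z on every node that knows the replica set.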
We have 2^24 possible multicast "channels" (addresses).
Number of possible ways to pick k nodes out of n (files being
k-redundant) is
n! / (k! (n-k)!)
Whether these constraints would allow for sufficiently large clusters, I
don't know.
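For a rough feel of where that limit sits (quick check with Python's math.comb, comparing C(n,k) against 2^24):

    from math import comb

    print(2 ** 24)        # 16777216  available group addresses
    print(comb(466, 3))   # 16757360  -> still fits
    print(comb(467, 3))   # 16865705  -> doesn't
    print(comb(1000, 3))  # 166167000 -> an order of magnitude over

So with 3-way redundancy a one-group-per-node-subset scheme tops out somewhere around 466 nodes; whether that counts as "sufficiently large" is the question.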
There would need to be special handling of a case where a node
accepting a big write is running out of space as a consequence and
something has to be dropped. Obviously, none of the currently open
files can be discarded, so there would need to be some kind of an
auxiliary process that would make a node request a "volunpeer" (pun
intended) to take over a file that it needs to flush out, if
discarding it would bring the redundancy below the required threshold.
I think this could be worked into the normal expunge algorithm with a
property like: "ANY request to expunge a file that reduces its copy
count below the redundancy threshold will ALWAYS generate a volunpeer IF
at least one node exists with the disk space available".
Yes. Failing that, we could try the next LRU file.
It wouldn't require any special casing - the needed space will always
become available upon expunge if space for the migrating file exists
anywhere on the network. If all the files are expunged, or they can't
be even with this property of expunge, and the local disk still fills
up, then I think it would be reasonable for the FS to return a disk full
error.
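Folding that property into the expunge path, it might look roughly like this (Python sketch; the cluster.* calls are hypothetical, and files are walked in LRU order as discussed above):

    def make_room(cluster, local_files_lru, bytes_needed, min_copies):
        """Expunge local copies, least recently used first, but only once
        each file's redundancy elsewhere is secured; report failure (disk
        full) if not enough space can be freed."""
        freed = 0
        for f in local_files_lru:
            if f.is_open():
                continue                    # never drop currently open files
            if cluster.copy_count(f.path) <= min_copies:
                # dropping our copy would break redundancy: find a volunpeer
                if not cluster.request_volunpeer(f.path):
                    continue                # nobody has space; try the next LRU file
            f.expunge_local_copy()
            freed += f.size
            if freed >= bytes_needed:
                return True
        return False                        # caller returns ENOSPC to the application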
Agreed.
Gordan