Re: Improving real world performance by moving files closer to their target workloads

"Luke McGregor" <luke@xxxxxxxxxxxxxxx> · Mon, 19 May 2008 13:07:40 +1200

Hi everyone ive just had a read through the thread, thanks for all your
comments so far, has been really useful :)

Firstly im starting to think that maybe the best way forward is to somehow
centralise the metadata store (be that by distributing it or by having it
handled by a dedicated metadata server im not too sure). The reason im
thinking this may be the best way forward is that i think that any quorum
based approach will not be able to guarentee that any write is sucessful. if
a write occurs and the network is queried for a quorum before the node with
the latest copy has a chance to pass replicate its new data then there is
too much room for a quorum being reached without needing the approval of the
node with the latest copy. This could cause some serious problems especially
on a hevially accessed file. The problem would i believe be worsened as the
nodes which are hosting any hevially accessed file are the most likely to
not respond quickly to any kind of multicast.
Furthermore in the essence of performace you would want to act as soon as a
quorum was reached. this would effectivly mean that the nodes which made the
decision on the lock would be the lightest accessed nodes with the hevier
accessed nodes responding in slightly more time. I believe this would mean
that in order to be sure of a lock you would need to get a consesus from all
nodes. This would be unpractical.
Also looking for a file to delete to free space would be a really
inefficient proccess as a single request for space would potentially mean
querying the network for redundancy information on every other file stored.
this would not be practical.
I think a centralised metadata system would eliminate these problems as it
would be authoritive. A write shouldnt suceed without the central metadata
being updated and a lock shouldnt be granted without the central metadata
allowing it. This would also mean that old files would be invalidated by the
system centrally giving a side effect of allowing an easy rollback mechanism
until those files were deleted (if anybody ever wanted that feature). It
would also mean that freeing up space would be a reletivly simple operation.

With the plans that have been discussed earlier, what is the expected
timeframe on implementing the distributed namespace cache? And will this new
implementation be able to be extended to be able to achieve a system such as
i have described above? Will it be possible to host all of the information
that would be globally relevent on files in the distributed namespace cache?

In response to jordan

2) When the compute-nodes start to fill up, HSM migrates data from the
compute-nodes to a separate gluster pool that uses slower disks (RAID6 --
16x1TB in 3U), but with afr and/or striping depending on your needs, as well
as a compression filesystem (or at the gluster level if it gets implemented
soon). Deduplication would also come in handy here whenever that gets
implemented.

Just as an offside note i think your raid6 set will outperform your raid10
set even though raid6 is slower than raid10 and you may be using slower
drives, you still have 16 drives in the set which i believe will actually
give you faster perfromance than any 4 drive configuration.

Thanks again for all your replies, this is a very interesting discussion and
i think it will be very helpful for my research

Thanks
Luke McGregor