Re: Can I bring a development idea to Dev's attention?


 



 On 24/09/2010 08:25, Shehjar Tikoo wrote:
Thanks, we have similar locking improvements in mind but cannot promise a date when these will be available. Some of the challenges that we'll need to think about are how to map any such locking scheme to standard locking behaviour for posix/nfsv3/v4/cifs.

Hi, thanks for replying

Whilst I can see that there is some optimisation to be had by combining brick-level locking with filesystem-level locking, I just want to clarify that my proposal was really about inter-brick locks (ie locks between bricks), and not really about the top-level filesystem locking.

Just to clarify (and apologies if I'm trying to teach filesystem experts really obvious stuff...)

- The goal of any fileserver is to take async requests from lots of clients and arrange to serialise that access
- Up until recently such fileservers have been on a single machine, but involving multiple async clients connecting
- Even in a single server solution the bottleneck becomes that each client cannot cache *any* data since it's not known if the server copy has changed since we accessed it (even a microsecond earlier)
- The solution which has become popular (see CIFS, NFSv4 (?), GFS2, etc) was to offer clients an "optimistic lock", ie the client can acquire a token which while it's held means that it can cache data locked by that token and even offer writeback optimisations on that data (obviously subject to whatever the application tolerates for unsync'ed data)
- This "optimistic lock" means that we effectively push the file locking to the client, hence once a lock is acquired then further access by the client is no longer bounded by the network access latency. Under many circumstances this leads to massive speedups
- Clearly when a second client comes along and demands access to the same data then we need a process to break the lock and inform the first client that they need to reacquire the lock (or revert to a kind of "write-through" access system while waiting)
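
To make the "optimistic lock" idea concrete, here is a minimal Python sketch of the single-server case. All the names (LeaseServer, CachingClient, break_lease) are made up for illustration and are not taken from CIFS, NFSv4, GFS2 or Gluster, and it ignores failures, timeouts and real concurrency entirely.

```python
class LeaseServer:
    """Toy single-server model: at most one token holder per file."""

    def __init__(self):
        self.data = {}      # file name -> authoritative contents
        self.holder = {}    # file name -> client currently holding the token

    def acquire(self, client, name):
        current = self.holder.get(name)
        if current is not None and current is not client:
            # Break the lock: the old holder flushes and drops its cache.
            current.break_lease(name)
        self.holder[name] = client
        return self.data.get(name, b"")

    def write_back(self, name, contents):
        self.data[name] = contents


class CachingClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}         # file name -> cached contents (token held)
        self.dirty = set()      # files with unsync'ed local writes
        self.round_trips = 0    # how often we had to talk to the server

    def _ensure_lease(self, name):
        if name not in self.cache:
            self.round_trips += 1
            self.cache[name] = self.server.acquire(self, name)

    def read(self, name):
        self._ensure_lease(name)
        return self.cache[name]

    def write(self, name, contents):
        self._ensure_lease(name)
        self.cache[name] = contents     # writeback: not sent to the server yet
        self.dirty.add(name)

    def break_lease(self, name):
        if name in self.dirty:
            self.server.write_back(name, self.cache[name])
            self.dirty.discard(name)
        self.cache.pop(name, None)
```

While one client holds the token, repeated reads and writes stay local and round_trips does not move; only a second client asking for the same file forces a write-back and a later re-acquire.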

So this process clearly benefits situations where access is serialised, with a single client at a time. Databases aside, this access pattern seems quite common for lots of applications.

So with regards to Gluster, I would see that we need this same type of locking implemented at the brick level. If you re-read the description above, the *gluster servers* now take the role of the clients (think of the lower level being bricks talking to each other, and the upper level being clients talking to bricks). ie yes, posix locking needs to serialise access for every end client that connects to every brick, but we can also benefit from locking to serialise access between bricks. If 3,000 clients hammer one brick for a single file, then what we care about is that this single brick is allowed to read/write that file freely because it has informed the other bricks that it now holds a lock; serialising all the clients talking to that one brick is a separate problem.

So compared with traditional fileservers we actually need two levels of locking to serialise access. At one level we need to serialise clients' access to the filesystem, and lower down we need to serialise access between bricks.

I think an alternative way of looking at (and perhaps implementing) the situation could be something like:

- Consider two bricks with files replicated between them
- Client 1 accesses Brick 1 and requests File A
- Brick 1 contacts the other replicas and requests to become the "master replica" of that file. All future accesses to that file must now go through only Brick 1 while it remains in that "role"
- If Client 2 accesses Brick 1 and tries to do something with File A, then the normal filesystem locking must arrange for serialisation between Client 1 and Client 2, however, Brick 1 need not contact any other brick and there is no network latency penalty serving that file to Client 2 (obviously at some point one client will write data and we need to sync that, but read access incurs no network access)

- OK, now the trick is what happens when Client 3 accesses Brick 2 and requests File A... Somehow we need to wrest control back from Brick 1 and inform it that it's no longer the "master". A really simple solution to this (at least conceptually) is to proxy all access requests from Brick 2 back to Brick 1. This satisfies our requirement that accesses are serialised across bricks, and effectively there is still a "master" brick remaining in control.
- We can see that this setup is conceptually similar to having a traditional lock server arbitrating brick access to a given file, but in the example above we have implemented a distributed lock server: the lock server effectively becomes the same server as what we hope is the "hot server", so that we aren't incurring network latency to contact the lock server all the time.
- A further improvement would clearly be to have some kind of process where the "master brick" can move about, ie in the case above, if Client 3 starts to bash away at Brick 2 for File A, then mastership migrates to Brick 2, which now holds the lock. Any access through Brick 1 must then either proxy requests back to Brick 2 or re-acquire its lock (ie become the master again).
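
The steps above can be sketched roughly as follows, purely to illustrate the idea and not how Gluster's replication actually works. The Brick class, the hops counter and the MIGRATE_AFTER threshold are all hypothetical names of mine, and the sketch assumes every brick is reachable and nothing runs concurrently.

```python
class Brick:
    MIGRATE_AFTER = 3   # hypothetical: proxied reads before mastership moves

    def __init__(self, name):
        self.name = name
        self.peers = []     # the other replicas
        self.master = {}    # path -> brick we believe is master for that file
        self.store = {}     # path -> contents of our replica
        self.proxied = {}   # path -> reads we have proxied to a remote master
        self.hops = 0       # brick-to-brick network requests we made

    def _claim(self, path):
        # Tell every replica that we are now the master for this file.
        for peer in self.peers:
            self.hops += 1
            peer.master[path] = self
        self.master[path] = self
        self.proxied[path] = 0

    def read(self, path):
        owner = self.master.get(path)
        if owner is None:           # first access anywhere: become master
            self._claim(path)
            owner = self
        if owner is self:
            return self.store.get(path)     # served with no network access
        # Someone else is master: proxy the request to them.
        self.hops += 1
        self.proxied[path] = self.proxied.get(path, 0) + 1
        value = owner.read(path)
        if self.proxied[path] >= self.MIGRATE_AFTER:
            self._claim(path)       # we look like the hot brick, take over
        return value

    def write(self, path, contents):
        owner = self.master.get(path)
        if owner is None:
            self._claim(path)
            owner = self
        if owner is not self:
            self.hops += 1
            return owner.write(path, contents)
        self.store[path] = contents
        for peer in self.peers:     # writes still have to be synced out
            self.hops += 1
            peer.store[path] = contents
```

With two bricks, reads through the master cost no hops, reads through the other brick cost one hop each until the threshold is reached, and after that the roles swap.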

OK, so the above is a very simple example of optimistic locking and could be trivially implemented using an external lock server which tracks which brick currently holds the lock for a given file (ie can read/write freely without first checking if other bricks have modified the file). A given brick which doesn't hold a lock on a file must first do roughly what it does already: contact the lock server to see if another brick holds the lock. If not, it can acquire the lock itself. If the lock is held elsewhere, we need to either break the lock or proxy access requests to the server holding the lock.
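
As a rough illustration of that external lock server, here is a small Python sketch. LockServer and open_file are hypothetical names, bricks are identified by plain strings, and the "break the lock" path is left out so that a brick which loses the race simply proxies.

```python
import threading


class LockServer:
    """Toy lock server: remembers which brick holds the lock for each file."""

    def __init__(self):
        self._owners = {}               # path -> name of the holding brick
        self._mutex = threading.Lock()  # protects _owners

    def acquire(self, path, brick):
        # Grant the lock to `brick` if it is free; either way return
        # the name of whichever brick holds it afterwards.
        with self._mutex:
            return self._owners.setdefault(path, brick)

    def release(self, path, brick):
        with self._mutex:
            if self._owners.get(path) == brick:
                del self._owners[path]


def open_file(server, held, path, brick):
    """Decide how `brick` should serve `path`.

    `held` is the set of paths this brick already holds the lock for.
    """
    if path in held:
        return "local"              # lock already held, no network request
    owner = server.acquire(path, brick)
    if owner == brick:
        held.add(path)
        return "acquired"           # one request, then sticky
    return "proxy:" + owner         # held elsewhere, forward to that brick
```

The first access to a file costs one request to the lock server, later accesses from the same brick return "local", and a second brick is told where to proxy.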

Really this is not so different to what is there today, but it's simply an efficiency improvement because we don't need to touch *every* brick for *every* file access, instead we make some network requests on first access to a file and then can continue to touch that file for a period afterwards without needing further network access with other bricks

However, whilst some kind of implementation of the above could offer a huge performance speedup for many of the situations which come up on the mailing list, the issue is that the lock server becomes a) a bottleneck and b) a point of failure. So the chain of thought almost certainly goes something like:

- Make the gluster bricks become the lock servers, ie they negotiate amongst themselves. Really this is roughly what happens right now, only it's on every access, rather than access being "sticky" once acquired
- Now analyse all the corner cases that bricks go down holding locks, or get segmented while holding/acquiring locks and discover some tricky issues...

Paxos seems like a clever way of dealing with the locking going distributed, yet not necessarily having a 100% consistent view of who owns which lock. By introducing a voting method it can show robustness in the face of failed machines and new machines can be added without needing to store reliable state information (or at least this is true with the improvements described in the articles)
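
For anyone who hasn't met it, the voting idea can be sketched as below. This is the plain textbook single-decree form of Paxos deciding one value ("which brick owns the lock for File A"), not the improved variants from the articles, and the class and function names are mine. A real implementation needs persistent acceptor state, unique ballot numbers per proposer, retries and real networking, all of which are assumed away here.

```python
class Acceptor:
    """One voting brick."""

    def __init__(self):
        self.promised = -1      # highest ballot we have promised to honour
        self.accepted = None    # (ballot, value) we last accepted, if any

    def prepare(self, ballot):
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def accept(self, ballot, value):
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False


class DownAcceptor:
    """Stands in for a brick that is unreachable."""

    def prepare(self, ballot):
        raise ConnectionError("brick unreachable")

    def accept(self, ballot, value):
        raise ConnectionError("brick unreachable")


def propose(acceptors, ballot, value):
    """Run one round; return the chosen value, or None without a majority."""
    majority = len(acceptors) // 2 + 1

    # Phase 1: collect promises from whoever answers.
    promises = []
    for acceptor in acceptors:
        try:
            ok, previous = acceptor.prepare(ballot)
        except ConnectionError:
            continue
        if ok:
            promises.append(previous)
    if len(promises) < majority:
        return None

    # If any acceptor already accepted a value, we must propose that one.
    prior = [p for p in promises if p is not None]
    if prior:
        value = max(prior, key=lambda p: p[0])[1]

    # Phase 2: ask the acceptors to accept the value.
    acks = 0
    for acceptor in acceptors:
        try:
            if acceptor.accept(ballot, value):
                acks += 1
        except ConnectionError:
            continue
    return value if acks >= majority else None
```

With three bricks and one of them down, a proposal can still be chosen by the remaining two, and a later proposer with a different preference ends up confirming the owner that was already chosen rather than overriding it.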



Does that make sense? Apologies if the above is long-winded, but the point is really that the performance improvements come from pushing locks between bricks, and this is probably distinct from client-level locking such as nfs/cifs/posix locking.

For advanced cluster filesystems such as GFS2, the general "optimistic locking" technique appears to show massive speed improvements (for many access patterns) and it's also likely to do so in Gluster. Really my original email jumped two steps and suggested an improved form of distributed locking, which itself could be used as the actual implementation, but other forms of distributed locking between bricks would be highly desirable also.

Thanks for listening

Ed W


