On Sun, Nov 16, 2008 at 2:48 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote: > On Fri, Nov 14, 2008 at 03:34:56PM -0500, Mike Snitzer wrote: >> Hello, >> >> I'd like to understand the state of Linux's NFSv4 server regarding the >> NFSv4 spec's _optional_ ordered blocking lock list implementation. >> Without something like the following patch isn't there still concern >> for NFSv4 clients being starved from ever getting a conflicting lock >> (local POSIX or lockd waiters would race to get it first)? > > Yes. I have patches that approach the problem by: > > - Defining a new type of lock type, the "provisional" lock, > which is just like a posix lock type, *except* that it > doesn't merge with other locks, and hence can still be cancelled > safely. > - Modifies the process of waking up waiters for a just-released > lock to make it a two-step process: > 1. Apply a "provisional" lock, if there are no > conflicts, and wake whoever was waiting for it. (If > there are still conflicts, put the lock on the new > list without waking anyone.) > 2. Allow the waiter to upgrade the provisional lock to a > real posix lock (or, alternatively, to cancel it). > - Take advantage of the above to implement fair queuing for v4, > by stretching out the gap between steps 1 and 2 up to a lease > period, thus allowing a lock that is available but that a > client has yet polled for to be temporarily represented by a > provisional lock. > > The thought was that we'd also solve a couple other problems along the > way, by: > > - Preventing thundering herd problems on posix locks with lots > of waiters. > - Increasing fairness of posix locking (even among local > lockers). > > But we weren't able to actually show any improvement for posix locks > with local waiters, and it's unclear whether anyone cares much about > posix lock fairness. > > So it's unclear whether it's worth doing the 2-step process above for > all posix lockers. So maybe the patches should be written to instead > implement provisional locks as an optional extra for use of the v4 > server. Thanks for the overview. I think that given how easy it is to starve a v4 client (see below) something needs to give. > A real-world test case (showing starvation of v4 clients) would be > interesting if anyone had one. I'm not sure what your definition of "real-world test case" is (so maybe the following is moot) but the attached program (written by a co-worker) can be used to easily illustrate the starvation of v4 clients. The program tests to see how long it takes to lock/unlock a file 1000 times. If ran locally on the nfs-server against an exported ext3 FS I get a "step time" of 11ms. Ran from a v3 client: ~390ms. Ran from a v4 client: ~430ms. ran simultaneously on the nfs-server and the v3 client; local=~30ms, v3=~440ms ran simultaneously on two v3 clients; both v3=~580ms ran simultaneously on the nfs-server and the two v3 clients; both v3=~580ms, but local ranges ~1500ms to ~9300ms ran simultaneously on two v4 clients; v4=~430ms but with frequent interleaved outliers ranging from ~1500ms to ~75000ms ran simultaneously on the nfs-server and the v4 client; local=~11ms, v4=STARVED ran simultaneously on the v3 and the v4 client; v3=~390ms, v4=STARVED FYI, "STARVED" above doesn't mean the v4 client _never_ acquires the lock. It eventually acquires the lock; albeit extremely rarely (e.g. after 5min) because v4 client polling is predisposed to lose the race with either the hyper-active v3 client or the local locker. Mike
#include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <fcntl.h> #include <sys/time.h> #include <string.h> #include <errno.h> #define STEP 1000 #define CMD F_SETLKW int main(int argc, char **argv) { struct flock lc_rq = {F_WRLCK, SEEK_CUR, 0, 100, 0}; int i, ret = 0; char str[10]; unsigned long sleep_timeout = 0; struct timeval tv_start, tv_prev, tv_cur; int fd1 = open(argv[1], O_RDWR), fd2 = 0; off_t start = 0; gettimeofday(&tv_start, NULL); tv_cur = tv_prev = tv_start; for (i=0;;i++) { lc_rq.l_type = F_WRLCK; ret = fcntl(fd1, CMD, &lc_rq); lc_rq.l_type = F_UNLCK; ret = fcntl(fd1, CMD, &lc_rq); if (0 == i%STEP) { gettimeofday(&tv_cur, NULL); printf("i=%d tot time = %dms (step time= %dms)\n", i/STEP, 1000*(tv_cur.tv_sec - tv_start.tv_sec) + (tv_cur.tv_usec - tv_start.tv_usec)/1000, 1000*(tv_cur.tv_sec - tv_prev.tv_sec) + (tv_cur.tv_usec - tv_prev.tv_usec)/1000); tv_prev = tv_cur; } } return 0; }