On Mon, Jul 9, 2018 at 2:10 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> On Mon, 2018-07-09 at 16:15 +0800, Eddie Horng wrote:
>> 2018-07-09 14:30 GMT+08:00 Amir Goldstein <amir73il@xxxxxxxxx>:
>> > I have no clue.
>> > Is the leaked lock and crash on the client or the server?
>> > If you can get an strace from the process that gets the Leaked message
>> > maybe it will give us a clue to the sort of file descriptor of the leaked
>> > file and how it was opened.
>> > Alternatively print the inode numbers and file types of flock calls to see
>> > where we have a mismatch.
>> >
>> > Thanks,
>> > Amir.
>>
>> Both the leaked lock and crash are on the server.
>>
>> I can emulate one of the lock failure case with a reproducer run along with
>> android building. The reproducer's behavior and result are very similar with
>> out/.lock generated by android build to control only one build process can
>> run on at the same time. In the first time (out/.lock is not exist),
>> flock works but a
>> "Leaked ..." message is supposed caused by it. After a round of build
>> completed, do a second build, the out/.lock is now failed to be locked.
>> The reproducer open and flock another file under out/ can reproduce the case.
>> Can this scenario help us to debug?
>>
>> process 1:                                       process 2:
>> $ ~/flock/a.out /mnt/n/out/mylock
>> flock succeed, press any key to continue...
>>
>>                                                  $ cd /mnt/n && make -j12 # (build android)
>> close succeed
>> $ ~/flock/a.out /mnt/n/out/mylock
>> failed to lock file '/mnt/n/out/mylock': Resource temporarily unavailable
>> close succeed
>>
>> reproducer:
>> #include <stdio.h>
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <fcntl.h>
>> #include <unistd.h>
>> #include <sys/file.h>
>> #include <errno.h>
>> #include <string.h>
>>
>> int main(int argc, void **argv) {
>>     char *filename=argv[1];
>>     int fd = open(filename, O_RDWR|O_CREAT, 0666);
>>     int flock_result = flock(fd, LOCK_EX | LOCK_NB);
>>     int err;
>>     if (flock_result != 0) {
>>         printf("failed to lock file '%s': %s\n", filename, strerror(errno));
>>         goto out;
>>     }
>>     printf("flock succeed, press any key to continue...\n");
>>     getchar();
>>
>> out:
>>     err = close(fd);
>>     if (err == 0)
>>         printf("close succeed\n");
>>     else
>>         printf("failed to close %d: %s\n", fd, strerror(errno));
>> }
>>
>
> This setup is pretty complicated. IIUC, you are exporting overlayfs via
> knfsd and then using the NFS client's flock emulation to map flock locks
> to POSIX ones. I think you probably want to simplify this reproducer a
> bit.
>
> Is it possible to reproduce this on a setup that doesn't have overlayfs
> involved, just to rule it in or out as a factor here?
>
> There are also a number of tracepoints in the posix locking code. It
> might be interesting to turn on the ones for posix_lock_inode and
> locks_remove_posix and and then run the reproducer to get a better idea
> of what's happening to those locks.
>

Thanks for the suggestions Jeff.

Eddie,

This is NFS v4. Right?
Do you wait until Android build completes before closing the first
reproducer fd?
I suspect you can replace the effect of Android build with drop_caches
on the server.

Jeff,

Does knfsd hold a reference on the file/dentry/inode when a lock is taken?
Assuming this is indeed a bug reproduced only with NFS+overlayfs, it sounds
like overlay file handle decode fails to return the same inode that knfsd
holds with the lock.

Thanks,
Amir.