On Tue, May 26, 2015 at 10:17:05AM -0400, Jiffin Thottan wrote: > > > ----- Original Message ----- > From: "Vijay Bellur" <vbellur@xxxxxxxxxx> > To: "Krishnan Parthasarathi" <kparthas@xxxxxxxxxx>, "Shyam" <srangana@xxxxxxxxxx> > Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx> > Sent: Friday, 22 May, 2015 9:49:15 AM > Subject: Re: Spurious regression: Checking on 3 test cases > > On 05/22/2015 07:13 AM, Krishnan Parthasarathi wrote: > > > >> Are the following tests in any spurious regression failure lists? or, > >> noticed by others? ... > >> 2) ./tests/basic/mount-nfs-auth.t > >> Run: > >> http://build.gluster.org/job/rackspace-regression-2GB-triggered/9406/consoleFull > >> > > Right now there is no easy fix for the issue. It may require to > reconstruct entire netgroup/export structures used for this feature. Indeed, my suspicion is that the current structures for netgroups/exports and the auth_caching is not completely thread safe. The structures use a dict for gathering entries, and hash some of the contents of the entry as a key. This makes it quick to check if the entry has been cached. There also is a refresh thread that can read the exports and netgroups file from disk, and creates entries in the dict for the respective functionality. The problem (likely) occurs when this happens: Step | NFS-client | refresh thread ------+-------------------------------+--------------------------- | | 1 | mount request | | fetch entry from caching dict | | | 2 | | timeout expired | | read file from disk | | create and fill a new dict | | replace dict with new one | | 3 | | free old dict and entries | | 4 | try to use the cache entry | | SEGFAULT | | | Step 1 and 2 can happen at the same time, but 3 has to come after 1 and before 4. Step 4 is a very minimal usage of the cache entry, this makes hitting this problem very rare. Because the netgroups/exports cache entries are kept in a dict, there is a lot of type-casting going on. It is not trivial to understand how this works. My preference would be to modify the structures so that we can do without the dicts, but that is not straight forward either. I would like to have a refcount for the entries that are fetched from the dict. But that means that after type-casting the contents from the dict, there still is a window where the cache entry can get free'd (or --refcount'd). The next best thing to do, is adding a lock on the cache structure itself. Everywhere the cache is accessed, taking a read-lock would be needed. Adding an entry to the cache would require a write-lock. Looks like a decent amount of work, and needs careful checking to not miss any occurrences. This is currently what I think is the most suitable approach. Testing can probably be done by adding a delay in the mount path. Either a gdb script or systemtap that delays the execution of the thread that handles the mount. Other ideas are very much welcome :) Niels _______________________________________________ Gluster-devel mailing list Gluster-devel@xxxxxxxxxxx http://www.gluster.org/mailman/listinfo/gluster-devel