On Fri, Oct 25, 2019 at 02:29:04AM +0300, Boaz Harrosh wrote:
> On 25/10/2019 00:35, Dave Chinner wrote:
> > On Thu, Oct 24, 2019 at 05:05:45PM +0300, Boaz Harrosh wrote:
> > This isn't a theoretical problem - this is exactly the race
> > condition that led us to disabling the flag in the first place.
> > There is no serialisation between the read and write parts of the
> > page fault and the filesystem changing the DAX flag and ops vector,
> > and so fixing this problem requires holding yet more locks in the
> > filesystem path to completely lock out page fault processing on the
> > inode's mapping.
> >
>
> Again sorry that I do not explain very good.
>
> Already on the read fault we populate the xarray,

On a write fault we can have an empty xarray slot, so the write fault
needs to both populate the xarray slot (read fault) and process the
write fault.

> My point was that if I want to set the DAX mode I must enforce that
> there are no other parallel users on my inode. The check that the
> xarray is empty is my convoluted way to check that there are no other
> users except me. If xarray is not empty I bail out with EBUSY

Checking the xarray being empty is racy. The moment you drop the
mapping lock, the page fault can populate a slot in the mapping that
you just checked was empty. And then you swap the aops between the
population and the ->page_mkwrite() call in the page fault that is
running, and things go boom.

Unless there's something new in the page fault path that nobody has
noticed in the past couple of years, this TOCTOU race hasn't been
solved....

> Perhaps we always go by the directory. And then do an mv dir_DAX/foo dir_NODAX/foo

The inode is instantiated before the rename is run, so it's set up
with its old dir config, not the new one. So this ends up with the
same problem of having to change the S_DAX flag and aops vector
dynamically on rename. Same problem, not a solution.

> to have an effective change.
> In hard links the first one at iget time before populating
> the inode cache takes effect.

If something like a find or backup program brings the inode into
cache, the app may not even get the behaviour it wants, and it can't
change it until the inode is evicted from cache, which may be never.
Nobody wants implicit/random/uncontrollable/unchangeable behaviour
like this.

> (And never change the flag on the fly)
> (Just brain storming here)

We went over all this ground when we disabled the flag in the first
place. We disabled the flag because we couldn't come up with a sane
way to flip the ops vector short of tracking the number of aops
calls in progress at any given time, i.e. reference counting the
aops structure, but that's hard to do with a const ops structure,
and so it got disabled rather than allowing users to crash
kernels....

Cheers,
-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx