On Fri, 2012-08-10 at 00:22 +0800, Peng Tao wrote: > On Thu, Aug 9, 2012 at 11:39 PM, Myklebust, Trond > <Trond.Myklebust@xxxxxxxxxx> wrote: > > On Thu, 2012-08-09 at 23:01 +0800, Peng Tao wrote: > >> On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond > >> <Trond.Myklebust@xxxxxxxxxx> wrote: > >> > On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote: > >> >> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust > >> >> <Trond.Myklebust@xxxxxxxxxx> wrote: > >> >> > Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence > >> >> > disconnected data server) we've been sending layoutreturn calls > >> >> > while there is potentially still outstanding I/O to the data > >> >> > servers. The reason we do this is to avoid races between replayed > >> >> > writes to the MDS and the original writes to the DS. > >> >> > > >> >> > When this happens, the BUG_ON() in nfs4_layoutreturn_done can > >> >> > be triggered because it assumes that we would never call > >> >> > layoutreturn without knowing that all I/O to the DS is > >> >> > finished. The fix is to remove the BUG_ON() now that the > >> >> > assumptions behind the test are obsolete. > >> >> > > >> >> Isn't MDS supposed to recall the layout if races are possible between > >> >> outstanding write-to-DS and write-through-MDS? > >> > > >> > Where do you read that in RFC5661? > >> > > >> That's my (maybe mis-)understanding of how server works... But looking > >> at rfc5661 section 18.44.3. layoutreturn implementation. > >> " > >> After this call, > >> the client MUST NOT use the returned layout(s) and the associated > >> storage protocol to access the file data. > >> " > >> And given commit 0a57cdac3f, client is using the layout even after > >> layoutreturn, which IMHO is a violation of rfc5661. > > > > No. It is using the layoutreturn to tell the MDS to fence off I/O to a > > data server that is not responding. It isn't attempting to use the > > layout after the layoutreturn: the whole point is that we are attempting > > write-through-MDS after the attempt to write through the DS timed out. > > > But it is RFC violation that there is in-flight DS IO when client > sends layoutreturn, right? Not just in-flight, client is well possible > to send IO to DS _after_ layoutreturn because some thread can hold > lseg reference and not yet send IO. Once the write has been sent, how do you know that it is no longer 'in-flight' unless the DS responds? > >> >> And it causes data corruption for blocklayout if client returns layout > >> >> while there is in-flight disk IO... > >> > > >> > Then it needs to turn off fast failover to write-through-MDS. > >> > > >> If you still consider it following rfc5661, I'd choose to disable > >> layoutreturn in before write-through-MDS for blocklayout, by adding > >> some flag like PNFS_NO_LAYOUTRET_ON_FALLTHRU similar to objects' > >> PNFS_LAYOUTRET_ON_SETATTR. > > > > I don't see how that will prevent corruption. > > > > In fact, IIRC, in NFSv4.2 Sorin's proposed changes specifically use the > > layoutreturn to communicate to the MDS that the DS is timing out via an > > error code (like the object layout has done all the time). How can you > > reconcile that change with a flag such as the one you propose? > I just intend to use the flag to disable layoutreturn in > pnfs_ld_write_done. block extents are data access permissions per > rfc5663. When we don't layoutreturn in pnfs_ld_write_done(), block > layout works correctly because server can decide if there is data > access race and if there is, MDS can recall the layout from client > before applying the MDS writes. > > Sorin's proposed error code is just a client indication to server that > there is disk access error. It is not intended to solve the data race > between write-through-MDS and write-through-DS. Then how do you solve that race on a block device? -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@xxxxxxxxxx www.netapp.com ��.n��������+%������w��{.n�����{��w���jg��������ݢj����G�������j:+v���w�m������w�������h�����٥