On Mon, May 2, 2016 at 4:04 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Mon, May 02, 2016 at 11:18:36AM -0400, Jeff Moyer wrote:
>> Dave Chinner <david@xxxxxxxxxxxxx> writes:
>>
>> > On Mon, Apr 25, 2016 at 11:53:13PM +0000, Verma, Vishal L wrote:
>> >> On Tue, 2016-04-26 at 09:25 +1000, Dave Chinner wrote:
>> > You're assuming that only the DAX aware application accesses its
>> > files. users, backup programs, data replicators, filesystem
>> > re-organisers (e.g. defragmenters) etc all may access the files and
>> > they may throw errors. What then?
>>
>> I'm not sure how this is any different from regular storage. If an
>> application gets EIO, it's up to the app to decide what to do with that.
>
> Sure - they'll fail. But the question I'm asking is that if the
> application that owns the data is supposed to do error recovery,
> what happens when a 3rd party application hits an error? If that
> consumes the error, the app that owns the data won't ever get a
> chance to correct the error.
>
> This is a minefield - a 3rd party app that swallows and clears DAX
> based IO errors is a data corruption vector. can you imagine if
> *grep* did this? The model that is being promoted here effectively
> allows this sort of behaviour - I don't really think we
> should be architecting an error recovery strategy that has the
> capability to go this wrong....

Since when does grep write to a file on error?

>
>> >> > Where does the application find the data that was lost to be able to
>> >> > rewrite it?
>> >>
>> >> The data that was lost is gone -- this assumes the application has some
>> >> ability to recover using a journal/log or other redundancy - yes, at the
>> >> application layer. If it doesn't have this sort of capability, the only
>> >> option is to restore files from a backup/mirror.
>> >
>> > So the architecture has a built in assumption that only userspace
>> > can handle data loss?
>>
>> Remember that the proposed programming model completely bypasses the
>> kernel, so yes, it is expected that user-space will have to deal with
>> the problem.
>
> No, it doesn't completely bypass the kernel - the kernel is the
> infrastructure that catches the errors in the first place, and it
> owns and controls all the metadata that corresponds to the physical
> location of that error. The only thing the kernel doesn't own is the
> *contents* of that location.
>
>> > What about filesystems like NOVA, that use log structured design to
>> > provide DAX w/ update atomicity and can potentially also provide
>> > redundancy/repair through the same mechanisms? Won't pmem native
>> > filesystems with built in data protection features like this remove
>> > the need for adding all this to userspace applications?
>>
>> I don't think we'll /only/ support NOVA for pmem. So we'll have to deal
>> with this for existing file systems, right?
>
> Yes, but that misses my point that it seems that the design is only
> focussed on userspace and existing filesystems and there is no
> consideration of kernel side functionality that could do transparent
> recovery....
>
>> > If so, shouldn't that be the focus of development rather than
>> > placing the burden on userspace apps to handle storage repair
>> > situations?
>>
>> It really depends on the programming model. In the model Vishal is
>> talking about, either the applications themselves or the libraries they
>> link to are expected to implement the redundancies where necessary.
>
> IOWs, filesystems no longer have any control over data integrity.
> Yet it's the filesystem developers who will still be responsible for
> data integrity and when the filesystem has a data corruption event
> we'll get blamed and the filesystem gets a bad name, even though
> it's entirely the application's fault. We've seen this time and time
> again - application developers cannot be trusted to guarantee data
> integrity. yes, some apps will be fine, but do you really expect
> application devs that refuse to use fsync because it's too slow are
> going to have a different approach to integrity when it comes to
> DAX?

Yes, completely agree. The applications that will implement competent
error recovery with these mechanisms will be vanishingly small, and
there is definite room for a kernel data-redundancy solution that
builds on these patches.

>
>> >> > There's an implicit assumption that applications will keep redundant
>> >> > copies of their data at the /application layer/ and be able to
>> >> > automatically repair it?
>>
>> That's one way to do things. It really depends on the application what
>> it will do for recovery.
>>
>> >> > And then there's the implicit assumption that it will unlink and
>> >> > free the entire file before writing a new copy
>>
>> I think Vishal was referring to restoring from backup. cp itself will
>> truncate the file before overwriting, iirc.
>
> Which version of cp? What happens if they use --sparse and the error
> is in a zeroed region? There's so many assumptions about undefined userspace
> environment, application and user behaviour being made here, and
> it's all being handwaved away.
>
> I'm asking for this to be defined, demonstrated and documented as a
> working model that cannot be abused and doesn't have holes the size
> of trucks in it, not handwaving...

You lost me... how are these patches abusing the existing semantics
of -EIO and write to clear?
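
To make "write to clear" concrete, the application-side sequence looks
roughly like the sketch below. The device path, the bad sector number,
and the 512-byte sector size are placeholders, and it assumes the
application has a good copy of the data to write back -- treat it as an
illustration of the model, not as code from the patches:

/*
 * Sketch of the -EIO / write-to-clear model. Assumptions: /dev/pmem0,
 * a 512-byte logical sector size, and a known-bad LBA to repair.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SECTOR_SIZE 512

int main(void)
{
        off_t bad_sector = 1024;                /* hypothetical bad LBA */
        off_t off = bad_sector * SECTOR_SIZE;
        void *buf;
        int fd;

        if (posix_memalign(&buf, SECTOR_SIZE, SECTOR_SIZE))
                return 1;

        /* O_DIRECT so the I/O goes through the driver, not the page cache */
        fd = open("/dev/pmem0", O_RDWR | O_DIRECT);
        if (fd < 0)
                return 1;

        /* 1. A read of the poisoned sector fails with EIO... */
        if (pread(fd, buf, SECTOR_SIZE, off) < 0)
                perror("pread");                /* expect EIO */

        /* 2. ...the application rewrites the whole sector from its own
         *    redundant copy, and the driver drops the bad-block entry. */
        memset(buf, 0, SECTOR_SIZE);            /* stand-in for recovered data */
        if (pwrite(fd, buf, SECTOR_SIZE, off) != SECTOR_SIZE)
                perror("pwrite");

        /* 3. Subsequent reads of that sector succeed again. */
        if (pread(fd, buf, SECTOR_SIZE, off) != SECTOR_SIZE)
                perror("pread after clear");

        close(fd);
        free(buf);
        return 0;
}

Nothing in that sequence changes what -EIO or a block write has always
meant; the application just has to do the write at sector granularity.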

>> >> To summarize, the two cases we want to handle are:
>> >> 1. Application has inbuilt recovery:
>> >>    - hits badblock
>> >>    - figures out it is able to recover the data
>> >>    - handles SIGBUS or EIO
>> >>    - does a (sector aligned) write() to restore the data
>> >
>> > The "figures out" step here is where >95% of the work we'd have to
>> > do is. And that's in filesystem and block layer code, not
>> > userspace, and userspace can't do that work in a signal handler.
>> > And it can still fall down to the second case when the application
>> > doesn't have another copy of the data somewhere.
>>
>> I read that "figures out" step as the application determining whether or
>> not it had a redundant copy.
>
> Another undocumented assumption, that doesn't simplify what needs to
> be done. Indeed, userspace can't do that until it is in SIGBUS
> context, which tends to imply applications need to do a major amount
> of work from within the signal handler....
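
The signal handler itself does not need to do the heavy lifting. It
only has to capture the faulting address and get out; the "figure out
whether I have a redundant copy and rewrite it" step runs in normal
program context afterwards. A minimal sketch of that split -- the file
path and mapping size are made up for illustration, and the actual
recovery would be the sector-aligned write shown earlier:

/*
 * SIGBUS side of the model: the handler records the poisoned address
 * and longjmps out; all recovery policy stays in normal context.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf poison_env;
static void * volatile poison_addr;

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
        (void)sig;
        (void)ctx;
        poison_addr = si->si_addr;      /* page backed by bad media */
        siglongjmp(poison_env, 1);      /* defer all real work */
}

int main(void)
{
        struct sigaction sa;
        size_t len = 4096;                      /* assumed mapping size */
        int fd = open("/mnt/dax/data", O_RDWR); /* hypothetical DAX file */
        char *p;

        if (fd < 0)
                return 1;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = sigbus_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, NULL);

        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        if (sigsetjmp(poison_env, 1) == 0) {
                /* a load from a poisoned page in a DAX mapping raises SIGBUS */
                volatile char c = p[0];
                (void)c;
        } else {
                /* back in normal context: consult redundancy, then do the
                 * sector-aligned write through the block device to clear it */
                fprintf(stderr, "poison at %p, starting recovery\n",
                        poison_addr);
        }

        munmap(p, len);
        close(fd);
        return 0;
}

Whether applications will actually write this is the fair part of the
objection, but the work that has to happen inside the handler is small.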

>
>> > FWIW, we don't have a DAX enabled filesystem that can do
>> > reverse block mapping, so we're a year or two away from this being a
>> > workable production solution from the filesystem perspective. And
>> > AFAICT, it's not even on the roadmap for dm/md layers.
>>
>> Do we even need that? What if we added an FIEMAP flag for determining
>> bad blocks.
>
> So you're assuming that the filesystem has been informed of the bad
> blocks and has already marked the bad regions of the file in its
> extent list?
>
> How does that happen? What mechanism is used for the underlying
> block device to inform the filesystem that there's a bad LBA, and how
> does the filesystem then map that to a path/file/offset with reverse
> mapping? Or is there some other magic that hasn't been explained
> happening here?

In 4.5 we added this:

commit 99e6608c9e7414ae4f2168df8bf8fae3eb49e41f
Author: Vishal Verma <vishal.l.verma@xxxxxxxxx>
Date:   Sat Jan 9 08:36:51 2016 -0800

    block: Add badblock management for gendisks

    NVDIMM devices, which can behave more like DRAM rather than block
    devices, may develop bad cache lines, or 'poison'. A block device
    exposed by the pmem driver can then consume poison via a read (or
    write), and cause a machine check. On platforms without machine
    check recovery features, this would mean a crash.

    The block device maintaining a runtime list of all known sectors
    that have poison can directly avoid this, and also provide a path
    forward to enable proper handling/recovery for DAX faults on such
    a device.

    Use the new badblock management interfaces to add a badblocks list
    to gendisks.

    Signed-off-by: Vishal Verma <vishal.l.verma@xxxxxxxxx>
    Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
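
That list is how the block device answers "which LBAs are bad" without
needing filesystem reverse mapping up front. Assuming the per-gendisk
list is exported to userspace as one "start length" pair per line in
512-byte sectors, e.g. /sys/block/pmem0/badblocks (treat the path and
format here as an assumption, not something spelled out in that commit),
a tool, library, or filesystem utility can enumerate the bad ranges
before touching the data:

/*
 * Sketch: enumerate a gendisk's badblocks list from userspace,
 * assuming a sysfs file of "start length" pairs in 512-byte sectors.
 */
#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/sys/block/pmem0/badblocks", "r");
        unsigned long long start;
        unsigned int len;

        if (!f) {
                perror("open badblocks");
                return 1;
        }

        /* each range is media the driver knows is poisoned; reads of
         * these sectors fail with -EIO until they are rewritten */
        while (fscanf(f, "%llu %u", &start, &len) == 2)
                printf("bad range: sector %llu, %u sectors\n", start, len);

        fclose(f);
        return 0;
}

Mapping those sectors back to a file and offset is, as you say, the
filesystem's half of the problem (reverse mapping, or a FIEMAP flag as
Jeff suggests).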