On Tue, May 03, 2016 at 10:28:15AM -0700, Dan Williams wrote: > On Mon, May 2, 2016 at 6:51 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > On Mon, May 02, 2016 at 04:25:51PM -0700, Dan Williams wrote: > [..] > > Yes, I know, and it doesn't answer any of the questions I just > > asked. What you just told me is that there is something that is kept > > three levels of abstraction away from a filesystem. So: > > Ok, let's answer them. > > A lot of your questions seem to assume the filesystem has a leading > role to play with error recovery, that isn't the case with traditional > disk errors and we're not looking to change that situation. *cough* BTRFS New filesystems are mostly being designed with redundancy and recovery mechanisms built into them. Hence the high level /assumption/ that filesystems aren't going to play a significant role in error recovery for pmem storage is, well, somewhat disturbing.... > The > filesystem can help with forensics after an error escapes the kernel > and is communicated to userspace, but the ability to reverse map a > sector to a file is just a convenience to identify potential data > loss. So working out what file got corrupted in your terabytes of pmem storage is "just a convenience"? I suspect that a rather large percentage of admins will disagree with you on this. > For redundancy in the DAX case I can envision DAX-aware RAID that > makes the potential exposure to bad blocks smaller, but it will always > be the case that the array can be out-of-sync / degraded and has no > choice but to communicate the error to userspace. So, the answers > below address what we do when we are in that state, and include some > thoughts about follow-on enabling we can do at the DM/MD layer. > > > - What mechanism is to be used for the underlying block > > device to inform the filesytem that a new bad block was > > added to this list? > > The filesystem doesn't need this notification and doesn't get it today > from RAID. Why doesn't the filesystem need this notification? Just because we don't get it today from a RAID device does not mean we can't use it. Indeed, think about the btrfs scrub operation - it validates everything on it's individual block devices, and when it finds a problem (e.g. a data CRC error) it notifies a different layer in the btrfs code that goes and works out if it can repair the problem from redundant copies/parity/mirrors/etc. > It's handy for the bad block list to be available to > fs/dax.c and the block layer, but I don't see ext4/xfs having a role > to play with the list and certainly not care about "new error detected > events". That's very short-sighted. Just because ext4/xfs don't *currently* do this, it doesn't mean other filesystems (existing or new) can't make use of notifications, nor that ext4/XFS can't ever make use of it, either. > For a DM/MD driver it also does not need to know about new > errors because it will follow the traditional disk model where errors > are handled on access, or discovered and scrubbed during a periodic > array scan. > > That said, new errors may get added to the list by having the pmem > driver trigger a rescan of the device whenever a latent error is > discovered (i.e. memcpy_from_pmem() returns -EIO). The update of the > bad block list is asynchronous. We also have a task on the todo list > to allow the pmem rescan action to be triggered via sysfs. IOWs, the pmem driver won't report errors to anyone who can correct them until an access to that bad block is made? Even if it means the error might go unreported and hence uncorrected for weeks or months because no access is made to that bad data? > > What context comes along with that > > notification? > > The only notification the file system gets is -EIO on access. > However, assuming we had a DAX-aware RAID driver what infrastructure > would we need to prevent SIGBUS from reaching the application if we > happened to have a redundant copy of the data? We'd need the same infrastructure at the filesystem layer would require if it has a redundant copy of the data. I don't know what that is, however, because I know very little about about MCEs and signal delivery (which is why I asked this question). [....] > The in-kernel recovery path, assuming RAID is present, needs more > thought especially considering the limited NMI context of a machine > check notification and the need to trap back into driver code. This is precisely the problem I am asking about - I know there is a limited context, but how exactly is it limited and what can we actually do from this context? e.g. Can we schedule recovery work on other CPU cores and wait for it to complete in a MCE notification handler? > I see > the code in fs/dax.c getting involved to translate a > process-physical-address back to a sector, but otherwise the rest of > the filesystem need not be involved. More /assumptions/ about filesystems not containing or being able to recover from redudant copies of data. > > > - how does the filesystem query the bad block list without > > adding layering violations? > > Why does the file system need to read the list? > Apologies for answering this question with a question, but these > patches don't assume the filesystem will do anything with a bad block > list. People keep talking about FIEMAP reporting bad blocks in files! How the fuck are we supposed to report bad blocks in a file via FIEMAP if the filesystem can't access the bad block list? > > - Is the filesystem expectd to find the active application or > > address_space access that triggered the bad block > > notification to handle them correctly? (e.g. prevent a > > page fault from failing because we can recover from the > > error immediately) > > With these patches no, but it would be nice to incrementally add that > ability. I.e. trap machine check faults on non-anonymous memory and > send a request down the stack to recover the sector if the storage > layer has a redundant copy. Again, fs/dax.c would need extensions to > do this coordination, but I don't foresee the filesystem getting > involved beyond that point. Again, the /assumption/ here is that only the block layer has the ability to recover, and only sector mapping is required from the fs. > > - what exactly is the filesystem supposed to do with the bad > > block? e.g: > > - is the block persistently bad until the filesystem > > rewrites it? Over power cycles? Will we get > > multiple notifications (e.g. once per boot)? > > Bad blocks on persistent memory media remain bad after a reboot. Per > the ACPI spec the DIMM device tracks the errors and reports them in > response to an "address range scrub" command. Each boot the libnvdimm > sub-system kicks off a scrub and populates the bad block list per pmem > namespace. As mentioned above, we want to add the ability to > re-trigger this scrub on-demand, in response to a memcpy_from_pmem() > discovering an error, or after a SIGBUS is communicated to userspace. > > > - Is the filesystem supposed to intercept > > reads/writes to bad blocks once it knows about > > them? > > No, the driver handles that. So, -EIO will be returned to the filesystem on access? If -EIO, then we'll have to check over the bad block list to determine if data recovery operations are required, right? Perhaps we need a different error here to tell the higher layers it's a specific type of error (e.g. -EBADBLOCK)? > > - how is the filesystem supposed to communicate that > > there is a bad block in a file back to userspace? > > -EIO on access. So no consideration for proactive "data loss has occurred at offset X in file /mnt/path/to/file, attempting recovery" messages when the error is first detected by the lowest layers? > > Or is userspace supposed to infer that there's a > > bad block from EIO and so has to run FIEMAP to > > determine if the error really was due to a bad > > block? > > The information is there to potentially do forensics on why an I/O > encountered an error, but there is no expectation that userspace > follow up on each -EIO with a FIEMAP. Ok, so how is userspace driven error recovery supposed to work if it can't differentiate the cause of an EIO error? If there's no requirement for FIEMAP to report the bad blocks in a file that needs recovery, then what is the app supposed to do with the EIO? Indeed, what consideration has been given to ensuring the app knows aheadi of time that the filesystem FIEMAP implementation will report bad blocks if they exist? Of course, the filesystem has to know about the bad blocks to be able to do any of this with FIEMAP.... > > - what happens if there is no running application > > that we can report the error to or will handle the > > error (e.g. found error by a media scrub or during > > boot)? > > Same as RAID today, if the array is in sync the bad block will get > re-written during the scrub hopefully in advance of when an > application might discover it. If no RAID is present then the only > notification is an on-access error. More /assumptions/ that only device level RAID will be able to recover.... > > - if the bad block is in filesystem free space, what should > > the filesystem do with it? > > Nothing. When the free space becomes allocated we rely on the fact > that the filesystem will first zero the blocks. That zeroing process > will clear the media error. The incorrect /assumption/ here is that all allocations will do block zeroing first. That's simply wrong. We do that for *user data* in XFS and ext4, but we do not do it for metadata as they are not accessed by DAX and, being transactionally protected, don't need zeroing to prevent stale data exposure. Hence we have a problem here - the first write to such blocks may be metadata writeback of some type and so the filesystem will see EIO errors in metadata writes and they'll freak out. What now - does this really mean that we'll have to add special IO falback code for all internal IO paths to be able to clear pmem bad block errors? Oh, and just a thought: lots of people are pushing for selectable FALLOC_FL_NO_HIDE_STALE behaviour which will skip zeroing of data blocks on allocation. If this happens, it we also skip the zeroing on allocation, so again there is no mechanism to clear bad block status in this case..... [...] > > However, if I'm struggling to understand how I'm supposed to > > connecct up the parts inside a filesytem, then expecting application > > developers to be able to connect the dots in a sane manner is > > bordering on fantasy.... > > Hopefully it is becoming clearer that we are not proposing anything > radically different than what is present for error recovery today > modulo thinking about the mechanisms to trap and recover a DAX read of > a bad media area via a DM/MD implementation. There is a radical difference - there is a pervasive /assumption/ in what is being proposed that filesystems are incapable of storing redundant information that can be used for error recovery. The current IO stack makes no such assumptions, even if it doesn't provide any infrastructure for such functionality. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html