Hi, Slava,

Slava Dubeyko <Vyacheslav.Dubeyko@xxxxxxx> writes:

>> The data is lost, that's why you're getting an ECC. It's tantamount
>> to -EIO for a disk block access.
>
> I see three possible cases here:
> (1) bad block has been discovered (no remap, no recovering) -> data is
>     lost; -EIO for a disk block access, block is always bad;

This is, of course, a possibility. In that case, attempts to clear the
error will not succeed.

> (2) bad block has been discovered and remapped -> data is lost; -EIO
>     for a disk block access.

Right, and the error is cleared when new data is provided (i.e. through
a write system call or fallocate).

> (3) bad block has been discovered, remapped and recovered -> no data
>     is lost.

This is transparent to the OS and the application.

>>> Let's imagine that the affected address range will equal 64 bytes.
>>> It sounds to me that, for the case of a block device, it will affect
>>> the whole logical block (4 KB).
>>
>> 512 bytes, and yes, that's the granularity at which we track errors
>> in the block layer, so that's the minimum amount of data you lose.
>
> I think it depends on what granularity the hardware supports. It could
> be 512 bytes, 4 KB, maybe greater.

Of course, though I expect the ECC protection in the NVDIMMs to cover a
range much smaller than a page.

>>> The situation is more critical for the case of the DAX approach.
>>> Correct me if I am wrong, but my understanding is that the goal of
>>> DAX is to provide direct access to a file's memory pages with
>>> minimal file system overhead. So it looks like raising a bad block
>>> issue at the file system level will affect a user-space application,
>>> because, finally, the user-space application will need to handle
>>> such trouble (the bad block issue). That sounds like a really weird
>>> situation to me. What can protect a user-space application from
>>> encountering the issue of a partially incorrect memory page?
>>
>> Applications need to deal with -EIO today. This is the same sort of
>> thing. If an application trips over a bad block during a load from
>> persistent memory, they will get a signal, and they can either handle
>> it or not.
>>
>> Have a read through this specification and see if it clears anything
>> up for you:
>> http://www.snia.org/tech_activities/standards/curr_standards/npm
>
> Thank you for sharing this. So, if a user-space application follows
> the NVM Programming Model, then it will be able to survive by catching
> and processing the exceptions. But such applications have yet to be
> implemented, and they will also need special recovery techniques. It
> sounds like legacy user-space applications are unable to survive in
> the NVM.PM.FILE mode in the case of a load/store operation failure.

By legacy, I assume you mean those applications which mmap file data and
use msync. Those applications already have to deal with SIGBUS today
when a disk block is bad. There is no change in behavior.

If you meant legacy applications that use read/write, they also should
see no change in behavior. Bad blocks are tracked in the block layer,
and any attempt to read from a bad area of memory will get -EIO.

Cheers,
Jeff
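
P.S. For illustration only, here is a rough, untested sketch of how an
application written to the NVM.PM.FILE mode might catch the SIGBUS from
a poisoned load and then clear the error by providing new data through
write(2). The file path, sizes, and the source of the replacement data
are made up for the example.

/* Untested sketch -- the path and sizes below are hypothetical. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf recover_point;

static void bus_handler(int sig, siginfo_t *si, void *ctx)
{
    /* si->si_addr is the poisoned address; when the error is raised on
     * an actual load, the kernel sets si_code to BUS_MCEERR_AR. */
    (void)sig; (void)si; (void)ctx;
    siglongjmp(recover_point, 1);
}

int main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = bus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, NULL);

    int fd = open("/mnt/pmem/data", O_RDWR);    /* hypothetical file */
    if (fd < 0)
        return 1;

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    if (sigsetjmp(recover_point, 1) == 0) {
        /* A load from a poisoned range delivers SIGBUS instead of
         * returning -EIO. */
        printf("first byte: %d\n", p[0]);
    } else {
        /* Provide new data (zeros here, standing in for data recovered
         * from a backup) through write(2) so the driver can clear the
         * bad block; only then is retrying the load useful. */
        char fresh[512] = { 0 };
        pwrite(fd, fresh, sizeof(fresh), 0);
        fsync(fd);
    }

    munmap(p, 4096);
    close(fd);
    return 0;
}

The point is that consuming the poisoned location shows up as SIGBUS
rather than -EIO, and the error only goes away once new data has been
written through the driver (write or fallocate); simply retrying the
load will fail again.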