On 01/18, Jan Kara wrote:
> On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> > I do mean that in the filesystem, for every IO, the badblocks will be
> > checked. Currently, the pmem driver does this, and the hope is that the
> > filesystem can do a better job at it. The driver unconditionally checks
> > every IO for badblocks on the whole device. Depending on how the
> > badblocks are represented in the filesystem, we might be able to quickly
> > tell if a file/range has existing badblocks, and error out the IO
> > accordingly.
> >
> > At mount, the fs would read the existing badblocks on the block
> > device, and build its own representation of them. Then during normal
> > use, if the underlying badblocks change, the fs would get a notification
> > that would allow it to also update its own representation.
>
> So I believe we have to distinguish three cases so that we are on the same
> page.
>
> 1) PMEM is exposed only via a block interface for legacy filesystems to
> use. Here, all the bad blocks handling IMO must happen in NVDIMM driver.
> Looking from outside, the IO either returns with EIO or succeeds. As a
> result you cannot ever get rid of bad blocks handling in the NVDIMM driver.

Correct.

> 2) PMEM is exposed for DAX aware filesystem. This seems to be what you are
> mostly interested in. We could possibly do something more efficient than
> what NVDIMM driver does however the complexity would be relatively high and
> frankly I'm far from convinced this is really worth it. If there are so
> many badblocks this would matter, the HW has IMHO bigger problems than
> performance.

Correct, and Dave was of the opinion that once at least XFS has reverse
mapping support (which it now does), adding badblocks information to that
should not be a heavy lift, and should be a better solution.

I suppose I should try to benchmark how much of a penalty the current
badblock checking in the NVDIMM driver imposes. The penalty is not because
there may be a large number of badblocks, but simply because we have to do
this check for every IO, in fact, for every 'bvec' in a bio.

> 3) PMEM filesystem - there things are even more difficult as was already
> noted elsewhere in the thread. But for now I'd like to leave those aside
> not to complicate things too much.

Agreed, that merits consideration and a whole discussion by itself, based
on the points Andiry raised.

> Now my question: Why do we bother with badblocks at all? In cases 1) and 2)
> if the platform can recover from MCE, we can just always access persistent
> memory using memcpy_mcsafe(), if that fails, return -EIO. Actually that
> seems to already happen so we just need to make sure all places handle
> returned errors properly (e.g. fs/dax.c does not seem to) and we are done.
> No need for bad blocks list at all, no slow down unless we hit a bad cell
> and in that case who cares about performance when the data is gone...

Even when we have MCE recovery, we cannot do away with the badblocks list:

1. My understanding is that the hardware's ability to do MCE recovery is
   limited/best-effort, and is not guaranteed. There can be circumstances
   that cause a "Processor Context Corrupt" state, which is unrecoverable.

2. We still need to maintain a badblocks list so that we know which blocks
   need to be cleared (via the ACPI method) on writes.
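
To make the per-IO cost and the clear-on-write step (point 2 above) a bit
more concrete, here is a rough sketch of what the driver does for each bvec
today. This is paraphrased and simplified from drivers/nvdimm/pmem.c, not
quoted verbatim; error handling and the cache flushing details are elided:

        /*
         * Simplified sketch: for each bvec, consult the badblocks list;
         * fail reads that overlap a known-bad range, and on writes clear
         * the poison (ultimately via the ACPI clear-error method) before
         * the range is considered good again.
         */
        static bool is_bad_pmem(struct badblocks *bb, sector_t sector,
                        unsigned int len)
        {
                sector_t first_bad;
                int num_bad;

                if (!bb->count)
                        return false;
                return !!badblocks_check(bb, sector, len / 512,
                                &first_bad, &num_bad);
        }

        static int pmem_do_bvec(struct pmem_device *pmem, struct page *page,
                        unsigned int len, unsigned int off, bool is_write,
                        sector_t sector)
        {
                phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
                void *pmem_addr = pmem->virt_addr + pmem_off;
                /* this lookup is the per-IO cost mentioned above */
                bool bad_pmem = is_bad_pmem(&pmem->bb, sector, len);

                if (!is_write) {
                        if (bad_pmem)
                                return -EIO;
                        /* read path uses memcpy_mcsafe() where available */
                        return read_pmem(page, off, pmem_addr, len);
                }

                write_pmem(pmem_addr, page, off, len);
                if (bad_pmem) {
                        /* writes are how bad blocks get cleared (point 2) */
                        pmem_clear_poison(pmem, pmem_off, len);
                        write_pmem(pmem_addr, page, off, len);
                }
                return 0;
        }
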
> For platforms that cannot recover from MCE - just buy better hardware ;).
> Seriously, I have doubts people can seriously use a machine that will
> unavoidably randomly reboot (as there is always a risk you hit error that
> has not been uncovered by background scrub). But maybe for big cloud
> providers the cost savings may offset for the inconvenience, I don't know.
> But still for that case a bad blocks handling in NVDIMM code like we do
> now looks good enough?

The current handling is good enough for those systems, yes.

>
> 								Honza
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR