On Tue, 2017-01-17 at 17:58 -0800, Andiry Xu wrote: > On Tue, Jan 17, 2017 at 3:51 PM, Vishal Verma <vishal.l.verma@xxxxxxxx > m> wrote: > > On 01/17, Andiry Xu wrote: > > > > <snip> > > > > > > > > > > > > The pmem_do_bvec() read logic is like this: > > > > > > > > > > pmem_do_bvec() > > > > > if (is_bad_pmem()) > > > > > return -EIO; > > > > > else > > > > > memcpy_from_pmem(); > > > > > > > > > > Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this > > > > > imply > > > > > that even if a block is not in the badblock list, it still can > > > > > be bad > > > > > and causes MCE? Does the badblock list get changed during file > > > > > system > > > > > running? If that is the case, should the file system get a > > > > > notification when it gets changed? If a block is good when I > > > > > first > > > > > read it, can I still trust it to be good for the second > > > > > access? > > > > > > > > Yes, if a block is not in the badblocks list, it can still cause > > > > an > > > > MCE. This is the latent error case I described above. For a > > > > simple read() > > > > via the pmem driver, this will get handled by memcpy_mcsafe. For > > > > mmap, > > > > an MCE is inevitable. > > > > > > > > Yes the badblocks list may change while a filesystem is running. > > > > The RFC > > > > patches[1] I linked to add a notification for the filesystem > > > > when this > > > > happens. > > > > > > > > > > This is really bad and it makes file system implementation much > > > more > > > complicated. And badblock notification does not help very much, > > > because any block can be bad potentially, no matter it is in > > > badblock > > > list or not. And file system has to perform checking for every > > > read, > > > using memcpy_mcsafe. This is disaster for file system like NOVA, > > > which > > > uses pointer de-reference to access data structures on pmem. Now > > > if I > > > want to read a field in an inode on pmem, I have to copy it to > > > DRAM > > > first and make sure memcpy_mcsafe() does not report anything > > > wrong. > > > > You have a good point, and I don't know if I have an answer for > > this.. > > Assuming a system with MCE recovery, maybe NOVA can add a mce > > handler > > similar to nfit_handle_mce(), and handle errors as they happen, but > > I'm > > being very hand-wavey here and don't know how much/how well that > > might > > work.. > > > > > > > > > No, if the media, for some reason, 'dvelops' a bad cell, a > > > > second > > > > consecutive read does have a chance of being bad. Once a > > > > location has > > > > been marked as bad, it will stay bad till the ACPI clear error > > > > 'DSM' has > > > > been called to mark it as clean. > > > > > > > > > > I wonder what happens to write in this case? If a block is bad but > > > not > > > reported in badblock list. Now I write to it without reading > > > first. Do > > > I clear the poison with the write? Or still require a ACPI DSM? > > > > With writes, my understanding is there is still a possibility that > > an > > internal read-modify-write can happen, and cause a MCE (this is the > > same > > as writing to a bad DRAM cell, which can also cause an MCE). You > > can't > > really use the ACPI DSM preemptively because you don't know whether > > the > > location was bad. The error flow will be something like write causes > > the > > MCE, a badblock gets added (either through the mce handler or after > > the > > next reboot), and the recovery path is now the same as a regular > > badblock. > > > > This is different from my understanding. Right now write_pmem() in > pmem_do_bvec() does not use memcpy_mcsafe(). If the block is bad it > clears poison and writes to pmem again. Seems to me writing to bad > blocks does not cause MCE. Do we need memcpy_mcsafe for pmem stores? You are right, writes don't use memcpy_mcsafe, and will not directly cause an MCE. However a write can cause an asynchronous 'CMCI' - corrected machine check interrupt, but this is not critical, and wont be a memory error as the core didn't consume poison. memcpy_mcsafe cannot protect against this because the write is 'posted' and the CMCI is not synchronous. Note that this is only in the latent error or memmap-store case. > > Thanks, > Andiry > > > > > > > > [1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html > > > > > > > > > > Thank you for the patchset. I will look into it. > > > ��.n��������+%������w��{.n�����{���)��jg��������ݢj����G�������j:+v���w�m������w�������h�����٥