On 2011-05-30, Andreas Dilger <adilger@xxxxxxxxx> wrote:
>> If I use a hardware reset method instead of the kernel syscall, by
>> triggering a watchdog with interrupts locked, or doing a power cycle
>> with a testing machine, the problem does not happen. This led me to
>> think it could be a software failure, rather than the hardware failure I
>> was expecting. After activating the traces in the mmc subsystem, I
>> finally managed to catch write commands to an area outside the partition
>> being tested, which means that the problem is really due to software.
>
> Why don't you dump a stack at that point to see what is causing the
> write? Also, blktrace might be helpful to determine what caused the block
> to be written.

I tried that. Unfortunately, because of the asynchronous I/O framework,
the stack I got was that of the mmc worker thread, not that of the
request originator. But it was a good first step, since it gave me an
error marker and made me notice that the problem is much more common
than I thought. It was only hidden because the writes fell in unused
areas of my boot partition.

Since blktrace runs in userspace, it is liable to be killed during the
reboot process and to give me only partial information. But I finally
found what I wanted: by writing 1 to /proc/sys/vm/block_dump, I am able
to see in the system log the original requests that led to the commands.

From what I see now, it seems that the problem comes from a race
condition on shutdown between pending file system operations on one
side, and partition removal on the other side. It seems that the
partition can be removed while some pending requests are still valid,
and those requests are then handled with the partition offset equal
to 0. This leads to the corruptions I am observing.

I have yet to figure out the events leading to this and find a fix,
since all this is happening in a part I'm not familiar with.
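To illustrate the failure mode described above, here is a minimal Python
sketch, not taken from the original investigation. It parses one
block_dump line from the system log and shows where the write lands on
the whole device with the correct partition offset, and where it lands
if the offset is treated as 0. The log format is an assumption based on
the kernel's block_dump message, where the block number is taken to be
relative to the named partition. The partition table, the sample log
line, and the names parse_line, device_sector and owner are all
hypothetical.

```python
import re

# Assumed block_dump log format:
#   "<comm>(<pid>): WRITE block <sector> on <device> (<count> sectors)"
LINE_RE = re.compile(
    r"(?P<comm>\S+)\((?P<pid>\d+)\): (?P<op>READ|WRITE) "
    r"block (?P<sector>\d+) on (?P<dev>\S+) \((?P<count>\d+) sectors\)"
)

# Hypothetical partition table: name -> (start sector, length in sectors).
PARTITIONS = {
    "mmcblk0p1": (2048, 65536),     # boot partition
    "mmcblk0p2": (67584, 1048576),  # partition under test
}


def parse_line(line):
    """Return the request described by one block_dump line, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "op": m.group("op"),
        "sector": int(m.group("sector")),
        "dev": m.group("dev"),
        "count": int(m.group("count")),
    }


def device_sector(req, offset_lost=False):
    """Map a partition-relative sector to a whole-device sector.

    With offset_lost=True the partition start is treated as 0, which is
    the suspected failure mode.
    """
    start, _length = PARTITIONS[req["dev"]]
    return req["sector"] + (0 if offset_lost else start)


def owner(sector):
    """Name the partition containing a whole-device sector, if any."""
    for name, (start, length) in PARTITIONS.items():
        if start <= sector < start + length:
            return name
    return None


if __name__ == "__main__":
    # Hypothetical log line, for illustration only.
    sample = "[  512.301] flush-179:0(456): WRITE block 4096 on mmcblk0p2 (8 sectors)"
    req = parse_line(sample)
    good = device_sector(req)
    bad = device_sector(req, offset_lost=True)
    print("expected:", good, owner(good))
    print("offset 0:", bad, owner(bad))
```

With these made-up numbers, the write intended for mmcblk0p2 ends up
inside mmcblk0p1 once the offset is lost, which matches the kind of
stray write into the boot partition reported above.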
> Another possibility (I'm not very familiar with MMC hardware, so could
> be bogus) is that the partitions don't align to the hardware/erase
> block size of the underlying device, and a "legitimate" write to one
> partition is causing a read-modify-write into a region of another
> partition, but this isn't being handled correctly?
>

I also had alignment problems, but they only impacted performance, not
correctness.

Thanks for your help,
--
Romain Izard