On Thu, Dec 15, 2011 at 5:01 AM, Juergen Beisert <jbe@xxxxxxxxxxxxxx> wrote:
> Hi Robert,
>
> Robert Hancock wrote:
>> On 12/14/2011 02:48 AM, Juergen Beisert wrote:
>> > I have a CF card running in true-ide mode connected to a regular PC. This
>> > CF card does wear leveling of its flash memory internally like every
>> > other CF card. With one exception: When the CF's firmware detects a
>> > broken NAND page while writing a sector, it moves around the remaining
>> > (good) data to other pages. To do this job it must discard the already
>> > transmitted sector data in its SRAM, because it needs this SRAM to move
>> > around the other flash memory data.
>> >
>> > After the movement the firmware signals an 'ATA_ERR' in the status
>> > register and an 'ATA_ABORTED' in the error register to force the host to
>> > write the same data again (next time it will be successful because the
>> > internal wear leveling is already done).
>> >
>> > As we see data loss when the systems are running in production, I'm now
>> > trying to find out if the libata/SCSI layer really repeats the sector
>> > write for this case and does the expected (or required) things. But I'm
>> > lost in these software layers and their error path.
>> >
>> > I found (in Documentation/DocBook/libata.tmpl):
>> >
>> > "This is indicated by UNC bit in the ERROR register. ATA
>> > devices reports UNC error only after certain number of
>> > retries cannot recover the data, so there's nothing much
>> > else to do other than notifying upper layer."
>> >
>> > which sounds to me as if no repeat will happen for write errors, but
>> > the 'ATA_UNC' bit is not used to signal the "wear leveling case" shown
>> > above.
>>
>> That seems like incorrect behavior by the device; ABRT is normally used
>> to indicate an invalid or unsupported command. UNC would likely be more
>> appropriate. But I don't think it ultimately makes a difference in this
>> case.
>
> Okay.
>
>> > As far as I understand the ATA errors are transformed to SCSI errors and
>> > then handled in the SCSI layer. But the documentation tells me it is not
>> > easy to always find an adequate SCSI error for an ATA error. So, I'm not
>> > sure if for the "wear leveling case" the SCSI layer receives a "valuable"
>> > error message.
>>
>> From what I can see the SCSI error that gets returned in this case is
>> just an "aborted command" error.
>>
>> > Can anybody give me a hint about what really happens when the attached
>> > drive signals an 'ATA_ABORTED'? Does the libata/SCSI give up in this
>> > case, or will it repeat the command?
>>
>> I don't know that the SCSI or block layers really pay much attention to
>> the error code in this case - I think it would always attempt some retries.
>
> As far as I understand the problem with this kind of error is the
> multi-sector write case. The framework does not know which sectors failed,
> so the question is: does it repeat the whole multi-sector sequence, or what
> else does it do?

The entire request should get retried.

>
>> Certainly any of these errors would result in error messages showing up
>> in dmesg. Are you seeing any of this?
>
> Are they enabled by default? Or more like debug messages? We see broken
> filesystems and data loss, but currently no related messages in the kernel's
> log. This could mean there are no such failures or the messages are not
> enabled.

They should always be enabled. If you don't get any, then presumably the
device is not raising any errors.
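
To make the retry behavior discussed above concrete, here is a toy
user-space model in Python. It is not kernel code and does not claim to
mirror the libata error handler. The bit values are the ones defined in
include/linux/ata.h; the FlakyCard class, the write_with_retry helper and
the retry limit of 3 are made up for illustration only. It shows a device
that aborts a write once (ERR in status, ABRT in error) and a host that
resends the entire multi-sector request.

```python
# Bit values as defined in the Linux kernel header include/linux/ata.h.
ATA_ERR = 1 << 0      # status register: the command ended with an error
ATA_DRDY = 1 << 6     # status register: device ready
ATA_ABORTED = 1 << 2  # error register: command aborted (ABRT)


class FlakyCard:
    """Hypothetical model of the CF card behavior described in the thread."""

    def __init__(self, bad_writes):
        self.bad_writes = bad_writes  # how many writes will be aborted
        self.sectors = {}             # LBA -> sector data actually stored

    def write(self, lba, data):
        """Write one sector per item in data; return (status, error)."""
        if self.bad_writes > 0:
            self.bad_writes -= 1
            # The firmware dropped the transferred data to relocate pages,
            # so nothing from this request is stored.
            return ATA_DRDY | ATA_ERR, ATA_ABORTED
        for offset, sector in enumerate(data):
            self.sectors[lba + offset] = sector
        return ATA_DRDY, 0


def write_with_retry(card, lba, data, max_retries=3):
    """Resend the whole request until it succeeds or retries run out.

    Returns the number of retries that were needed.
    """
    for attempt in range(1 + max_retries):
        status, error = card.write(lba, data)
        if not status & ATA_ERR:
            return attempt
    raise IOError(f"write failed: status={status:#04x} error={error:#04x}")
```

With a card that aborts exactly one write, write_with_retry(FlakyCard(1),
100, [b"a", b"b"]) needs a single retry and both sectors end up stored. If
the host gave up after the first ABRT instead, none of the request's sectors
would be on the card, which is the kind of loss reported above.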