Re: mdadm seems not to be doing rewrites on unreadable blocks

On Tue, Nov 30, 2010 at 3:52 AM, Neil Brown <neilb@xxxxxxx> wrote:
> On Mon, 29 Nov 2010 15:23:56 +0000 Philip Hands <phil@xxxxxxxxx> wrote:
>
>> Hi,
>>
>> I have a server with some 2TB disks, that are partitioned, and those
>> partitions assembled as RAID1's.
>>
>> One of the disks has been showing non-zero Current_Pending_Sectors in
>> smart, so I've added more disks to the machine, partitioned one of the
>> new disks, and added each of its partitions to the relevant RAID,
>> growing the raid to three devices to force the data to be written to the
>> new disk.
>>
>> Initially, I did this under single user mode, so that was the only thing
>> going on on the machine.
>>
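(Just checking I follow -- that grow step would be roughly this, I
guess; a sketch, with the md device and partition names assumed:)

  root#  mdadm /dev/md0 --add /dev/sdc1
  root#  mdadm --grow /dev/md0 --raid-devices=3
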
>> One of the old drives (/dev/sda at the time, and the first disk in the
>> RAID0) then started throwing lots of errors, each of which seemed to
>> take a long time to resolve -- watching this made me think that, under the
>> circumstances, rather than continuing to read only from /dev/sda, it
>> might be bright to try reading from /dev/sdb (the other original disk)
>> in order to provide the data for /dev/sdc (the new disk).
>
> I assume you mean "RAID1" where you wrote "RAID0" ??
>
> md has no knowledge of IO taking a long time.  If it works, it works.  If it
> doesn't, md tries to recover.  If it got a read error it should certainly try
> to read from a different device and write the data back.
>
>>
>> Also, I got the impression that the data on the unreadable blocks was
>> not being written back to /dev/sda once it was finally read from
>> /dev/sdb (although confirming that wasn't easy when on the console, with
>> errors pouring up the screen, and the system being rather unresponsive,
>> so I rebooted -- after the reboot, it seemed to be getting along better,
>> so I put it back in production).
>>
>> After waiting the several days it took to allow the third disk to be
>> populated with data, I thought I'd try forcing the unreadable sectors to
>> be written, to get them remapped if they were really bad, or just to get
>> rid of the Current_Pending_Sector count if it was just a case of the
>> sectors being corrupt but the physical sector being OK.
>>
>> [BTW After some rearrangement while I was doing the install, the
>> doubtful disk is now /dev/sdb, while the newly copied disk is /dev/sdc]
>>
>> So choosing one of the sectors in question, I did:
>>
>>   root#  dd bs=512 skip=19087681 seek=19087681 count=1 if=/dev/sdc of=/dev/sdb
>>   dd: writing `/dev/sdb': Input/output error
>>   1+0 records in
>>   0+0 records out
>>   0 bytes (0 B) copied, 11.3113 s, 0.0 kB/s
>
> You should probably have added oflag=direct.
>
>
> When you write 512 byte blocks to a block device, it will read a 4096 byte
> block, update the 512 bytes, and write the 4096 bytes back.
>
>
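(If I read that right, the same copy as above but with oflag=direct
should bypass the page-cache read-modify-write -- an untested sketch,
same sector as in the dd above:)

  root#  dd bs=512 skip=19087681 seek=19087681 count=1 \
             if=/dev/sdc of=/dev/sdb oflag=direct
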
>>
>> Which gives rise to this:
>>
>> [325487.740650] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
>> [325487.740746] ata2.00: irq_stat 0x00060002, device error via D2H FIS
>> [325487.740841] ata2.00: failed command: READ DMA
>
> Yep.  read error while trying to pre-read the 4K block.
Hmm, is this true for any block device? I.e., even if blockdev --getss
reports a sector size of 512 bytes? Or is this related to the page size?
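
For reference, blockdev can report both sizes (a sketch; --getpbsz
needs a newish util-linux):

  root#  blockdev --getss /dev/sdb     # logical sector size
  root#  blockdev --getpbsz /dev/sdb   # physical sector size

My guess, for what it's worth: buffered writes to the device node go
through the page cache in the device's soft blocksize, which the
kernel raises towards the page size when the device size allows, so
the 4 KiB pre-read can happen even where --getss says 512.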

>
>
>> [325487.740924] ata2.00: cmd c8/00:08:40:41:23/00:00:00:00:00/e1 tag 0 dma 4096 in
>> [325487.740925]          res 51/40:00:41:41:23/00:00:01:00:00/e1 Emask 0x9 (media error)
>> [325487.741153] ata2.00: status: { DRDY ERR }
>> [325487.741230] ata2.00: error: { UNC }
>> [325487.749790] ata2.00: configured for UDMA/100
>> [325487.749797] ata2: EH complete
>> [325489.757669] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
>> [325489.757759] ata2.00: irq_stat 0x00060002, device error via D2H FIS
>> [325489.757852] ata2.00: failed command: READ DMA
>> [325489.757936] ata2.00: cmd c8/00:08:40:41:23/00:00:00:00:00/e1 tag 0 dma 4096 in
>> [325489.757937]          res 51/40:00:41:41:23/00:00:01:00:00/e1 Emask 0x9 (media error)
>> [325489.758165] ata2.00: status: { DRDY ERR }
> ....
>
>
>> If I use hdparm's --write-sector on the same sector, it succeeds, and
>> the dd then succeeds (unless there's another sector following that's
>> also bad).  This doesn't end up resulting in Reallocated_Sector_Ct
>> increasing (it's still zero on that disk), so it seems that the disk
>> thinks the physical sector is fine now that it's been written.
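
(For concreteness, that sequence is roughly the following -- a sketch
using the sector from the dd above; hdparm refuses --write-sector
without the safety flag:)

  root#  hdparm --read-sector 19087681 /dev/sdb    # fails: I/O error
  root#  hdparm --yes-i-know-what-i-am-doing \
             --write-sector 19087681 /dev/sdb      # zeroes the sector
  root#  hdparm --read-sector 19087681 /dev/sdb    # now succeeds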
>>
>> I get the impression that for several of the sectors in question,
>> attempting to write the bad sector revealed a sector one or two
>> further into the disk that was also corrupt, so despite writing about 20
>> of them, the Pending sector count has actually gone up from 12 to 32.
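
(Those counts are from the SMART attribute table -- read with
something like this, presumably:)

  root#  smartctl -A /dev/sdb | grep -E 'Pending|Reallocated'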
>>
>> Given all that, it seems like this might be a good test case, so I
>> stopped fixing things in the hope that we'd be able to use the bad
>> blocks for testing.
>>
>> I have failed the disk out of the array though (which might be a bit of
>> a mistake from the testing side of things, but seemed prudent since I'm
>> serving live data from this server).
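
(Presumably via the usual, for each affected array -- array and
partition names assumed:)

  root#  mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1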
>>
>> So, any suggestions about how I can use this for testing, or why it
>> appears that mdadm isn't doing its job as well as it might?  I would
>> think that it should do whatever hdparm's --write-sector does to get the
>> sector writable again, and then write the data back from the good disk,
>> since leaving it with the bad blocks means that the RAID is degraded for
>> those blocks at least.
>
> What exactly did you want to test, and what exactly makes you think md isn't
> doing its job properly?
>
> By the sound of it, the drive is quite sick.
> I'm guessing that you get read errors, md tries to write good data and
> succeeds, but then when you later come to read that block again you get
> another error.
>
> I would suggest using dd (with a large block size) to write zeros all over the
> device, then see if it reads back with no errors.  My guess is that it won't.
>
> NeilBrown
>
>
>
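Something like this, I take it (a sketch -- it destroys everything on
the disk, so only once it is out of all the arrays):

  root#  dd if=/dev/zero of=/dev/sdb bs=1M oflag=direct   # write zeros
  root#  dd if=/dev/sdb of=/dev/null bs=1M iflag=direct   # read back
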
>>
>> If it really cannot rewrite the sector then should it not be declaring
>> the disk faulty?  Not that I think that would be the best thing to do in
>> this circumstance, since it's clearly not _that_ faulty, but blithely
>> carrying on when some of the data is no longer redundant seems broken as
>> well.
>



-- 
Best regards,
[COOLCOLD-RIPN]

