On Fri, 7 Mar 2014 14:18:59 +0530 "Manibalan P" <pmanibalan@xxxxxxxxxxxxxx> wrote:

> Hi,

Hi,
 when posting to vger.kernel.org lists, please don't send HTML mail, just
plain text. Because you did, the original email didn't get to the list.

> We are facing a data integrity issue on RAID 6. On CentOS 6.4 kernel.

I don't know what kernel "CentOS 6.4" runs. Please report the actual kernel
version as well as distro details.

> Details of the setup:
>
> 1. 7 drives Raid6 md devices (md0) - Capacity 25 GB
> 2. Resync speed max and min set to 100000 (100Mb)
> 3. A script is running to simulate drive failure, this script will
>    do the following
>    a. Mdadm set faulty for two random drives on the md, then mdadm
>       remove those drives.
>    b. Mdadm add one drive, and wait for rebuild to complete, then
>       insert the next one.
>    c. Wait till the md becomes optimal, and continue the disk removal
>       cycle again.
> 4. iSCSI target is configured to "/dev/md0"
> 5. From Windows server, the md0 target is connected using
>    Microsoft iSCSI initiator, and formatted with NTFS.
> 6. Dit32 IO tool is running on the formatted volume.
>
> Issue#:
>    The Dit32 tool will run IO in multiple threads, in
>    each thread, IO will be written and verified.
>    And on the verification cycle, we are getting a
>    mis-compare. Below is the log from the dit32 tool.
>
> Thu Mar 06 23:19:31 2014 INFO: DITNT application started
> Thu Mar 06 23:20:19 2014 INFO: Test started on Drive D:
>     Dir Sets=8, Dirs per Set=70, Files per Dir=75
>     File Size=512KB
>     Read Only=N, Debug Stamp=Y, Verify During Copy=Y
>     Build I/O Size range=1 to 128 sectors
>     Copy Read I/O Size range=1 to 128 sectors
>     Copy Write I/O Size range=1 to 128 sectors
>     Verify I/O Size range=1 to 128 sectors
> Fri Mar 07 01:28:09 2014 ERROR: Miscompare Found: File
>     "D:\dit\s6\d51\s6d51f37", offset=00048008
>     Expected Data: 06 33 25 01 0240 (dirSet, dirNo, fileNo, elementNo,
>     sectorOffset)
>     Read Data: 05 08 2d 01 0240 (dirSet, dirNo, fileNo, elementNo,
>     sectorOffset)
>     Read Request: offset=00043000, size=00008600
>
> This mail has been attached with the following files for your reference
>
> 1. Raid5.c and .h files, the Code what we are using.
> 2. RollingHotSpareTwoDriveFailure.sh - the script which simulates
>    the two disk failure.
> 3. dit32log.sav - Log file from the dit32 tool
> 4. s6d31f37 - the file where the corruption happened(hex format)
> 5. CentOS-system-info - md and system info

I didn't find any "CentOS-system-info" attached.

I know nothing about "dit32" and so can not easily interpret the output.
Is it saying that just a few bytes were wrong?

Was the array fully synced before you started the test?

I can't think of anything else that might cause an inconsistency. I test the
RAID6 recovery code from time to time and it always works flawlessly for me.

NeilBrown

> Thanks,
> Manibalan.
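
One possible reading of the miscompare line in the quoted log, sketched in
Python. Everything here is inferred from the log itself and is not documented
dit32 behaviour: the stamp fields are assumed to be hexadecimal, the file name
is assumed to follow s<dirSet>d<dirNo>f<fileNo> in decimal (which does match
the expected stamp and the reported file name), and a sector is assumed to be
512 bytes. The helper name decode_stamp is hypothetical.

```python
SECTOR_SIZE = 512  # assumption: dit32 counts 512-byte sectors


def decode_stamp(stamp):
    """Decode a 'dirSet dirNo fileNo elementNo sectorOffset' stamp.

    Assumes every field is hexadecimal, as the log output suggests.
    """
    dir_set, dir_no, file_no, element_no, sector = (
        int(field, 16) for field in stamp.split()
    )
    return {
        # Assumed naming scheme, inferred from "s6d51f37" in the log.
        "file": f"s{dir_set}d{dir_no}f{file_no}",
        "element": element_no,
        "byte_offset": sector * SECTOR_SIZE,
    }


# Values copied from the quoted log.
expected = decode_stamp("06 33 25 01 0240")
actual = decode_stamp("05 08 2d 01 0240")
miscompare_offset = 0x00048008
request_offset = 0x00043000
request_size = 0x00008600

# Position of the miscompare inside its sector and inside the read request.
offset_in_sector = miscompare_offset - expected["byte_offset"]
in_request = request_offset <= miscompare_offset < request_offset + request_size

print(expected)
print(actual)
print(hex(offset_in_sector), in_request)
```

If those assumptions hold, the expected stamp decodes to s6d51f37 at byte
offset 0x48000, and the miscompare is reported 8 bytes into that sector. The
stamp actually read would belong to a different file (s5d8f45) at the same
sector offset, which might mean a whole sector or more came from the wrong
place rather than a few bytes being flipped. That is only a guess from one
log line.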