(root cause found... maybe not) 64k Page size + ext3 errors

tirumalareddy marri <tirumalareddymarri@xxxxxxxxx> · Fri, 1 Aug 2008 17:07:53 -0700 (PDT)

After a lot of debugging and dumping of file system information, I found that the superblock is being corrupted during SATA DMA transfers. I am using a PCI-E based SATA card to attach the hard disks. It looks like SATA DMA is stressed much more with a 64k page size than with a 4k page size. I tried another SATA card that is more stable (it does not use libata), and it worked fine with RAID-5 and a 64k page size.
I used a small C program to create a 2GB file, read it back, and check the data for consistency. So far no errors have been found. I also ran the IO Meter test, which worked fine too.
Thank you all very much for the suggestions and responses.
Regards,
Marri

----- Original Message ----
From: Roger Heflin <rogerheflin@xxxxxxxxx>
To: tirumalareddy marri <tirumalareddymarri@xxxxxxxxx>
Cc: linux-raid@xxxxxxxxxxxxxxx
Sent: Monday, July 28, 2008 5:33:34 PM
Subject: Re: 64k Page size + ext3 errors

tirumalareddy marri wrote:
> Hi Roger,
>    I did a sync after I copied the 128MB of data. Shouldn't that guarantee the data is flushed to disk? I am using the "sum" command to check whether the data file copied to disk is valid.

It means it will be flushed to disk; it does not mean that when you read it back
it will come off disk. If it is still in memory then it will come out of
memory, and could still be wrong on disk. If you don't want to do a more
complicated test, it might be best to create the file, csum it, and if it is ok,
umount the device, remount it, and csum it again. This should at least force it
to come off of disk again.
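
As a rough illustration of the checksum-before-and-after-remount idea, here is a minimal Python sketch. It is an assumption of how such a check could be scripted, not the test anyone in this thread ran: the thread used the "sum" command, while this sketch swaps in SHA-256 from Python's hashlib, and the function names file_sha256 and matches are made up for the example. The umount and mount steps are still done by hand between the two runs.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files never need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches(path, expected_hex):
    """True if the file's current digest equals a previously recorded one."""
    return file_sha256(path) == expected_hex
```

One way to use it: record file_sha256() of the source file before copying, copy it to the array, umount and remount the file system, then call matches() on the copy with the recorded digest. A mismatch after the remount, when the first check passed, would suggest the data on disk differs from what was in cache.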

How much memory does your test machine have?

> Here is more information.
> Setup: Created a 30GB /dev/md0 and created an ext3 file system on it. Then started a Samba server to export the mounted /dev/md0 to a Windows machine to run IO and copy files.
> 4K Page size:
> -------------------
> 1. IO Meter Test: Works just fine.

None of the benchmarks I am familiar with actually confirm that the data is
good. The only way one of the benchmarks will fail is if the file table gets
corrupted, and they may run entirely in cache.

> 2. Copied a 1.8 GB file and the checksum is good.
> 3. Performance is not good because of small page size.
> 16k Page size:
> ---------------------
> 1. RAID-5 sometimes fails with "Attempt to access beyond the end of device"
> 2. Copied 128MB and 385MB files. Checked the checksums and they match.
> 3. Copied a 1.8 GB file; this failed the checksum test using the "sum" command. I see "EXT3-fs errors".
> 64K Page size:
> ----------------------
> 1. RAID-5 sometimes fails with "Attempt to access beyond the end of device"
> 2. Able to copy 128MB of data, and the checksum test passed.
> 3. Copying the 385MB and 1.8 GB files produces EXT3-fs errors.
> Thanks,
> Marri

I would write a specific pattern directly to the /dev/mdx device (a stream of
binary numbers from 1 ... whatever works fine), and then read that back and see
how things match or don't. csum *can* fail, and if you have enough memory then
any corruption actually on disk *WON'T* be found until something causes it to be
ejected from cache and later re-read from disk.
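
The pattern test described above could be sketched roughly as follows in Python. This is only an assumed illustration: the 64 KiB block size, the little-endian 64-bit counter format, and the names pattern_block, write_pattern and verify_pattern are choices made for this example, not something specified in the thread. Writing to a real device overwrites whatever is on it, and the read-back only proves anything about the disk if the data is no longer in cache (for example after a remount or reboot).

```python
import os
import struct

BLOCK = 64 * 1024  # bytes per I/O; an arbitrary choice for this sketch


def pattern_block(index, size=BLOCK):
    """Return block number `index` of a stream of 64-bit counters 1, 2, 3, ..."""
    per_block = size // 8
    start = index * per_block + 1
    return struct.pack("<%dQ" % per_block, *range(start, start + per_block))


def write_pattern(path, blocks):
    """Write `blocks` pattern blocks to `path` and fsync them."""
    # "r+b" avoids truncation and works for an existing device node;
    # "wb" creates a regular file when trying the sketch on a scratch path.
    mode = "r+b" if os.path.exists(path) else "wb"
    with open(path, mode) as f:
        for i in range(blocks):
            f.write(pattern_block(i))
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path, blocks):
    """Return the indices of blocks that do not match the expected pattern."""
    bad = []
    with open(path, "rb") as f:
        for i in range(blocks):
            if f.read(BLOCK) != pattern_block(i):
                bad.append(i)
    return bad
```

Because every 8-byte word encodes its own position in the stream, a mismatching block index points at the offset where the corruption sits, which a plain checksum cannot do.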

                              Roger
