Re: sata sil3114 vs. certain seagate drives results in filesystem corruptions

Soeren Sonnenburg <kernel@xxxxxx> · Tue, 23 Oct 2007 08:57:45 +0200

On Mon, 2007-10-22 at 12:59 +0200, Bernd Schubert wrote: 
> On Monday 22 October 2007 12:36:32 Soeren Sonnenburg wrote:
> > On Mon, 2007-10-22 at 11:48 +0200, Bernd Schubert wrote:
> > > Hello,
> > >
> > > On Monday 22 October 2007 04:12:44 Tejun Heo wrote:
> > > > Hello,
> > > > [...]
> > > >
> > > > > Now when I write large files of zeros to root(sda&sdb) and read the
> > > > > file back in it contains a few nonzero entries:
> > > > >
> > > > > # dd if=/dev/zero of=/foo bs=1M count=2000
> > > > > # hexdump /foo
> > > > > 0000000 0000 0000 0000 0000 0000 0000 0000 0000
> > > > > *
> > > > > <after >1GB random parts, within large blocks of zeroes>
> > > > >
> > > > > I can reliably trigger this on the md0 / devmapper-root setup when I
> > > > > write about 2GB of data (note that this machine has 1.5G of memory -
> > > > > and still 1GB is often enough to see this problem). Here it does not
> > > > > matter where in the filesystem I do these writes.
> > >
> > > That's almost the same test I always run, except that I do not write only
> > > 2GB,
> >
> > Well when I read your mail I thought that I could be seeing exactly the
> > same bug... it still may be. However ``my'' problem does not go away
> > with the mod15fix ...
> 
> Yeah, pity it did not fix it :( I will try to port Tejun's patch
> (http://home-tj.org/wiki/index.php/Sil_m15w#Patches) to 2.6.23 today or 
> tomorrow. If you are testing anyway, could you then also try this?

Hmmhh, dmesg said the m15 fix was turned on (at least it appeared for
the 2 drives in question in dmesg), so I fear it is something different.
On the other hand this is a 'production' machine so I am not too eager
to try very experimental things...

> > > but as much as it fits onto the disk. On reading back this file, the
> > > filesystem will report errors somewhere between 50GB and 230GB (disk size
> > > is 250GB).
> >
> > Wow, I really see lots of corruptions (well, every 1-2 GB a couple of
> > bytes are corrupted). Are you getting similarly many in the 50G - 230G
> > region?
> >
> > > > Thanks.  I'll try to reproduce the problem here.  What's your
> > > > motherboard?
> > >
> > > All tested S2882 boards here.
> >
> > I assume all equipped with lots of memory and mostly empty pci slots?
> 
> Yes, all pci-slots are free and the systems have between 4 and 16GB memory
> (ecc, monitored with edac). Well, those are cluster systems (actually tyan 
> names those B2882).
> Do you think the configuration is related? Here it also happens with O_DIRECT,
> we tested this to minimize memory effects.

Mine is just an a7v8x with a via KT400 chipset... really old, but several
of the pci slots are filled, so the problem may be more likely to happen
here... on the other hand I never tried writing 50-250G to
the drives I considered OK. Will do. What could also be helpful is to check
whether we both see patterns in the corruptions, like corruptions always being
512 bytes long or so (IIRC in my case they were only up to 64 bytes).
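
To compare such patterns it may help to list the offset and length of each
corrupted run instead of eyeballing hexdump output. Below is a minimal
sketch in Python, assuming the test file was written purely from /dev/zero
so that any nonzero byte counts as corruption. The function name
find_nonzero_runs and the chunk_size parameter are made up for illustration,
and a corrupted region that itself contains zero bytes will show up as
several shorter runs.

```python
def find_nonzero_runs(path, chunk_size=1 << 20):
    """Return a list of (offset, length) for each run of nonzero bytes in path.

    Assumes the file is expected to contain only zero bytes.
    """
    runs = []
    run_start = None  # file offset where the current nonzero run began
    offset = 0        # file offset of the start of the current chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            if chunk.count(0) == len(chunk):
                # Fast path: an all-zero chunk can only end an open run.
                if run_start is not None:
                    runs.append((run_start, offset - run_start))
                    run_start = None
            else:
                for i, byte in enumerate(chunk):
                    if byte != 0 and run_start is None:
                        run_start = offset + i
                    elif byte == 0 and run_start is not None:
                        runs.append((run_start, offset + i - run_start))
                        run_start = None
            offset += len(chunk)
    # A run that extends to the end of the file is closed here.
    if run_start is not None:
        runs.append((run_start, offset - run_start))
    return runs
```

Calling find_nonzero_runs("/foo") after the dd test and printing each pair
would show whether the runs cluster around one length (say 64 or 512 bytes)
and whether their offsets share an alignment.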

Soeren