Re: PROBLEM: write to jbod with 3TB and 160GB drives hits BUG/oops

On Fri, 24 Apr 2015 13:11:06 -0700 Charles Bertsch <cbertsch@xxxxxxx> wrote:

> On 04/23/2015 06:55 PM, NeilBrown wrote:
> >
> > By "jbod" I assume you mean "linear array".
> >
> > You say this happens without any filesystem on the array, yet the stack
> > traces clearly show ext2 in use.
> > Maybe some weird interaction is happening between the filesystem and the
> > linear array.
> > But please confirm that the stack trace happened when there was no filesystem
> > on the array you were testing, and report which filesystems you do have
> > that use ext2.
> >
> Neil --
> Yes, I do mean linear array.
> 
> At the point of the stack trace, there was no file-system on the linear 
> 2-drive array.  The test-jbod-2 script would create the array and then 
> write directly to /dev/md0.  Any evidence of previous existence of a 
> file-system would have been obliterated by earlier runs copying 
> /dev/zero everywhere.
> 
> The file-systems in use --
> -- The rootfs is an initrd file, squashfs, and mounted read-only.
> -- An ext3 for configuration and logs is mounted RW on /flash
> -- An ext2 using 8MB of RAM is mounted RW on /var
> -- The file-server is derived from a much earlier design that required 
> some RW directories within the root.  These entries appear in the mount 
> command as ext2, but are part of /var (and not separate file systems) --
> -- mount --bind /var/hd /hd
> -- mount --bind /var/home /home
> 
> -- A devtmpfs mounted on /dev, tmpfs on /dev/shm, proc on /proc, sysfs 
> on /sys, and another mount --bind from within /flash for nfs.
> 
> # mount
> /dev/root on / type squashfs (ro,relatime)
> devtmpfs on /dev type devtmpfs 
> (rw,relatime,size=1002600k,nr_inodes=250650,mode=755)
> proc on /proc type proc (rw,relatime)
> sysfs on /sys type sysfs (rw,relatime)
> /dev/ram1 on /var type ext2 (rw,relatime,errors=continue)
> /dev/ram1 on /hd type ext2 (rw,relatime,errors=continue)
> /dev/ram1 on /home type ext2 (rw,relatime,errors=continue)
> tmpfs on /dev/shm type tmpfs (rw,relatime)
> /dev/sdb1 on /flash type ext3 
> (rw,noatime,errors=continue,commit=60,barrier=1,data=ordered)
> /dev/sdb1 on /var/lib/nfs type ext3 
> (rw,noatime,errors=continue,commit=60,barrier=1,data=ordered)
> nfsd on /proc/fs/nfsd type nfsd (rw,relatime)
> #

Thanks for the details.
On the whole, I don't think it is likely that your problem is directly
related to md - more likely just a coincidence that it happened while you
were using md.  But one never knows until the actual cause is found.

> 
>  > Is there any chance you could use "git bisect" to find out exactly which
>  > commit introduced the problem?  That is the most likely path to a
>  > solution.
>  >
> 
> 
> I am not familiar with "git bisect".  Would this be similar to 
> downloading a series of kernel releases from linux-3.3.5 up to 3.18.5 
> using a binary search to find which release (rather than which commit) 
> has the problem ?

Similar, but (some of) the boring work is all done for you.

It would be best to stick to mainline kernels for testing.  i.e. just '3.x',
not '3.x.y'.

So presumably 3.3 works, and 3.18 fails.
In that case:

   git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux
   cd linux
   git bisect start
   git bisect good v3.3
   git bisect bad v3.18

That should get you started, except that bisecting the whole range from
v3.3 to v3.18 takes an incredibly long time.  So probably do the first few
steps by hand, e.g.

   git checkout v3.10

and test that.  Then try v3.7 or v3.14.

Once you know which of those are good or bad, run e.g.
  git bisect start
  git bisect good v3.7
  git bisect bad v3.10

and that will checkout a kernel somewhere in the middle and tell you there
are 14 (or so) steps to go.
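
The "14 (or so) steps" figure is just binary-search arithmetic: each
build-and-test halves the remaining range, so the step count is roughly
log2 of the number of commits between the good and bad tags.  A quick
sketch (the 16000 commit count is a made-up, ballpark figure):

```shell
# Rough estimate of the step count git bisect reports:
# ceil(log2(commit count)).  16000 is a hypothetical number of
# commits between two releases, chosen only for illustration.
commits=16000
steps=0
n=$commits
while [ "$n" -gt 1 ]; do
    n=$(( (n + 1) / 2 ))    # each test halves the search range
    steps=$((steps + 1))
done
echo "$steps"               # ~14 bisect steps for 16000 commits
```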

Then build and test the kernel. If it is good, run "git bisect good".
If bad, "git bisect bad".

If you can persist through testing over a dozen kernels (takes some
patience!!) it should lead you to the commit that introduced the problem.
It is always best to be cautious before declaring a kernel 'good' - run the
test a few times.
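
If it helps to see what the tool is doing, git bisect is essentially a
binary search over the commit range, with your build-and-test standing in
for the comparison.  A toy sketch of the same search over release numbers
instead of commits (FIRST_BAD and test_version are purely illustrative
stand-ins for a real build-and-test cycle):

```shell
#!/bin/sh
# Toy illustration of what "git bisect" automates: a binary search
# over an ordered range, here minor versions 3..18 standing in for
# commits.  FIRST_BAD is a hypothetical version that introduced the
# bug; test_version simulates building and testing that kernel.
FIRST_BAD=9

test_version() {
    # exit 0 (good) for versions before FIRST_BAD, non-zero (bad) after
    [ "$1" -lt "$FIRST_BAD" ]
}

good=3    # known-good: v3.3
bad=18    # known-bad:  v3.18
while [ $((bad - good)) -gt 1 ]; do
    mid=$(( (good + bad) / 2 ))
    if test_version "$mid"; then
        good=$mid      # like running "git bisect good"
    else
        bad=$mid       # like running "git bisect bad"
    fi
done
echo "first bad version: v3.$bad"   # prints v3.9 here
```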

> 
> Thanks
> 
> Charles Bertsch


NeilBrown


