Hi Alex,
is this issue what you are referring to?
http://tracker.newdream.net/issues/2260

We will give the patch a try and see if it resolves the issue.

Best Regards.
Chris.

On Tue, Sep 25, 2012 at 11:38 AM, Alex Elder <elder@xxxxxxxxxxx> wrote:
> On 09/24/2012 08:25 PM, Christian Huang wrote:
>> Hi Alex,
>> we have used several kernel versions, some built from source,
>> some stock kernel, from ubuntu repository.
>>
>> for the version you are referring to, we used a stock kernel from
>> ubuntu repository.
>>
>> for building from source, we follow instructions from this page
>> http://blog.avirtualhome.com/compile-linux-kernel-3-2-for-ubuntu-11-10/
>> and use the following tag from precise git repo.
>> Ubuntu-3.2.0-29.46
>
> These two bits of information:
>
>> please also note that we reproduced the issue with kernel 3.5.4
>> from kernel ppa
>> http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5.4-quantal/
>>
>> it seems the following version does not have the issue
>> http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.6-rc7-quantal/
>
> ...are very helpful.
>
> There is a very important bug that got fixed between those two
> releases, and it has symptoms like what you are reporting.
> I can't say with 100% confidence that you are hitting this, but
> it appears you could be.
>
> The fix is very simple, and you should be able to patch your own
> code to check to see if it makes the problem go away. If you
> do, please report back whether you find it fixes the problem.
>
> Tomorrow I'll see if I can trace the particulars of the problem
> you are reporting to this issue.
>
> -Alex
>
> From 02f7c002c9af475df6b2a1b64066bcdaf53cb7dc Mon Sep 17 00:00:00 2001
> From: "Yan, Zheng" <zheng.z.yan@xxxxxxxxx>
> Date: Wed, 6 Jun 2012 19:35:55 -0500
> Subject: [PATCH] rbd: Clear ceph_msg->bio_iter for retransmitted message
>
> The bug can cause NULL pointer dereference in write_partial_msg_pages
>
> Signed-off-by: Zheng Yan <zheng.z.yan@xxxxxxxxx>
> Reviewed-by: Alex Elder <elder@xxxxxxxxxxx>
> (cherry picked from commit 43643528cce60ca184fe8197efa8e8da7c89a037)
> ---
>  net/ceph/messenger.c |    4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index f0e34ff..d372b34 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -563,6 +563,10 @@ static void prepare_write_message(struct ceph_connection *con)
>  		m->hdr.seq = cpu_to_le64(++con->out_seq);
>  		m->needs_out_seq = false;
>  	}
> +#ifdef CONFIG_BLOCK
> +	else
> +		m->bio_iter = NULL;
> +#endif
>
>  	dout("prepare_write_message %p seq %lld type %d len %d+%d+%d %d pgs\n",
>  	     m, con->out_seq, le16_to_cpu(m->hdr.type),
> --
> 1.7.9.5
>
>
>> Best Regards.
>> Chris.
>> On Tue, Sep 25, 2012 at 6:59 AM, Alex Elder <elder@xxxxxxxxxxx> wrote:
>>> On 09/24/2012 05:23 AM, Christian Huang wrote:
>>>> Hi,
>>>> we met the following issue while testing ceph cluster HA.
>>>> Appreciate if anyone can shed some light.
>>>> could this be related to the configuration ? (ie, 2 OSD nodes only)
>>>
>>> It appears to me the kernel that was in use for the crash logs
>>> you provided was built from source. If that is the case, are you
>>> able to provide me with the precise commit id so I am sure to
>>> be working with the right code?
>>>
>>> Here is a line that leads me to that conclusion:
>>>
>>> [ 203.172114] Pid: 1901, comm: kworker/0:2 Not tainted 3.2.0-29-generic
>>> #46-Ubuntu Wistron Cloud Computing/P92TB2
>>>
>>> If you wish I would be happy to work with one of the other versions
>>> of the code, but would prefer to also have crash information that
>>> matches the source code I'm looking at. Thank you.
>>>
>>> -Alex
>>>
>>>
>>>> Issue description:
>>>> ceph rbd client will kernel panic if an OSD server loses its
>>>> network connectivity.
>>>> so far, we can reproduce it with certainty.
>>>> we have tried with the following kernels
>>>> a. Stock kernel from 12.04 (3.2 series)
>>>> 3.5 series, as suggested in a previous mail by Sage
>>>> b. 3.5.0-15 from quantal repo,
>>>> git://kernel.ubuntu.com/ubuntu/ubuntu-quantal.git, Ubuntu-3.5.0-15.22
>>>> tag
>>>> c. v3.5.4-quantal,
>>>> http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5.4-quantal/
>>>>
>>>> Environment:
>>>> OS: Ubuntu 12.04 precise pangolin
>>>> Ceph configuration:
>>>> OSD nodes: 2 x 12 drives , 1 os drive, 11 are mapped to OSD
>>>> 0-10, 10GbE link
>>>> Monitor nodes: 3 x KVM virtual machines on ubuntu host.
>>>> test client: fresh install of Ubuntu 12.04.1
>>>> Ceph version used: 0.48, 0.48.1, 0.48.2, 0.51
>>>> all nodes have the same kernel version.
>>>>
>>>> steps to reproduce:
>>>> on the test client,
>>>> 1. load rbd modules
>>>> 2. create rbd device
>>>> 3. map rbd device
>>>> 4. use fio tool to create work load on the device, 8 threads are
>>>> used for workload
>>>> we have also tried with iometer, 8 workers, 32k 50/50, same results.
>>>>
>>>> on one of the OSD nodes,
>>>> 1. sudo ifconfig eth0 down #where eth0 is the primary interface
>>>> configured for ceph.
>>>> 2. within 30 seconds, the test client will panic.
>>>>
>>>> this happens when there is IO activity on the RBD device, and one
>>>> of the OSD nodes loses connectivity.
>>>>
>>>> The netconsole output is available from the following
>>>> dropbox link,
>>>> zip: goo.gl/LHytr
>>>>
>>>> Best Regards
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>>
>