Hi Alex,

[resend]
Some updates on the patch: unfortunately, the issue is still reproducible
after the patch is applied to 3.2.0-30.48 of the precise tree
git://kernel.ubuntu.com/ubuntu/ubuntu-precise.git

We also found that the patch was already included in Ubuntu-3.5.0-15.22,
from the quantal tree at the following URL
git://kernel.ubuntu.com/ubuntu/ubuntu-quantal.git
This kernel had the same issue.

Best Regards.
Chris

On Tue, Sep 25, 2012 at 2:09 PM, Christian Huang <ythuang@xxxxxxxxx> wrote:
> Hi Alex,
>
> some additional info on the verification we did on 3.6-rc7
> we used Ubuntu 12.10 as base OS
>
> 1. setup a 2 OSD cluster
> 2. setup a rbd test client
> 3. setup a netconsole monitoring node
>
> on one of the OSD nodes
>   a. setup a cronjob to shutdown network every 4 minutes and restart
>      it 1 minute later.
>
> on the test client
>   a. setup netconsole to redirect log to monitoring node
>   b. run the following commands in loop, continuously
>      fio --iodepth=32 --numjobs=8 --runtime=120 --ioengine=libaio
>      --group_reporting --direct=1 --eta=always --name=job --bs=65536
>      --rw=100 --filename=/dev/rbd0
>      fio --iodepth=32 --numjobs=8 --runtime=120 --ioengine=libaio
>      --group_reporting --direct=1 --eta=always --name=job --bs=65536 --rw=0
>      --filename=/dev/rbd0
>
> we have run this for around 5 hours, 53 iterations, with no panics.
>
> crontab entry
> * * * * * root /path/to/cronjob
> === cron job ===
> #!/bin/bash
>
> if [ $[`date +%M` % 4] == 0 ]
> then
>     echo 'network stop'
>     ifconfig eth0 down
> else
>     echo 'network start'
>     ifconfig eth0 up
> fi
> === cron job ===
>
> === fio installation ===
> apt-get install -y libaio*
> git clone git://git.kernel.dk/fio.git
> cd fio
> git checkout fio-2.0.3
> make
> sudo make install
>
> On Tue, Sep 25, 2012 at 12:33 PM, Christian Huang <ythuang@xxxxxxxxx> wrote:
>> Hi Alex,
>> is this issue what you are referring to?
>> http://tracker.newdream.net/issues/2260
>>
>> we will give the patch a try and see if it resolves the issue.
>>
>> Best Regards.
>> Chris.
>>
>> On Tue, Sep 25, 2012 at 11:38 AM, Alex Elder <elder@xxxxxxxxxxx> wrote:
>>> On 09/24/2012 08:25 PM, Christian Huang wrote:
>>>> Hi Alex,
>>>> we have used several kernel versions, some built from source,
>>>> some stock kernel, from ubuntu repository.
>>>>
>>>> for the version you are referring to, we used a stock kernel from
>>>> ubuntu repository.
>>>>
>>>> for building from source, we follow instructions from this page
>>>> http://blog.avirtualhome.com/compile-linux-kernel-3-2-for-ubuntu-11-10/
>>>> and use the following tag from precise git repo.
>>>> Ubuntu-3.2.0-29.46
>>>
>>> These two bits of information:
>>>
>>>> please also note that we reproduced the issue with kernel 3.5.4
>>>> from kernel ppa
>>>> http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5.4-quantal/
>>>>
>>>> it seems the following version does not have the issue
>>>> http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.6-rc7-quantal/
>>>
>>> ...are very helpful.
>>>
>>> There is a very important bug that got fixed between those two
>>> releases, and it has symptoms like what you are reporting.
>>> I can't say with 100% confidence that you are hitting this, but
>>> it appears you could be.
>>>
>>> The fix is very simple, and you should be able to patch your own
>>> code to check to see if it makes the problem go away. If you
>>> do, please report back whether you find it fixes the problem.
>>>
>>> Tomorrow I'll see if I can trace the particulars of the problem
>>> you are reporting to this issue.
>>>
>>> -Alex
>>>
>>> From 02f7c002c9af475df6b2a1b64066bcdaf53cb7dc Mon Sep 17 00:00:00 2001
>>> From: "Yan, Zheng" <zheng.z.yan@xxxxxxxxx>
>>> Date: Wed, 6 Jun 2012 19:35:55 -0500
>>> Subject: [PATCH] rbd: Clear ceph_msg->bio_iter for retransmitted message
>>>
>>> The bug can cause NULL pointer dereference in write_partial_msg_pages
>>>
>>> Signed-off-by: Zheng Yan <zheng.z.yan@xxxxxxxxx>
>>> Reviewed-by: Alex Elder <elder@xxxxxxxxxxx>
>>> (cherry picked from commit 43643528cce60ca184fe8197efa8e8da7c89a037)
>>> ---
>>>  net/ceph/messenger.c | 4 ++++
>>>  1 file changed, 4 insertions(+)
>>>
>>> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
>>> index f0e34ff..d372b34 100644
>>> --- a/net/ceph/messenger.c
>>> +++ b/net/ceph/messenger.c
>>> @@ -563,6 +563,10 @@ static void prepare_write_message(struct ceph_connection *con)
>>>  		m->hdr.seq = cpu_to_le64(++con->out_seq);
>>>  		m->needs_out_seq = false;
>>>  	}
>>> +#ifdef CONFIG_BLOCK
>>> +	else
>>> +		m->bio_iter = NULL;
>>> +#endif
>>>
>>>  	dout("prepare_write_message %p seq %lld type %d len %d+%d+%d %d pgs\n",
>>>  	     m, con->out_seq, le16_to_cpu(m->hdr.type),
>>> --
>>> 1.7.9.5
>>>
>>>
>>>> Best Regards.
>>>> Chris.
>>>> On Tue, Sep 25, 2012 at 6:59 AM, Alex Elder <elder@xxxxxxxxxxx> wrote:
>>>>> On 09/24/2012 05:23 AM, Christian Huang wrote:
>>>>>> Hi,
>>>>>> we met the following issue while testing ceph cluster HA.
>>>>>> Appreciate if anyone can shed some light.
>>>>>> could this be related to the configuration ? (ie, 2 OSD nodes only)
>>>>>
>>>>> It appears to me the kernel that was in use for the crash logs
>>>>> you provided was built from source. If that is the case, are you
>>>>> able to provide me with the precise commit id so I am sure to
>>>>> be working with the right code?
>>>>>
>>>>> Here is a line that leads me to that conclusion:
>>>>>
>>>>> [ 203.172114] Pid: 1901, comm: kworker/0:2 Not tainted 3.2.0-29-generic
>>>>> #46-Ubuntu Wistron Cloud Computing/P92TB2
>>>>>
>>>>> If you wish I would be happy to work with one of the other versions
>>>>> of the code, but would prefer to also have crash information that
>>>>> matches the source code I'm looking at. Thank you.
>>>>>
>>>>> -Alex
>>>>>
>>>>>
>>>>>> Issue description:
>>>>>> ceph rbd client will kernel panic if an OSD server loses its
>>>>>> network connectivity.
>>>>>> so far, we can reproduce it with certainty.
>>>>>> we have tried with the following kernels
>>>>>>   a. Stock kernel from 12.04 (3.2 series)
>>>>>>   3.5 series, as suggested in a previous mail by Sage
>>>>>>   b. 3.5.0-15 from quantal repo,
>>>>>>      git://kernel.ubuntu.com/ubuntu/ubuntu-quantal.git, Ubuntu-3.5.0-15.22
>>>>>>      tag
>>>>>>   c. v3.5.4-quantal,
>>>>>>      http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5.4-quantal/
>>>>>>
>>>>>> Environment:
>>>>>>   OS: Ubuntu 12.04 precise pangolin
>>>>>>   Ceph configuration:
>>>>>>     OSD nodes: 2 x 12 drives, 1 os drive, 11 are mapped to OSD
>>>>>>     0-10, 10GbE link
>>>>>>     Monitor nodes: 3 x KVM virtual machines on ubuntu host.
>>>>>>     test client: fresh install of Ubuntu 12.04.1
>>>>>>     Ceph version used: 0.48, 0.48.1, 0.48.2, 0.51
>>>>>>     all nodes have the same kernel version.
>>>>>>
>>>>>> steps to reproduce:
>>>>>> on the test client,
>>>>>>   1. load rbd modules
>>>>>>   2. create rbd device
>>>>>>   3. map rbd device
>>>>>>   4. use fio tool to create workload on the device, 8 threads are
>>>>>>      used for workload
>>>>>>   we have also tried with iometer, 8 workers, 32k 50/50, same results.
>>>>>>
>>>>>> on one of the OSD nodes,
>>>>>>   1. sudo ifconfig eth0 down #where eth0 is the primary interface
>>>>>>      configured for ceph.
>>>>>>   2. within 30 seconds, the test client will panic.
>>>>>>
>>>>>> this happens when there is IO activity on the RBD device, and one
>>>>>> of the OSD nodes loses connectivity.
>>>>>>
>>>>>> The netconsole output is available from the following
>>>>>> dropbox link,
>>>>>> zip: goo.gl/LHytr
>>>>>>
>>>>>> Best Regards
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
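
A note on the cron job quoted in this thread: "date +%M" prints a
zero-padded minute, and bash arithmetic reads a leading zero as octal, so
at minutes 08 and 09 the "$[... % 4]" test stops with a "value too great
for base" error instead of toggling the interface. What follows is only a
sketch of one way around that, not the script that was actually run.
minute_action is a hypothetical helper name introduced here for
illustration, and the eth0 usage shown in the comments simply mirrors the
original script.

```shell
#!/bin/bash
# Sketch only: decide what to do for a given minute-of-hour string.
# minute_action is a hypothetical helper, not part of the original cron job.
minute_action() {
    # Strip one leading zero so "08" and "09" are not read as invalid octal.
    minute=${1#0}
    # An input of "0" becomes empty after stripping, so default it to 0.
    if [ $(( ${minute:-0} % 4 )) -eq 0 ]; then
        echo stop
    else
        echo start
    fi
}

# Possible use inside the cron job (interface name as in the original):
#   case "$(minute_action "$(date +%M)")" in
#       stop)  echo 'network stop';  ifconfig eth0 down ;;
#       start) echo 'network start'; ifconfig eth0 up ;;
#   esac
```

Keeping the decision in a function that only prints "stop" or "start"
means the minute handling can be checked by hand for every value from 00
to 59 without touching the network interface.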