Hi Alex,

some additional info on the verification we did on 3.6-rc7.
We used Ubuntu 12.10 as the base OS.

1. set up a 2-OSD cluster
2. set up an rbd test client
3. set up a netconsole monitoring node

on one of the OSD nodes:
  a. set up a cron job that shuts down the network every 4 minutes and
     restarts it 1 minute later

on the test client:
  a. set up netconsole to redirect the kernel log to the monitoring node
  b. run the following commands in a loop, continuously (a driver sketch
     follows the installation notes below):

fio --iodepth=32 --numjobs=8 --runtime=120 --ioengine=libaio --group_reporting --direct=1 --eta=always --name=job --bs=65536 --rw=100 --filename=/dev/rbd0
fio --iodepth=32 --numjobs=8 --runtime=120 --ioengine=libaio --group_reporting --direct=1 --eta=always --name=job --bs=65536 --rw=0 --filename=/dev/rbd0

we ran this for around 5 hours (53 iterations) with no panics.

crontab entry:

* * * * * root /path/to/cronjob

=== cron job ===
#!/bin/bash
# Runs once a minute: take eth0 down on minutes divisible by 4 and
# bring it back up on every other minute, so the link is down for
# roughly 1 minute out of every 4.
if [ $((`date +%M` % 4)) -eq 0 ]
then
    echo 'network stop'
    ifconfig eth0 down
else
    echo 'network start'
    ifconfig eth0 up
fi
=== cron job ===

=== fio installation ===
apt-get install -y libaio*
git clone git://git.kernel.dk/fio.git
cd fio
git checkout fio-2.0.3
make
sudo make install
=== fio installation ===
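For reference, the two client-side pieces (netconsole plus the fio loop)
can be wired up roughly as below. This is only a sketch: the addresses,
ports, MAC, and interface name in the netconsole line are illustrative,
not the actual values from our setup.

=== test client sketch ===
#!/bin/bash
# (a) netconsole: forward kernel messages to the monitoring node.
# Parameter format: [src-port]@[src-ip]/[dev],[tgt-port]@[tgt-ip]/[tgt-mac]
modprobe netconsole netconsole=6665@192.168.0.10/eth0,6666@192.168.0.20/00:11:22:33:44:55

# (b) run the two fio jobs back to back, forever, counting iterations.
i=0
while true; do
    i=$((i + 1))
    echo "iteration $i"
    fio --iodepth=32 --numjobs=8 --runtime=120 --ioengine=libaio \
        --group_reporting --direct=1 --eta=always --name=job \
        --bs=65536 --rw=100 --filename=/dev/rbd0
    fio --iodepth=32 --numjobs=8 --runtime=120 --ioengine=libaio \
        --group_reporting --direct=1 --eta=always --name=job \
        --bs=65536 --rw=0 --filename=/dev/rbd0
done
=== test client sketch ===

On Tue, Sep 25, 2012 at 12:33 PM, Christian Huang <ythuang@xxxxxxxxx> wrote:
> Hi Alex,
> is this the issue you are referring to?
> http://tracker.newdream.net/issues/2260
>
> we will give the patch a try and see if it resolves the issue.
>
> Best Regards.
> Chris.
>
> On Tue, Sep 25, 2012 at 11:38 AM, Alex Elder <elder@xxxxxxxxxxx> wrote:
>> On 09/24/2012 08:25 PM, Christian Huang wrote:
>>> Hi Alex,
>>> we have used several kernel versions, some built from source,
>>> some stock kernels from the ubuntu repository.
>>>
>>> for the version you are referring to, we used a stock kernel from
>>> the ubuntu repository.
>>>
>>> for building from source, we followed the instructions on this page
>>> http://blog.avirtualhome.com/compile-linux-kernel-3-2-for-ubuntu-11-10/
>>> and used the following tag from the precise git repo:
>>> Ubuntu-3.2.0-29.46
>>
>> These two bits of information:
>>
>>> please also note that we reproduced the issue with kernel 3.5.4
>>> from the kernel ppa
>>> http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5.4-quantal/
>>>
>>> it seems the following version does not have the issue
>>> http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.6-rc7-quantal/
>>
>> ...are very helpful.
>>
>> There is a very important bug that got fixed between those two
>> releases, and it has symptoms like what you are reporting.
>> I can't say with 100% confidence that you are hitting this, but
>> it appears you could be.
>>
>> The fix is very simple, and you should be able to patch your own
>> code to check whether it makes the problem go away. If you do,
>> please report back whether you find it fixes the problem.
>>
>> Tomorrow I'll see if I can trace the particulars of the problem
>> you are reporting to this issue.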
>>
>> -Alex
>>
>> From 02f7c002c9af475df6b2a1b64066bcdaf53cb7dc Mon Sep 17 00:00:00 2001
>> From: "Yan, Zheng" <zheng.z.yan@xxxxxxxxx>
>> Date: Wed, 6 Jun 2012 19:35:55 -0500
>> Subject: [PATCH] rbd: Clear ceph_msg->bio_iter for retransmitted message
>>
>> The bug can cause a NULL pointer dereference in write_partial_msg_pages.
>>
>> Signed-off-by: Zheng Yan <zheng.z.yan@xxxxxxxxx>
>> Reviewed-by: Alex Elder <elder@xxxxxxxxxxx>
>> (cherry picked from commit 43643528cce60ca184fe8197efa8e8da7c89a037)
>> ---
>>  net/ceph/messenger.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
>> index f0e34ff..d372b34 100644
>> --- a/net/ceph/messenger.c
>> +++ b/net/ceph/messenger.c
>> @@ -563,6 +563,10 @@ static void prepare_write_message(struct ceph_connection *con)
>>  		m->hdr.seq = cpu_to_le64(++con->out_seq);
>>  		m->needs_out_seq = false;
>>  	}
>> +#ifdef CONFIG_BLOCK
>> +	else
>> +		m->bio_iter = NULL;
>> +#endif
>>
>>  	dout("prepare_write_message %p seq %lld type %d len %d+%d+%d %d pgs\n",
>>  	     m, con->out_seq, le16_to_cpu(m->hdr.type),
>> --
>> 1.7.9.5
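(A note for anyone else trying this: a patch in mbox form like the one
above can be applied from the top of the kernel source tree roughly as
follows. The file name and checkout path are illustrative, not the ones
from our environment.)

=== applying the patch (sketch) ===
cd ~/src/ubuntu-precise                    # illustrative kernel checkout
patch -p1 --dry-run < bio_iter-fix.patch   # first verify that it applies
patch -p1 < bio_iter-fix.patch
# or, since it is a git-formatted patch:
#   git am bio_iter-fix.patch
make && sudo make modules_install install  # then rebuild as usual
=== applying the patch (sketch) ===

>>> Best Regards.
>>> Chris.
>>>
>>> On Tue, Sep 25, 2012 at 6:59 AM, Alex Elder <elder@xxxxxxxxxxx> wrote:
>>>> On 09/24/2012 05:23 AM, Christian Huang wrote:
>>>>> Hi,
>>>>> we met the following issue while testing ceph cluster HA.
>>>>> We'd appreciate it if anyone can shed some light.
>>>>> could this be related to the configuration? (i.e., only 2 OSD nodes)
>>>>
>>>> It appears to me the kernel that was in use for the crash logs
>>>> you provided was built from source. If that is the case, are you
>>>> able to provide me with the precise commit id so I am sure to
>>>> be working with the right code?
>>>>
>>>> Here is a line that leads me to that conclusion:
>>>>
>>>> [  203.172114] Pid: 1901, comm: kworker/0:2 Not tainted 3.2.0-29-generic
>>>> #46-Ubuntu Wistron Cloud Computing/P92TB2
>>>>
>>>> If you wish I would be happy to work with one of the other versions
>>>> of the code, but would prefer to also have crash information that
>>>> matches the source code I'm looking at. Thank you.
>>>>
>>>> -Alex
>>>>
>>>>> Issue description:
>>>>> the ceph rbd client will kernel panic if an OSD server loses its
>>>>> network connectivity.
>>>>> so far, we can reproduce it with certainty.
>>>>> we have tried the following kernels
>>>>> (including the 3.5 series, as suggested in a previous mail by Sage):
>>>>> a. stock kernel from 12.04 (3.2 series)
>>>>> b. 3.5.0-15 from the quantal repo,
>>>>> git://kernel.ubuntu.com/ubuntu/ubuntu-quantal.git, tag Ubuntu-3.5.0-15.22
>>>>> c. v3.5.4-quantal,
>>>>> http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5.4-quantal/
>>>>>
>>>>> Environment:
>>>>> OS: Ubuntu 12.04 precise pangolin
>>>>> Ceph configuration:
>>>>> OSD nodes: 2 x 12 drives (1 OS drive, 11 mapped to OSDs 0-10), 10GbE link
>>>>> Monitor nodes: 3 x KVM virtual machines on an ubuntu host
>>>>> test client: fresh install of Ubuntu 12.04.1
>>>>> Ceph versions used: 0.48, 0.48.1, 0.48.2, 0.51
>>>>> all nodes have the same kernel version.
>>>>>
>>>>> steps to reproduce:
>>>>> on the test client,
>>>>> 1. load the rbd module
>>>>> 2. create an rbd device
>>>>> 3. map the rbd device
>>>>> 4. use fio to create a workload on the device (8 threads);
>>>>>    we have also tried iometer (8 workers, 32k, 50/50) with the same results.
>>>>>
>>>>> on one of the OSD nodes,
>>>>> 1. sudo ifconfig eth0 down   # where eth0 is the primary interface configured for ceph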
>>>>> 2. within 30 seconds, the test client will panic.
>>>>>
>>>>> this happens when there is IO activity on the RBD device and one
>>>>> of the OSD nodes loses connectivity.
>>>>>
>>>>> The netconsole output is available from the following dropbox link,
>>>>> zip: goo.gl/LHytr
>>>>>
>>>>> Best Regards
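For completeness, the client-side reproduce steps quoted above map
roughly to the commands below. This is only a sketch: the pool name,
image name, and size are illustrative (on our setup the mapped device
showed up as /dev/rbd0), and the fio line is one of the two commands
from the top of this mail.

=== reproduce sketch ===
#!/bin/bash
modprobe rbd                                   # 1. load the rbd module
rbd create test-image --size 10240 --pool rbd  # 2. create a 10 GB image
rbd map test-image --pool rbd                  # 3. map it (-> /dev/rbd0)
# 4. generate load on the mapped device
fio --iodepth=32 --numjobs=8 --runtime=120 --ioengine=libaio \
    --group_reporting --direct=1 --eta=always --name=job \
    --bs=65536 --rw=100 --filename=/dev/rbd0
=== reproduce sketch ===

then, with the fio job running, take down the primary ceph interface on
one of the OSD nodes (sudo ifconfig eth0 down) and watch the client's
netconsole output.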