On Thu, 31 Jul 2014, James Eckersall wrote:
> Ah, thanks for the clarification on that. We are very close to the 250
> limit, so that is something we'll have to look at addressing, but I don't
> think it's actually relevant to the panics, since reverting the auth key
> changes I made appears to have resolved the issue (no panics yet - 20
> hours ish and counting).

My best guess is that the change in the capability changed the size of the
ticket and overran a buffer, corrupting the heap somewhere.  Can you share
what the content of the cap was both before and after?  (ceph auth dump,
but just the 'cap' lines... not the secret keys :)

Thanks!
sage

> Now to figure out the best way to get a 3.14 kernel in Ubuntu Trusty :)
>
> On 31 July 2014 10:23, Christian Balzer <chibi at gol.com> wrote:
> > On Thu, 31 Jul 2014 10:13:11 +0100 James Eckersall wrote:
> >
> > > Hi,
> > >
> > > I thought the limit was in relation to ceph and that 0.80+ fixed that
> > > limit - or at least raised it to 4096?
> > >
> > Yes and yes. But 0.80 only made it into kernels 3.14 and beyond. ^o^
> >
> > > If there is a 250 limit, can you confirm where this is documented?
> > >
> > In this very ML, see the "v0.75 released" thread:
> > ---
> > On Thu, 16 Jan 2014 15:51:17 +0200 Ilya Dryomov wrote:
> >
> > > On Wed, Jan 15, 2014 at 5:42 AM, Sage Weil <sage at inktank.com> wrote:
> > > >
> > > > [...]
> > > >
> > > > * rbd: support for 4096 mapped devices, up from ~250 (Ilya Dryomov)
> > >
> > > Just a note, v0.75 simply adds some of the infrastructure, the actual
> > > support for this will arrive with kernel 3.14.  The theoretical limit
> > > is 65536 mapped devices, although I admit I haven't tried mapping more
> > > than ~4000 at once.
> > ---
> >
> > Christian
> >
> > > Thanks
> > >
> > > J
> > >
> > > On 31 July 2014 09:50, Christian Balzer <chibi at gol.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > are you perchance approaching the maximum number of kernel mappings,
> > > > which is somewhat shy of 250 in any kernel below 3.14?
> > > >
> > > > If you can easily upgrade to 3.14 see if that fixes it.
> > > >
> > > > Christian
> > > >
> > > > On Thu, 31 Jul 2014 09:37:05 +0100 James Eckersall wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > The stacktraces are very similar.  Here is another one with
> > > > > complete dmesg: http://pastebin.com/g3X0pZ9E
> > > > >
> > > > > The rbd's are mapped by the rbdmap service on boot.
> > > > > All our ceph servers are running Ubuntu 14.04 (kernel
> > > > > 3.13.0-30-generic). Ceph packages are from the Ubuntu repos,
> > > > > version 0.80.1-0ubuntu1.1. I should probably have mentioned this
> > > > > info in the initial mail :)
> > > > >
> > > > > This problem also seemed to get gradually worse over time.
> > > > > We had a couple of sporadic crashes at the start of the week,
> > > > > escalating to the node being unable to stay up for more than a
> > > > > couple of minutes before panicking.
> > > > >
> > > > > Thanks
> > > > >
> > > > > J
> > > > >
> > > > > On 31 July 2014 09:12, Ilya Dryomov <ilya.dryomov at inktank.com> wrote:
> > > > > > On Thu, Jul 31, 2014 at 11:44 AM, James Eckersall
> > > > > > <james.eckersall at gmail.com> wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I've had a fun time with ceph this week.
> > > > > > > We have a cluster with 4 OSD servers (20 OSDs per), 3 mons and
> > > > > > > a server mapping ~200 rbd's and presenting cifs shares.
> > > > > > >
> > > > > > > We're using cephx and the export node has its own cephx auth
> > > > > > > key.
> > > > > > >
> > > > > > > I made a change to the key last week, adding rwx access to
> > > > > > > another pool.
> > > > > > >
> > > > > > > Since that point, we had sporadic kernel panics on the export
> > > > > > > node.
> > > > > > >
> > > > > > > It got to the point where it would barely finish booting up
> > > > > > > and would panic.
> > > > > > >
> > > > > > > Once I removed the extra pool I had added to the auth key, it
> > > > > > > hasn't crashed again.
> > > > > > >
> > > > > > > I'm a bit concerned that a change to an auth key can cause
> > > > > > > this type of crash.
> > > > > > > There were no log entries on the mon/osd/export node regarding
> > > > > > > the key at all, so it was only by searching my memory for what
> > > > > > > had changed that I was able to resolve the problem.
> > > > > > >
> > > > > > > From what I could tell, the format of the key was correct and
> > > > > > > the pool that I added did exist, so I am confused as to how
> > > > > > > this would have caused kernel panics.
> > > > > > >
> > > > > > > Below is an example of one of the crash stacktraces.
> > > > > > >
> > > > > > > [   32.713504] general protection fault: 0000 [#1] SMP
> > > > > > > [   32.724718] Modules linked in: ipt_REJECT xt_tcpudp
> > > > > > > iptable_filter ip_tables x_tables rbd libceph libcrc32c gpio_ich
> > > > > > > dcdbas intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp
> > > > > > > kvm_intel kvm crct10dif_pclmul joydev crc32_pclmul
> > > > > > > ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul
> > > > > > > glue_helper ablk_helper cryptd sb_edac edac_core shpchp lpc_ich
> > > > > > > mei_me mei wmi ipmi_si mac_hid acpi_power_meter 8021q garp stp
> > > > > > > mrp llc bonding lp parport nfsd auth_rpcgss nfs_acl nfs lockd
> > > > > > > sunrpc fscache hid_generic igb ixgbe i2c_algo_bit usbhid dca hid
> > > > > > > ptp ahci libahci pps_core megaraid_sas mdio
> > > > > > > [   32.843936] CPU: 18 PID: 5030 Comm: tr Not tainted
> > > > > > > 3.13.0-30-generic #54-Ubuntu
> > > > > > > [   32.860163] Hardware name: Dell Inc. PowerEdge R620/0PXXHP,
> > > > > > > BIOS 1.6.0 03/07/2013
> > > > > > > [   32.876774] task: ffff880417b15fc0 ti: ffff8804273f4000
> > > > > > > task.ti: ffff8804273f4000
> > > > > > > [   32.893384] RIP: 0010:[<ffffffff811a19c5>]
> > > > > > > [<ffffffff811a19c5>] kmem_cache_alloc+0x75/0x1e0
> > > > > > > [   32.912198] RSP: 0018:ffff8804273f5d40  EFLAGS: 00010286
> > > > > > > [   32.924015] RAX: 0000000000000000 RBX: 0000000000000000
> > > > > > > RCX: 00000000000011ed
> > > > > > > [   32.939856] RDX: 00000000000011ec RSI: 00000000000080d0
> > > > > > > RDI: ffff88042f803700
> > > > > > > [   32.955696] RBP: ffff8804273f5d70 R08: 0000000000017260
> > > > > > > R09: ffffffff811be63c
> > > > > > > [   32.971559] R10: 8080808080808080 R11: 0000000000000000
> > > > > > > R12: 7d10f8ec0c3cb928
> > > > > > > [   32.987421] R13: 00000000000080d0 R14: ffff88042f803700
> > > > > > > R15: ffff88042f803700
> > > > > > > [   33.003284] FS:  0000000000000000(0000)
> > > > > > > GS:ffff88042fd20000(0000) knlGS:0000000000000000
> > > > > > > [   33.021281] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > > [   33.034068] CR2: 00007f01a8fced40 CR3: 000000040e52f000
> > > > > > > CR4: 00000000000407e0
> > > > > > > [   33.049929] Stack:
> > > > > > > [   33.054456]  ffffffff811be63c 0000000000000000
> > > > > > > ffff88041be52780 ffff880428052000
> > > > > > > [   33.071259]  ffff8804273f5f2c 00000000ffffff9c
> > > > > > > ffff8804273f5d98 ffffffff811be63c
> > > > > > > [   33.088084]  0000000000000080 ffff8804273f5f2c
> > > > > > > ffff8804273f5e40 ffff8804273f5e30
> > > > > > > [   33.104908] Call Trace:
> > > > > > > [   33.110399]  [<ffffffff811be63c>] ? get_empty_filp+0x5c/0x180
> > > > > > > [   33.123188]  [<ffffffff811be63c>] get_empty_filp+0x5c/0x180
> > > > > > > [   33.135593]  [<ffffffff811cc03d>] path_openat+0x3d/0x620
> > > > > > > [   33.147422]  [<ffffffff811cd47a>] do_filp_open+0x3a/0x90
> > > > > > > [   33.159250]  [<ffffffff811a1985>] ? kmem_cache_alloc+0x35/0x1e0
> > > > > > > [   33.172405]  [<ffffffff811cc6bf>] ? getname_flags+0x4f/0x190
> > > > > > > [   33.185004]  [<ffffffff811da237>] ? __alloc_fd+0xa7/0x130
> > > > > > > [   33.197025]  [<ffffffff811bbb99>] do_sys_open+0x129/0x280
> > > > > > > [   33.209049]  [<ffffffff81020d25>] ? syscall_trace_enter+0x145/0x250
> > > > > > > [   33.222992]  [<ffffffff811bbd0e>] SyS_open+0x1e/0x20
> > > > > > > [   33.234053]  [<ffffffff8172aeff>] tracesys+0xe1/0xe6
> > > > > > > [   33.245112] Code: dc 00 00 49 8b 50 08 4d 8b 20 49 8b 40 10
> > > > > > > 4d 85 e4 0f 84 17 01 00 00 48 85 c0 0f 84 0e 01 00 00 49 63 46
> > > > > > > 20 48 8d 4a 01 4d 8b 06 <49> 8b 1c 04 4c 89 e0 65 49 0f c7 08
> > > > > > > 0f 94 c0 84 c0 74 b9 49 63
> > > > > > > [   33.292549] RIP  [<ffffffff811a19c5>] kmem_cache_alloc+0x75/0x1e0
> > > > > > > [   33.306192]  RSP <ffff8804273f5d40>
> > > > > >
> > > > > > Hi James,
> > > > > >
> > > > > > Are all the stacktraces the same?  When are those rbd images
> > > > > > mapped - during boot with some sort of init script?  Can you
> > > > > > attach the entire dmesg?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > >                 Ilya
> > > >
> > > > --
> > > > Christian Balzer        Network/Systems Engineer
> > > > chibi at gol.com           Global OnLine Japan/Fusion Communications
> > > > http://www.gol.com/
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi at gol.com           Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
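Sage's request above - the caps lines from the auth output, without the secret keys - can be gathered with a simple filter. This is a sketch; the entity name, key, and pool names below are made-up sample output standing in for a live cluster, where you would pipe `ceph auth list` into the same grep:

```shell
# Keep entity names and capability lines, drop the `key:` lines so no
# secrets end up in the mail. The here-doc is sample output only.
cat <<'EOF' | grep -v 'key:'
client.export01
    key: AQDexampleOnlyNotARealSecretKeyxxxxxxxx==
    caps: [mon] allow r
    caps: [osd] allow rwx pool=shares, allow rwx pool=newpool
EOF
```

This prints the entity line and both caps lines while suppressing the key line.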
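On the ~250 mapping limit for pre-3.14 kernels discussed in the thread: a quick way to see how close a node is, assuming the rbd kernel module's usual sysfs layout where each mapped image appears as a numbered directory:

```shell
# Count currently mapped rbd devices via sysfs. On a host without the
# rbd module loaded the directory is absent and this simply prints 0.
ls /sys/bus/rbd/devices 2>/dev/null | wc -l
```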
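The kind of change James describes (granting rwx on an additional pool to the export node's key) is typically made with `ceph auth caps`, which replaces the entity's entire cap set in one shot, so any existing caps must be restated. A sketch only - the entity and pool names are invented, and running it requires a live cluster and an admin keyring:

```shell
# Hypothetical cap update: restate the existing mon/osd caps and add
# rwx on a second pool. Names are placeholders for illustration.
ceph auth caps client.export01 \
    mon 'allow r' \
    osd 'allow rwx pool=shares, allow rwx pool=newpool'
```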