Hi Ext3! I'm running about 30 dedicated MySQL machines under quite decent loads, and they are occassionally crashing. I've been logging console messages recently in an effort to find the cause, and some appear to be related to I perused your lists and found the message I'm replying to. If you don't mind, I've included messages and ksymoops from two crashes that I had recently. Both were different. I'm not sure if you have fixes for them in the new kernel, so I'll be upgrading a few machines tonight. I'm running 2.6.10 with the "data=journal" mount option. Is that the best / safest option for running with MySQL? In any case, I'm logging all console messages now, so hopefully I can have more ksymoops output for you soon enough. I've included the output for each below. Thank you for your time! Jason First Machine ("Scratchy") ========================== Assertion failure in __journal_drop_transaction() at fs/jbd/checkpoint.c:613: "transaction->t_forget == NULL" ------------[ cut here ]------------ kernel BUG at fs/jbd/checkpoint.c:613! invalid operand: 0000 [#1] SMP CPU: 2 EIP: 0060:[<c01f8404>] Not tainted VLI EFLAGS: 00010282 (2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189) EIP is at __journal_drop_transaction+0x128/0x290 eax: 00000071 ebx: d1abf680 ecx: c04ea524 edx: 00000286 esi: f6877400 edi: 00000013 ebp: d1abf680 esp: f5e59dc0 ds: 007b es: 007b ss: 0068 Process kjournald (pid: 1086, threadinfo=f5e58000 task=f6801a60) Stack: c04295a0 c0429764 c0429567 00000265 c0429805 f6877400 f188cf8c c01f71c1 f6877400 d1abf680 f6877400 f6877414 f6877414 00000000 f68774c0 f6877454 f687743c d1abf6b8 f6877414 f6877414 ed3021b8 f6877478 f5e58000 00000000 Call Trace: [<c01f71c1>] journal_commit_transaction+0xf09/0xf68 [<c0163281>] rcu_check_quiescent_state+0x55/0x64 [<c016328b>] rcu_check_quiescent_state+0x5f/0x64 [<c0165f6c>] autoremove_wake_function+0x0/0x40 [<c0165f6c>] autoremove_wake_function+0x0/0x40 [<c015014b>] find_busiest_group+0xeb/0x2d8 [<c0150c3b>] scheduler_tick+0x443/0x450 [<c015c30c>] run_timer_softirq+0x150/0x158 [<c015882a>] __do_softirq+0x6a/0xd4 [<c01687b5>] irq_exit+0x2d/0x30 [<c014b5b2>] smp_apic_timer_interrupt+0xce/0xd4 [<c01411ec>] apic_timer_interrupt+0x1c/0x30 [<c015bcc3>] del_timer_sync+0xa3/0xdc [<c01f9013>] kjournald+0xd3/0x228 [<c01f8f40>] kjournald+0x0/0x228 [<c0165f6c>] autoremove_wake_function+0x0/0x40 [<c0165f6c>] autoremove_wake_function+0x0/0x40 [<c01f8f30>] commit_timeout+0x0/0xc [<c013e849>] kernel_thread_helper+0x5/0xc Code: 95 42 c0 83 c4 14 90 83 7b 24 00 74 2a 68 05 98 42 c0 68 65 02 00 00 68 67 95 42 c0 68 64 97 42 c0 68 a0 95 42 c0 e8 50 c4 f5 ff <0f> 0b 65 02 67 95 42 c0 83 c4 14 90 83 7b 2c 00 74 2a 68 40 98 scratchy: 07:17pm# ksymoops -m /boot/System.map-2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189 /tmp/scratchy.apr30th.or.something.edited ksymoops 2.4.5 on i686 2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189. Options used -V (default) -k /proc/ksyms (default) -l /proc/modules (default) -o /lib/modules/2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189/ (default) -m /boot/System.map-2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189 (specified) Error (regular_file): read_ksyms stat /proc/ksyms failed ksymoops: No such file or directory No modules in ksyms, skipping objects No ksyms, skipping lsmod kernel BUG at fs/jbd/checkpoint.c:613! invalid operand: 0000 [#1] CPU: 2 EIP: 0060:[<c01f8404>] Not tainted VLI Using defaults from ksymoops -t elf32-i386 -a i386 EFLAGS: 00010282 (2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189) eax: 00000071 ebx: d1abf680 ecx: c04ea524 edx: 00000286 esi: f6877400 edi: 00000013 ebp: d1abf680 esp: f5e59dc0 ds: 007b es: 007b ss: 0068 Stack: c04295a0 c0429764 c0429567 00000265 c0429805 f6877400 f188cf8c c01f71c1 f6877400 d1abf680 f6877400 f6877414 f6877414 00000000 f68774c0 f6877454 f687743c d1abf6b8 f6877414 f6877414 ed3021b8 f6877478 f5e58000 00000000 [<c01f71c1>] journal_commit_transaction+0xf09/0xf68 [<c0163281>] rcu_check_quiescent_state+0x55/0x64 [<c016328b>] rcu_check_quiescent_state+0x5f/0x64 [<c0165f6c>] autoremove_wake_function+0x0/0x40 [<c0165f6c>] autoremove_wake_function+0x0/0x40 [<c015014b>] find_busiest_group+0xeb/0x2d8 [<c0150c3b>] scheduler_tick+0x443/0x450 [<c015c30c>] run_timer_softirq+0x150/0x158 [<c015882a>] __do_softirq+0x6a/0xd4 [<c01687b5>] irq_exit+0x2d/0x30 [<c014b5b2>] smp_apic_timer_interrupt+0xce/0xd4 [<c01411ec>] apic_timer_interrupt+0x1c/0x30 [<c015bcc3>] del_timer_sync+0xa3/0xdc [<c01f9013>] kjournald+0xd3/0x228 [<c01f8f40>] kjournald+0x0/0x228 [<c0165f6c>] autoremove_wake_function+0x0/0x40 [<c0165f6c>] autoremove_wake_function+0x0/0x40 [<c01f8f30>] commit_timeout+0x0/0xc [<c013e849>] kernel_thread_helper+0x5/0xc Code: 95 42 c0 83 c4 14 90 83 7b 24 00 74 2a 68 05 98 42 c0 68 65 02 00 00 68 67 95 42 c0 68 64 97 42 c0 68 a0 95 42 c0 e8 50 c4 f5 ff <0f> 0b 65 02 67 95 42 c0 83 c4 14 90 83 7b 2c 00 74 2a 68 40 98 >>EIP; c01f8404 <__journal_drop_transaction+128/290> <===== >>ebx; d1abf680 <pg0+1147f680/3f9be400> >>ecx; c04ea524 <log_wait+4/c> >>esi; f6877400 <pg0+36237400/3f9be400> >>ebp; d1abf680 <pg0+1147f680/3f9be400> >>esp; f5e59dc0 <pg0+35819dc0/3f9be400> Code; c01f83d9 <__journal_drop_transaction+fd/290> 00000000 <_EIP>: Code; c01f83d9 <__journal_drop_transaction+fd/290> 0: 95 xchg %eax,%ebp Code; c01f83da <__journal_drop_transaction+fe/290> 1: 42 inc %edx Code; c01f83db <__journal_drop_transaction+ff/290> 2: c0 83 c4 14 90 83 7b rolb $0x7b,0x839014c4(%ebx) Code; c01f83e2 <__journal_drop_transaction+106/290> 9: 24 00 and $0x0,%al Code; c01f83e4 <__journal_drop_transaction+108/290> b: 74 2a je 37 <_EIP+0x37> c01f8410 <__journal_drop_transaction+134/290> Code; c01f83e6 <__journal_drop_transaction+10a/290> d: 68 05 98 42 c0 push $0xc0429805 Code; c01f83eb <__journal_drop_transaction+10f/290> 12: 68 65 02 00 00 push $0x265 Code; c01f83f0 <__journal_drop_transaction+114/290> 17: 68 67 95 42 c0 push $0xc0429567 Code; c01f83f5 <__journal_drop_transaction+119/290> 1c: 68 64 97 42 c0 push $0xc0429764 Code; c01f83fa <__journal_drop_transaction+11e/290> 21: 68 a0 95 42 c0 push $0xc04295a0 Code; c01f83ff <__journal_drop_transaction+123/290> 26: e8 50 c4 f5 ff call fff5c47b <_EIP+0xfff5c47b> c0154854 <printk+0/14> Code; c01f8404 <__journal_drop_transaction+128/290> <===== 2b: 0f 0b ud2a <===== Code; c01f8406 <__journal_drop_transaction+12a/290> 2d: 65 02 67 95 add %gs:0xffffff95(%edi),%ah Code; c01f840a <__journal_drop_transaction+12e/290> 31: 42 inc %edx Code; c01f840b <__journal_drop_transaction+12f/290> 32: c0 83 c4 14 90 83 7b rolb $0x7b,0x839014c4(%ebx) Code; c01f8412 <__journal_drop_transaction+136/290> 39: 2c 00 sub $0x0,%al Code; c01f8414 <__journal_drop_transaction+138/290> 3b: 74 2a je 67 <_EIP+0x67> c01f8440 <__journal_drop_transaction+164/290> Code; c01f8416 <__journal_drop_transaction+13a/290> 3d: 68 .byte 0x68 Code; c01f8417 <__journal_drop_transaction+13b/290> 3e: 40 inc %eax Code; c01f8418 <__journal_drop_transaction+13c/290> 3f: 98 cwtl 1 error issued. Results may not be reliable. scratchy: 07:17pm# Second Machine ("Tib") ====================== Unable to handle kernel NULL pointer dereference at virtual address 00000004 ^M printing eip: ^Mc01fab35 ^M*pgd = c040fa1800000000 ^M*pmd = 0000000000000000 ^MOops: 0000 [#1] ^MSMP ^MCPU: 2 ^MEIP: 0060:[<c01fab35>] Not tainted VLI ^MEFLAGS: 00010246 (2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189) ^MEIP is at __journal_remove_journal_head+0x9/0x130 ^Meax: 00000000 ebx: 00000000 ecx: f7d4f200 edx: 00000014 ^Mesi: f1920320 edi: 00000013 ebp: c0d6f280 esp: f46d5dcc ^Mds: 007b es: 007b ss: 0068 ^MProcess kjournald (pid: 1091, threadinfo=f46d4000 task=f597ca60) ^MStack: f1920320 da3ee14c c01fac83 f1920320 f1920320 c01f70fb f1920320 f7d4f400 ^M f7d4f414 f7d4f414 00000000 f7d4f4c0 f7d4f454 f7d4f43c c0d6f2b8 f7d4f414 ^M f7d4f414 eba83db8 f7d4f478 f46d4000 00000000 00000ebc d7c53144 00000000 ^MCall Trace: ^M [<c01fac83>] journal_remove_journal_head+0x27/0x44 ^M [<c01f70fb>] journal_commit_transaction+0xe43/0xf68 ^M [<c0199b57>] d_callback+0x27/0x2c ^M [<c0165f6c>] autoremove_wake_function+0x0/0x40 ^M [<c0165f6c>] autoremove_wake_function+0x0/0x40 ^M [<c01f9013>] kjournald+0xd3/0x228 ^M [<c01f8f40>] kjournald+0x0/0x228 ^M [<c0165f6c>] autoremove_wake_function+0x0/0x40 ^M [<c0165f6c>] autoremove_wake_function+0x0/0x40 ^M [<c01f8f30>] commit_timeout+0x0/0xc ^M [<c013e849>] kernel_thread_helper+0x5/0xc ^MCode: 74 06 8b 5a 28 ff 43 04 8b 02 a9 00 00 10 00 75 08 0f 0b 19 02 c0 9b 42 c0 f0 0f ba 32 14 89 d8 5b c3 56 53 8b 74 24 0c 8b 5e 28 <83> 7b 04 00 7d 29 68 e0 a4 42 c0 68 e3 06 00 00 68 cc 9c 42 c0 tib: 07:00pm# ksymoops -m /boot/System.map-2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189/root/oops ksymoops 2.4.5 on i686 2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189. Options used -V (default) -k /proc/ksyms (default) -l /proc/modules (default) -o /lib/modules/2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189/ (default) -m /boot/System.map-2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189(specified) Error (regular_file): read_ksyms stat /proc/ksyms failed ksymoops: No such file or directory No modules in ksyms, skipping objects No ksyms, skipping lsmod Unable to handle kernel NULL pointer dereference at virtual address 00000004 c01fab35 *pgd = c040fa1800000000 Oops: 0000 [#1] CPU: 2 EIP: 0060:[<c01fab35>] Not tainted VLI Using defaults from ksymoops -t elf32-i386 -a i386 EFLAGS: 00010246 (2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189) eax: 00000000 ebx: 00000000 ecx: f7d4f200 edx: 00000014 esi: f1920320 edi: 00000013 ebp: c0d6f280 esp: f46d5dcc ds: 007b es: 007b ss: 0068 Stack: f1920320 da3ee14c c01fac83 f1920320 f1920320 c01f70fb f1920320 f7d4f400 f7d4f414 f7d4f414 00000000 f7d4f4c0 f7d4f454 f7d4f43c c0d6f2b8 f7d4f414 f7d4f414 eba83db8 f7d4f478 f46d4000 00000000 00000ebc d7c53144 00000000 [<c01fac83>] journal_remove_journal_head+0x27/0x44 [<c01f70fb>] journal_commit_transaction+0xe43/0xf68 [<c0199b57>] d_callback+0x27/0x2c [<c0165f6c>] autoremove_wake_function+0x0/0x40 [<c0165f6c>] autoremove_wake_function+0x0/0x40 [<c01f9013>] kjournald+0xd3/0x228 [<c01f8f40>] kjournald+0x0/0x228 [<c0165f6c>] autoremove_wake_function+0x0/0x40 [<c0165f6c>] autoremove_wake_function+0x0/0x40 [<c01f8f30>] commit_timeout+0x0/0xc [<c013e849>] kernel_thread_helper+0x5/0xc Code: 74 06 8b 5a 28 ff 43 04 8b 02 a9 00 00 10 00 75 08 0f 0b 19 02 c0 9b 42 c0 f0 0f ba 32 14 89 d8 5b c3 56 53 8b 74 24 0c 8b 5e 28 <83> 7b 04 00 7d 29 68 e0 a4 42 c0 68 e3 06 00 00 68 cc 9c 42 c0 >>EIP; c01fab35 <__journal_remove_journal_head+9/130> <===== >>ecx; f7d4f200 <pg0+3770f200/3f9be400> >>esi; f1920320 <pg0+312e0320/3f9be400> >>ebp; c0d6f280 <pg0+72f280/3f9be400> >>esp; f46d5dcc <pg0+34095dcc/3f9be400> Code; c01fab0a <journal_grab_journal_head+2e/50> 00000000 <_EIP>: Code; c01fab0a <journal_grab_journal_head+2e/50> 0: 74 06 je 8 <_EIP+0x8> c01fab12 <journal_grab_journal_head+36/50> Code; c01fab0c <journal_grab_journal_head+30/50> 2: 8b 5a 28 mov 0x28(%edx),%ebx Code; c01fab0f <journal_grab_journal_head+33/50> 5: ff 43 04 incl 0x4(%ebx) Code; c01fab12 <journal_grab_journal_head+36/50> 8: 8b 02 mov (%edx),%eax Code; c01fab14 <journal_grab_journal_head+38/50> a: a9 00 00 10 00 test $0x100000,%eax Code; c01fab19 <journal_grab_journal_head+3d/50> f: 75 08 jne 19 <_EIP+0x19> c01fab23 <journal_grab_journal_head+47/50> Code; c01fab1b <journal_grab_journal_head+3f/50> 11: 0f 0b ud2a Code; c01fab1d <journal_grab_journal_head+41/50> 13: 19 02 sbb %eax,(%edx) Code; c01fab1f <journal_grab_journal_head+43/50> 15: c0 9b 42 c0 f0 0f ba rcrb $0xba,0xff0c042(%ebx) Code; c01fab26 <journal_grab_journal_head+4a/50> 1c: 32 14 89 xor (%ecx,%ecx,4),%dl Code; c01fab29 <journal_grab_journal_head+4d/50> 1f: d8 5b c3 fcomps 0xffffffc3(%ebx) Code; c01fab2c <__journal_remove_journal_head+0/130> 22: 56 push %esi Code; c01fab2d <__journal_remove_journal_head+1/130> 23: 53 push %ebx Code; c01fab2e <__journal_remove_journal_head+2/130> 24: 8b 74 24 0c mov 0xc(%esp,1),%esi Code; c01fab32 <__journal_remove_journal_head+6/130> 28: 8b 5e 28 mov 0x28(%esi),%ebx Code; c01fab35 <__journal_remove_journal_head+9/130> <===== 2b: 83 7b 04 00 cmpl $0x0,0x4(%ebx) <===== Code; c01fab39 <__journal_remove_journal_head+d/130> 2f: 7d 29 jge 5a <_EIP+0x5a> c01fab64 <__journal_remove_journal_head+38/130> Code; c01fab3b <__journal_remove_journal_head+f/130> 31: 68 e0 a4 42 c0 push $0xc042a4e0 Code; c01fab40 <__journal_remove_journal_head+14/130> 36: 68 e3 06 00 00 push $0x6e3 Code; c01fab45 <__journal_remove_journal_head+19/130> 3b: 68 cc 9c 42 c0 push $0xc0429ccc On Mon, May 02, 2005 at 06:17:35PM -0700, Jason Sydes wrote: > Mike Fedyk <mfedyk@xxxxxxxxxxxxx> writes: > > > Nicolas Kowalski wrote: > >> Mike Fedyk <mfedyk@xxxxxxxxxxxxx> writes: > >> > >>>Nicolas Kowalski wrote: > >>> > >>>>I will try to reproduce these errors on a non-production server now. > >>> > >>>Beautiful. > >>> > >>>It might be good if you put a stack_dump() call just after the > >>>printk() call in the ext3 source. > >> I apologize, (I am not familiar with kernel debugging), but when > >> compiling the kernel with this call inserted after the printk in the > >> sources, it fails with an resolved symbol error. ... > >> fs/fs.o: In function `__jbd_unexpected_dirty_buffer': > >> fs/fs.o(.text+0x3ab8a): undefined reference to `stack_dump' > >> ... > >> I must be missing an option, but which one ? > > > > Oh crap. It's called dump_stack(). > > Ok. I had another similar error this morning: > > Unexpected dirty buffer encountered at do_get_write_access:618 (08:11 > blocknr 920701) > dba1fddc dba1fe04 c017565e c03054a0 c0305483 c030373b 0000026a c03fc5e0 > 000e0c7d d1072580 dba1fe4c c016f76b c030373b 0000026a d34f1d80 > d1072580 > df4c1e94 d34f1d80 c01701dd 00000000 00000000 00000003 df4c1e00 > d3615430 > Call Trace: [<c017565e>] [<c016f76b>] [<c01701dd>] [<c016fc10>] > [<c0167c88>] > [<c0167c4e>] [<c0167e21>] [<c0167c74>] [<c012f67a>] [<c012fb14>] > [<c01657e2>] > [<c013c807>] [<c0108be3>] > > > ksymoops gives me: > > Trace; c017565e <__jbd_unexpected_dirty_buffer+3a/74> > Trace; c016f76b <do_get_write_access+10f/570> > Trace; c01701dd <journal_dirty_metadata+15d/188> > Trace; c016fc10 <journal_get_write_access+44/68> > Trace; c0167c88 <do_journal_get_write_access+14/58> > Trace; c0167c4e <walk_page_buffers+52/78> > Trace; c0167e21 <ext3_prepare_write+155/208> > Trace; c0167c74 <do_journal_get_write_access+0/58> > Trace; c012f67a <do_generic_file_write+226/408> > Trace; c012fb14 <generic_file_write+f4/10c> > Trace; c01657e2 <ext3_file_write+1e/9c> > Trace; c013c807 <sys_write+93/108> > Trace; c0108be3 <system_call+33/38> > > > Does this help ? > > -- > Nicolas > > > _______________________________________________ > > Ext3-users@xxxxxxxxxx > https://www.redhat.com/mailman/listinfo/ext3-users > _______________________________________________ Ext3-users@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/ext3-users