We've built a reasonably large GFS filesystem as nearline storage for our
production servers. It's currently about 45 TB in total and 80% full. The
only unusual(ish) aspect of what we're storing there is a huge number of
hard links.
We're using dirvish to maintain incremental backups. It uses rsync to
create hard links to unchanged files, which reduces the storage
requirements. Each night it runs an expire command that deletes all files
older than 10 days, which means several million files need to be unlinked
each day.
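
To illustrate the hard-link scheme described above, here is a minimal
Python sketch of what a backup run and a later expire do to a single
unchanged file. The directory and file names (day1, day2, data.bin) are
made up for illustration; this is not dirvish's or rsync's actual code,
only the link(2)/unlink(2) pattern they end up issuing.

```python
import os
import tempfile


def hardlink_demo():
    """Return the link counts seen while linking and unlinking one file."""
    counts = []
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical backup images for two consecutive nights.
        day1 = os.path.join(root, "day1")
        day2 = os.path.join(root, "day2")
        os.mkdir(day1)
        os.mkdir(day2)

        original = os.path.join(day1, "data.bin")
        with open(original, "wb") as fh:
            fh.write(b"unchanged payload")

        # The next image reuses the unchanged file via link(2); no data is copied.
        linked = os.path.join(day2, "data.bin")
        os.link(original, linked)
        counts.append(os.stat(linked).st_nlink)  # two names, one inode

        # Expiring the old image is one unlink(2) per file.
        os.unlink(original)
        counts.append(os.stat(linked).st_nlink)  # one name left, data intact

        with open(linked, "rb") as fh:
            payload = fh.read()
    return counts, payload


if __name__ == "__main__":
    print(hardlink_demo())
```

Every file in every image therefore costs one link on the way in and one
unlink on the way out, which is how the nightly expire reaches several
million unlinks.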
Pretty much every night, one of our two servers attached to the GFS
crashes with the stack trace below. The unlink appears to be where it
goes wrong. It has also occurred, although less often, during the rsyncs
(which run at the same time), so we're wondering whether there's some
issue with locking/linking.
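
As an aid to reading the "gfs1: num=... err=... cur=... req=... lkf=..."
lines in the traces below, here is a rough Python sketch that decodes
them. It rests on assumptions: that num and lkf are printed in hex, and
that cur/req follow the usual DLM mode numbering (NL=0 through EX=5).
The helper names are made up for this example.

```python
import errno
import re

# Assumption: standard DLM lock-mode numbering.
DLM_MODES = {0: "NL", 1: "CR", 2: "CW", 3: "PR", 4: "PW", 5: "EX"}

# Assumption: num and lkf are hexadecimal, err/cur/req are decimal.
LINE_RE = re.compile(
    r"num=(?P<type>[0-9a-f]+),(?P<number>[0-9a-f]+) "
    r"err=(?P<err>-?\d+) cur=(?P<cur>\d+) req=(?P<req>\d+) "
    r"lkf=(?P<lkf>[0-9a-f]+)"
)


def decode_lock_line(line):
    """Split a lock_dlm status line into named, decoded fields."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError("not a lock_dlm status line")
    err = int(match.group("err"))
    return {
        "lock_type": int(match.group("type"), 16),
        "lock_number": int(match.group("number"), 16),
        "errno": errno.errorcode.get(-err, "unknown"),
        "current_mode": DLM_MODES.get(int(match.group("cur")), "?"),
        "requested_mode": DLM_MODES.get(int(match.group("req")), "?"),
        "flags": int(match.group("lkf"), 16),
    }


def as_signed32(value):
    """Interpret a 32-bit register value as a signed integer."""
    return value - (1 << 32) if value & 0x80000000 else value


if __name__ == "__main__":
    print(decode_lock_line("gfs1: num=2,8c8d370d err=-22 cur=3 req=5 lkf=44"))
    print(as_signed32(0xFFFFFFEA))
```

If those assumptions hold, both traces show the same thing: a request to
go from mode 3 to mode 5 on a lock fails with errno 22 (EINVAL), once
under gfs_link and once under gfs_unlink. The RBX value 00000000ffffffea
in the register dumps is the same -22 as a signed 32-bit number.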
Has anybody seen similar issues, or does anybody have any good ideas?
Can we provide any more useful information?
Stephen
lock_dlm: Assertion failed on line 428 of file
/usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 4435245392
gfs1: num=2,8c8d370d err=-22 cur=3 req=5 lkf=44
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:428
invalid operand: 0000 [1]
nfsd parport_pc i2c_core cman pcmcia_core dm_mirror battery tg3 sd_mod
Pid: 1888, comm: rsync Not tainted 2.6.9-34smp
<ffffffffa0208956>{:lock_dlm:do_dlm_lock+365}RSP: 0018:000001000d317b88
EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 000000000003b690
RDX: 0000000000000246 RSI: 0000000000000246 RDI: ffffffff8038e160
RBP: 000001006f70b980 R08: 0000000000000004 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff80209c6f R12: 00000100f4d1fc00
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
FS: 0000002a955863a0(0000) GS:ffffffff804b6040(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a985a1008 CR3: 0000000000101000 CR4: 00000000000006e0
Process rsync (pid: 1888, threadinfo 000001000d316000, task
000001005d6ee030)
Stack:
<ffffffff8011d5cc>{flat_send_IPI_mask+0}
<ffffffffa01d8595>{:gfs:gfs_lm_lock+50}
<ffffffffa01cf683>{:gfs:gfs_glock_xmote_th+357}
<ffffffffa01cd985>{:gfs:run_queue+668}
<ffffffffa01ce977>{:gfs:gfs_glock_nq+938}
<ffffffffa01ceb6f>{:gfs:gfs_glock_nq_m+416}<ffffffffa01e507d>{:gfs:gfs_link+149}
<ffffffff801813ac>{vfs_link+308} <ffffffff801814a1>{sys_link+158}
<ffffffff801101be>{system_call+126}
Code: 0f 0b d8 ce 20 a0 ff ff ff ff ac 01 48 c7 c7 dd ce 20 a0 31
RIP <ffffffffa0208956>{:lock_dlm:do_dlm_lock+365} RSP <000001000d317b88>
Modules linked in: nfs nfsd exportfs lockd nfs_acl parport_pc lp parport
netconsole netdump i2c_dev i2c_core lock_dlm gfs lock_harness dlm cman
ipv6 sunrpc ds yenta_socket pcmcia_core dm_mirror dm_multipath dm_mod
button battery ac ohci_hcd hw_random shpchp tg3 qla2400 qla2xxx
scsi_transport_fc cciss sd_mod scsi_mod
Pid: 1888, comm: rsync Not tainted 2.6.9-34smp
RIP: 0010:[<ffffffffa0208956>] <ffffffffa0208956>{:lock_dlm:do_dlm_lock+365}
RSP: 0018:000001000d317b88 EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 000000000003b690
RDX: 0000000000000246 RSI: 0000000000000246 RDI: ffffffff8038e160
RBP: 000001006f70b980 R08: 0000000000000004 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff80209c6f R12: 00000100f4d1fc00
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
FS: 0000002a955863a0(0000) GS:ffffffff804b6040(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a985a1008 CR3: 0000000000101000 CR4: 00000000000006e0
Call Trace:<ffffffffa0208956>{:lock_dlm:do_dlm_lock+365}
<ffffffffa0208a92>{:lock_dlm:lm_dlm_lock+214}
<ffffffff8011d5cc>{flat_send_IPI_mask+0}
<ffffffffa01d8595>{:gfs:gfs_lm_lock+50}
<ffffffffa01cf683>{:gfs:gfs_glock_xmote_th+357}
<ffffffffa01cd985>{:gfs:run_queue+668}
<ffffffffa01ce977>{:gfs:gfs_glock_nq+938}
<ffffffffa01ceb6f>{:gfs:gfs_glock_nq_m+416}
<ffffffffa01e507d>{:gfs:gfs_link+149}
<ffffffff801813ac>{vfs_link+308}
<ffffffff801814a1>{sys_link+158} <ffffffff801101be>{system_call+126}
...
...
lock_dlm: Assertion failed on line 428 of file
/usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 4438575995
gfs1: num=2,23fc019f err=-22 cur=3 req=5 lkf=44
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:428
invalid operand: 0000 [1]
parport_pc i2c_dev i2c_core cman pcmcia_core dm_mirror battery tg3 sd_mod
Pid: 5463, comm: rm Not tainted 2.6.9-34smp
<ffffffffa0208956>{:lock_dlm:do_dlm_lock+365}
RSP: 0018:000001001a6dfc18 EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 0000000000006f46
RBP: 000001009cdb34c0 R08: 0000000000000004 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff80209c6f R12: 0000010037c61200
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
FS: 0000002a95585b00(0000) GS:ffffffff804b60c0(0000) knlGS:00000000f7fce6c0
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a9759afe8 CR3: 00000000f5798000 CR4: 00000000000006e0
Process rm (pid: 5463, threadinfo 000001001a6de000, task 00000100d45b4030)
Stack: 2020202020202020
0000000000000003
0000000000000001 0000000000000000
Call Trace:<ffffffffa0208a92>{:lock_dlm:lm_dlm_lock+214}
<ffffffffa01d8595>{:gfs:gfs_lm_lock+50}
<ffffffffa01cf683>{:gfs:gfs_glock_xmote_th+357}
<ffffffffa01cd985>{:gfs:run_queue+668}
<ffffffffa01ce977>{:gfs:gfs_glock_nq+938}
<ffffffffa01ceb6f>{:gfs:gfs_glock_nq_m+416}
<ffffffffa01e53d5>{:gfs:gfs_unlink+133}
<ffffffff80180ec0>{vfs_unlink+439}
<ffffffff80180fc5>{sys_unlink+185}
<ffffffff80183707>{sys_getdents64+166}
<ffffffff801101be>{system_call+126}
Code: 0f 0b d8 ce 20 a0 ff ff ff ff ac 01 48 c7 c7 dd ce 20 a0 31
RIP <ffffffffa0208956>{:lock_dlm:do_dlm_lock+365} RSP <000001001a6dfc18>
Modules linked in: nfs nfsd exportfs lockd nfs_acl parport_pc lp parport
netconsole netdump i2c_dev i2c_core lock_dlm gfs lock_harness dlm cman
ipv6 sunrpc ds yenta_socket pcmcia_core dm_mirror dm_multipath dm_mod
button battery ac ohci_hcd hw_random shpchp tg3 qla2400 qla2xxx
scsi_transport_fc cciss sd_mod scsi_mod
Pid: 5463, comm: rm Not tainted 2.6.9-34smp
RIP: 0010:[<ffffffffa0208956>] <ffffffffa0208956>{:lock_dlm:do_dlm_lock+365}
RSP: 0018:000001001a6dfc18 EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 0000000000006f46
RDX: 0000000000000246 RSI: 0000000000000246 RDI: ffffffff8038e160
RBP: 000001009cdb34c0 R08: 0000000000000004 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff80209c6f R12: 0000010037c61200
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
FS: 0000002a95585b00(0000) GS:ffffffff804b60c0(0000) knlGS:00000000f7fce6c0
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a9759afe8 CR3: 00000000f5798000 CR4: 00000000000006e0
Call Trace:<ffffffffa0208956>{:lock_dlm:do_dlm_lock+365}
<ffffffffa0208a92>{:lock_dlm:lm_dlm_lock+214}
<ffffffffa01d8595>{:gfs:gfs_lm_lock+50}
<ffffffffa01cf683>{:gfs:gfs_glock_xmote_th+357}
<ffffffffa01cd985>{:gfs:run_queue+668}
<ffffffffa01ce977>{:gfs:gfs_glock_nq+938}
<ffffffffa01ceb6f>{:gfs:gfs_glock_nq_m+416}
<ffffffffa01e53d5>{:gfs:gfs_unlink+133}
<ffffffff80180ec0>{vfs_unlink+439} <ffffffff80180fc5>{sys_unlink+185}
<ffffffff80183707>{sys_getdents64+166}
<ffffffff801101be>{system_call+126}
...
...
--
Stephen Willey
Systems Engineer, Framestore-CFC
+44 (0)207 344 8000
http://www.framestore-cfc.com
--- Begin Message ---
A shortened example of a linking crash (rsync) and an unlinking crash
(rm).
Daire
lock_dlm: Assertion failed on line 428 of file /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 4435245392
gfs1: num=2,8c8d370d err=-22 cur=3 req=5 lkf=44
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:428
invalid operand: 0000 [1]
nfsd parport_pc i2c_core cman pcmcia_core dm_mirror battery tg3 sd_mod
Pid: 1888, comm: rsync Not tainted 2.6.9-34smp
<ffffffffa0208956>{:lock_dlm:do_dlm_lock+365}RSP: 0018:000001000d317b88 EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 000000000003b690
RDX: 0000000000000246 RSI: 0000000000000246 RDI: ffffffff8038e160
RBP: 000001006f70b980 R08: 0000000000000004 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff80209c6f R12: 00000100f4d1fc00
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
FS: 0000002a955863a0(0000) GS:ffffffff804b6040(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a985a1008 CR3: 0000000000101000 CR4: 00000000000006e0
Process rsync (pid: 1888, threadinfo 000001000d316000, task 000001005d6ee030)
Stack:
<ffffffff8011d5cc>{flat_send_IPI_mask+0}
<ffffffffa01d8595>{:gfs:gfs_lm_lock+50} <ffffffffa01cf683>{:gfs:gfs_glock_xmote_th+357}
<ffffffffa01cd985>{:gfs:run_queue+668} <ffffffffa01ce977>{:gfs:gfs_glock_nq+938}
<ffffffffa01ceb6f>{:gfs:gfs_glock_nq_m+416}<ffffffffa01e507d>{:gfs:gfs_link+149}
<ffffffff801813ac>{vfs_link+308} <ffffffff801814a1>{sys_link+158}
<ffffffff801101be>{system_call+126}
Code: 0f 0b d8 ce 20 a0 ff ff ff ff ac 01 48 c7 c7 dd ce 20 a0 31
RIP <ffffffffa0208956>{:lock_dlm:do_dlm_lock+365} RSP <000001000d317b88>
Modules linked in: nfs nfsd exportfs lockd nfs_acl parport_pc lp parport netconsole netdump i2c_dev i2c_core lock_dlm gfs lock_harness dlm cman ipv6 sunrpc ds yenta_socket pcmcia_core dm_mirror dm_multipath dm_mod button battery ac ohci_hcd hw_random shpchp tg3 qla2400 qla2xxx scsi_transport_fc cciss sd_mod scsi_mod
Pid: 1888, comm: rsync Not tainted 2.6.9-34smp
RIP: 0010:[<ffffffffa0208956>] <ffffffffa0208956>{:lock_dlm:do_dlm_lock+365}
RSP: 0018:000001000d317b88 EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 000000000003b690
RDX: 0000000000000246 RSI: 0000000000000246 RDI: ffffffff8038e160
RBP: 000001006f70b980 R08: 0000000000000004 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff80209c6f R12: 00000100f4d1fc00
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
FS: 0000002a955863a0(0000) GS:ffffffff804b6040(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a985a1008 CR3: 0000000000101000 CR4: 00000000000006e0
Call Trace:<ffffffffa0208956>{:lock_dlm:do_dlm_lock+365} <ffffffffa0208a92>{:lock_dlm:lm_dlm_lock+214}
<ffffffff8011d5cc>{flat_send_IPI_mask+0} <ffffffffa01d8595>{:gfs:gfs_lm_lock+50}
<ffffffffa01cf683>{:gfs:gfs_glock_xmote_th+357} <ffffffffa01cd985>{:gfs:run_queue+668}
<ffffffffa01ce977>{:gfs:gfs_glock_nq+938} <ffffffffa01ceb6f>{:gfs:gfs_glock_nq_m+416}
<ffffffffa01e507d>{:gfs:gfs_link+149} <ffffffff801813ac>{vfs_link+308}
<ffffffff801814a1>{sys_link+158} <ffffffff801101be>{system_call+126}
...
...
lock_dlm: Assertion failed on line 428 of file /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 4438575995
gfs1: num=2,23fc019f err=-22 cur=3 req=5 lkf=44
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:428
invalid operand: 0000 [1]
parport_pc i2c_dev i2c_core cman pcmcia_core dm_mirror battery tg3 sd_mod
Pid: 5463, comm: rm Not tainted 2.6.9-34smp
<ffffffffa0208956>{:lock_dlm:do_dlm_lock+365}
RSP: 0018:000001001a6dfc18 EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 0000000000006f46
RBP: 000001009cdb34c0 R08: 0000000000000004 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff80209c6f R12: 0000010037c61200
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
FS: 0000002a95585b00(0000) GS:ffffffff804b60c0(0000) knlGS:00000000f7fce6c0
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a9759afe8 CR3: 00000000f5798000 CR4: 00000000000006e0
Process rm (pid: 5463, threadinfo 000001001a6de000, task 00000100d45b4030)
Stack: 2020202020202020
0000000000000003
0000000000000001 0000000000000000
Call Trace:<ffffffffa0208a92>{:lock_dlm:lm_dlm_lock+214} <ffffffffa01d8595>{:gfs:gfs_lm_lock+50}
<ffffffffa01cf683>{:gfs:gfs_glock_xmote_th+357} <ffffffffa01cd985>{:gfs:run_queue+668}
<ffffffffa01ce977>{:gfs:gfs_glock_nq+938} <ffffffffa01ceb6f>{:gfs:gfs_glock_nq_m+416}
<ffffffffa01e53d5>{:gfs:gfs_unlink+133} <ffffffff80180ec0>{vfs_unlink+439}
<ffffffff80180fc5>{sys_unlink+185} <ffffffff80183707>{sys_getdents64+166}
<ffffffff801101be>{system_call+126}
Code: 0f 0b d8 ce 20 a0 ff ff ff ff ac 01 48 c7 c7 dd ce 20 a0 31
RIP <ffffffffa0208956>{:lock_dlm:do_dlm_lock+365} RSP <000001001a6dfc18>
Modules linked in: nfs nfsd exportfs lockd nfs_acl parport_pc lp parport netconsole netdump i2c_dev i2c_core lock_dlm gfs lock_harness dlm cman ipv6 sunrpc ds yenta_socket pcmcia_core dm_mirror dm_multipath dm_mod button battery ac ohci_hcd hw_random shpchp tg3 qla2400 qla2xxx scsi_transport_fc cciss sd_mod scsi_mod
Pid: 5463, comm: rm Not tainted 2.6.9-34smp
RIP: 0010:[<ffffffffa0208956>] <ffffffffa0208956>{:lock_dlm:do_dlm_lock+365}
RSP: 0018:000001001a6dfc18 EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 0000000000006f46
RDX: 0000000000000246 RSI: 0000000000000246 RDI: ffffffff8038e160
RBP: 000001009cdb34c0 R08: 0000000000000004 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff80209c6f R12: 0000010037c61200
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
FS: 0000002a95585b00(0000) GS:ffffffff804b60c0(0000) knlGS:00000000f7fce6c0
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a9759afe8 CR3: 00000000f5798000 CR4: 00000000000006e0
Call Trace:<ffffffffa0208956>{:lock_dlm:do_dlm_lock+365} <ffffffffa0208a92>{:lock_dlm:lm_dlm_lock+214}
<ffffffffa01d8595>{:gfs:gfs_lm_lock+50} <ffffffffa01cf683>{:gfs:gfs_glock_xmote_th+357}
<ffffffffa01cd985>{:gfs:run_queue+668} <ffffffffa01ce977>{:gfs:gfs_glock_nq+938}
<ffffffffa01ceb6f>{:gfs:gfs_glock_nq_m+416} <ffffffffa01e53d5>{:gfs:gfs_unlink+133}
<ffffffff80180ec0>{vfs_unlink+439} <ffffffff80180fc5>{sys_unlink+185}
<ffffffff80183707>{sys_getdents64+166} <ffffffff801101be>{system_call+126}
...
...
--- End Message ---
--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster