Lockd: failed to reclaim lock for pid ...

Dan Hyatt <dhyatt@xxxxxxxxxxxxxxxxx> · Tue, 18 Oct 2016 13:36:46 -0500

My environment is "heterogeneous": my authentication and home servers are 
currently stuck on a 1G shared network, while the production servers and 
storage servers are on a bonded 40G network. All are in the same VLAN. I 
have about 100 servers on the bonded 40G network, each with 12 cores and 
128GB of memory.

They are running CentOS 6.6.

Except for my storage servers, they are all just running large and small 
research jobs on a grid engine.

Two questions:

The errors she (the user described below) seems to spawn are:

lockd: spurious grace period reject?!
lockd: failed to reclaim lock for pid 8225 (errno -37, status 4)
lockd: spurious grace period reject?!
lockd: failed to reclaim lock for pid 8225 (errno -37, status 4)

and at some point we start getting errors that the file locks are 
stuck. You can still read and write the lockfile, but programs that 
depend on the C file-locking construct throw file-lock errors until we reboot.
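
To check whether locking itself is stuck on a given mount (as opposed to plain reads and writes), a small probe along these lines might help. This is only a sketch, assuming the affected programs use POSIX fcntl-style locks; `probe_lock` is a made-up helper name, and the path you pass would be some scratch file on the NFS-mounted home directory.

```python
import errno
import fcntl
import os


def probe_lock(path):
    """Try to take and release a non-blocking exclusive POSIX lock on path.

    Returns None if the lock was obtained and released, otherwise the
    symbolic errno name (for example 'ENOLCK' or 'EAGAIN').
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # LOCK_NB makes the call fail immediately instead of hanging.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except (IOError, OSError) as exc:
            return errno.errorcode.get(exc.errno, str(exc.errno))
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return None
    finally:
        os.close(fd)
```

Calling `probe_lock()` on a file under the NFS home directory and on a file under local /tmp, from the same node, should show whether the failure is specific to the NFS mount.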

Why are dmesg, /var/log/dmesg, and /var/log/messages different from each other?
I thought dmesg was a representation of /var/log/messages.

Is there a way to get a date stamp for dmesg entries? If a job failed in 
the last hour and the message is from yesterday, and I can't tell that, 
it doesn't help.
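
One possible approach to the date stamp question, sketched here under an assumption: the kernel has printk timestamps turned on, so each dmesg line starts with `[seconds.micros]` since boot (the dmesg output further down in this mail shows no such prefix, so that may not be the case on these nodes). The function names are invented for the example. The boot time is estimated from /proc/uptime, so the result is approximate.

```python
import datetime
import re

# Matches a printk timestamp prefix such as "[ 3600.500000] ".
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def read_uptime(path="/proc/uptime"):
    """Return seconds since boot (first field of /proc/uptime)."""
    with open(path) as handle:
        return float(handle.read().split()[0])


def boot_time(now, uptime_seconds):
    """Estimate the wall-clock boot time from the current time and uptime."""
    return now - datetime.timedelta(seconds=uptime_seconds)


def stamp_line(line, booted):
    """Replace a [seconds-since-boot] prefix with a wall-clock time.

    Lines without a timestamp prefix are returned unchanged.
    """
    match = STAMP.match(line)
    if not match:
        return line
    when = booted + datetime.timedelta(seconds=float(match.group(1)))
    return "%s %s" % (when.strftime("%Y-%m-%d %H:%M:%S"), match.group(2))
```

Usage would be roughly: compute `booted = boot_time(datetime.datetime.now(), read_uptime())` once, then pass each line of dmesg output through `stamp_line(line, booted)`.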

I think what I am troubleshooting is THAT user who REFUSES to follow 
directions, and who is sending thousands of very large jobs, each of 
which might immediately spawn another 10-20 jobs, to a grid of 100 
servers in a matter of seconds, overwhelming either the network, the 
home directory server, or the authentication server. When she strikes, 
users sometimes cannot get a response from LDAP or the home server for 
as long as 10 seconds. Thus she breaks NFS because it gets hammered, 
and I have to restart all the servers on my grid.

We have had problems with "out of memory errors" due to her programs in 
the recent past and had to restart all 100 servers.

*/var/log/messages gives this*

Oct 18 13:26:08 blade5-2-1 nslcd[2520]: [dd5cc5] ldap_result() failed: Can't contact LDAP server
Oct 18 13:26:08 blade5-2-1 nslcd[2520]: [dd5cc5] ldap_abandon() failed to abandon search: Other (e.g., implementation specific) error
Oct 18 13:27:14 blade5-2-1 nslcd[2520]: [e01acb] ldap_result() failed: Can't contact LDAP server
Oct 18 13:27:30 blade5-2-1 nslcd[2520]: [8c7a8f] ldap_result() failed: Can't contact LDAP server
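
To line these LDAP failures up against job submission times, counting them per minute might be enough. A rough sketch, assuming the syslog line format shown above with each entry on a single line (so run it against the log file itself, not the wrapped copy in this mail); the function name is invented for the example.

```python
import collections
import re

# Captures "Mon DD HH:MM" from nslcd lines reporting an unreachable LDAP server.
LINE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d):\d\d \S+ nslcd\[\d+\]: .*Can't contact LDAP server"
)


def ldap_failures_per_minute(lines):
    """Count "Can't contact LDAP server" messages from nslcd per minute."""
    counts = collections.Counter()
    for line in lines:
        match = LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the open log file (for example `ldap_failures_per_minute(open(path))`) gives a Counter keyed by minute, which can then be compared with the grid engine's submission log.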

*dmesg gives these*

lockd: server home not responding, still trying
lockd: server home OK

lockd: spurious grace period reject?!
lockd: failed to reclaim lock for pid 8225 (errno -37, status 4)
lockd: spurious grace period reject?!
lockd: failed to reclaim lock for pid 8225 (errno -37, status 4)
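
For reading the numbers in that message: as far as I can tell, errno 37 on Linux is ENOLCK ("No locks available"), and status 4 in the NLM protocol is the "denied, grace period" reply, which would match the "grace period reject" line printed next to it. The sketch below just decodes the line; the lookup tables are filled in by hand from those two assumptions and the function name is made up.

```python
import re

# Hand-filled tables: Linux errno 37 and the NLM (v1-3) status codes 0-4.
LINUX_ERRNO = {37: "ENOLCK"}
NLM_STATUS = {
    0: "LCK_GRANTED",
    1: "LCK_DENIED",
    2: "LCK_DENIED_NOLOCKS",
    3: "LCK_BLOCKED",
    4: "LCK_DENIED_GRACE_PERIOD",
}

RECLAIM = re.compile(
    r"failed to reclaim lock for pid (\d+) \(errno (-?\d+), status (\d+)\)"
)


def decode_reclaim_failure(line):
    """Return (pid, errno name, NLM status name), or None if no match."""
    match = RECLAIM.search(line)
    if not match:
        return None
    pid = int(match.group(1))
    err = abs(int(match.group(2)))  # the kernel logs errno as a negative number
    status = int(match.group(3))
    return (pid, LINUX_ERRNO.get(err, "unknown"), NLM_STATUS.get(status, "unknown"))
```

For the line above this yields pid 8225, ENOLCK, and LCK_DENIED_GRACE_PERIOD.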

*/var/log/dmesg gives this*

ipmi_si: probing via SMBIOS
ipmi_si: SMBIOS: io 0xca8 regsize 1 spacing 4 irq 10
ipmi_si: Adding SMBIOS-specified kcs state machine
ipmi_si: Trying SMBIOS-specified kcs state machine at i/o address 0xca8, slave address 0x20, irq 10
(NULL device *): The BMC does not support setting the recv irq bit, compensating, but the BMC needs to be fixed.
IRQ 10/ipmi_si: IRQF_DISABLED is not guaranteed on shared IRQs
ipmi_si ipmi_si.0: Using irq 10
ipmi_si ipmi_si.0: Found new BMC (man_id: 0x0002a2, prod_id: 0x0100, dev_id: 0x20)
ipmi_si ipmi_si.0: IPMI kcs interface initialized
ACPI: No handler for Region [SYSI] (ffff882029e57348) [IPMI]
power_meter ACPI000D:00: Found ACPI power meter.
ipmi device interface
EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts:
EXT4-fs (dm-2): mounted filesystem with ordered data mode. Opts:
EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts:
EXT4-fs (dm-4): mounted filesystem with ordered data mode. Opts:
EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts:
Adding 121724924k swap on /dev/mapper/vg_server-lv_swap. Priority:-1 extents:1 across:121724924
