Re: debugging hang on headless server


On Wed, 05/07/2017 at 08:26 +0200, Alvaro Miranda wrote:
> Hi
> 
> So access to the server is required. If there are other servers in
> the network, netconsole will send the messages that get printed to the
> console to the remote box, following the same idea as syslog.
> 
> Syslog covers the logs that go to the normal log files.
> 
> kdump will be able to generate a memory dump and some logs if there
> is a crash.
> 
> With that, you should be able to have all the information outside
> this server when there is a hang.
> 
> Other tools that can be used are sar/sysstat: you can copy their
> files out for inspection to get an idea of performance and of what
> was doing what before a hang.
> 
> good luck
> alvaro
> 
> > On 5 Jul 2017, at 03:54, Alexandr <sss@xxxxxxxxxxxxxxx> wrote:
> > 
> > On Tue, 04/07/2017 at 07:48 -0500, Billy Crook wrote:
> > > You can set up syslog to forward to an external system. That helps,
> > > but it can sometimes miss things if the crash is severe enough.
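A quick way to check that forwarding actually works is to send one message to the local syslog socket and look for it on the collector. The sketch below uses Python's standard `logging.handlers.SysLogHandler`; the default address `/dev/log` assumes a Linux system whose syslog daemon (or journald) listens there, and the function name and message text are illustrative. Passing a `(host, port)` tuple instead sends the message straight to a remote collector over UDP, bypassing the local daemon.

```python
import logging
import logging.handlers
import socket


def send_test_message(address="/dev/log", text="syslog forwarding test"):
    """Emit one user.info syslog message.

    address: a Unix socket path for the local syslog daemon (assumed to be
    /dev/log), or a (host, port) tuple to send directly to a collector via UDP.
    """
    handler = logging.handlers.SysLogHandler(address=address)
    logger = logging.getLogger("syslog-forwarding-test")
    logger.setLevel(logging.INFO)
    logger.propagate = False          # don't duplicate into other handlers
    logger.addHandler(handler)
    try:
        logger.info("%s from %s", text, socket.gethostname())
    finally:
        logger.removeHandler(handler)
        handler.close()
```

If the message shows up on the remote box, forwarding is configured; it can still miss output from a severe crash, as described above.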
> > > 
> > > A serial console can capture much more if there's another server
> > > nearby that you can connect to and leave minicom logging. Even
> > > better if you have a nice physical "serial console server" like an
> > > OpenGear.
> > > 
> > > Most server motherboards these days contain some sort of management
> > > chip implementing IPMI and additional remote access features, which
> > > often include remote KVM over IP.
> > > 
> > > Even the most basic IPMI controllers include serial over LAN, which
> > > lets you use a serial console without a nearby server to connect a
> > > physical serial cable to.
> > > 
> > > If the hang is caused by a hardware problem, you may find an
> > > indication of it in the IPMI system event list: 'ipmitool sel list'
> > > 
> > > There's also a kernel module, netconsole, that forwards kernel
> > > messages off host in a syslog-compatible format. It implements IP
> > > and UDP in its own code, so it works independently of the system's
> > > TCP stack, which is important during some crashes that might
> > > otherwise break networking before the messages get transmitted.
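On the receiving machine, any UDP listener will do (netcat or a syslog daemon, for example). A minimal Python sketch that appends every received line to a file, prefixed with the sender's address, could look like this; port 6666 is netconsole's default target port, and the function names are illustrative.

```python
import socket


def open_listener(port=6666, host="0.0.0.0"):
    """Bind a UDP socket; 6666 is netconsole's default target port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive(sock, out, max_packets=None):
    """Write each received line to `out` as '<sender-ip> <line>'.

    Runs until max_packets datagrams have been handled (forever if None).
    Returns the number of datagrams processed.
    """
    count = 0
    while max_packets is None or count < max_packets:
        data, (sender, _port) = sock.recvfrom(65535)
        for line in data.decode("utf-8", errors="replace").splitlines():
            out.write(f"{sender} {line}\n")
        out.flush()                   # keep the file current in case this box dies too
        count += 1
    return count
```

For example, `receive(open_listener(), open("netconsole.log", "a"))` runs until interrupted; prefixing the sender lets several servers share one listener.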
> > > 
> > > That's just a few ways. There are probably others. Hope it helps.
> > > 
> > > > On Jul 4, 2017 07:32, "Alexandr" <sss@xxxxxxxxxxxxxxx> wrote:
> > > > 
> > > > Good day to all.
> > > > Please suggest methods for debugging system hangs on a headless
> > > > server.
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-admin" in
> > > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > 
> > 
> > Thanks for the reply.
> > It's a self-built server on regular desktop hardware, so it has no
> > IPMI. Physical access to it is possible, so I guess a serial console
> > is my solution.
> > I also have a few questions about serial console setup. As I
> > understand it, the simplest way to achieve it is the default serial
> > COM port; I have also read something about serial over USB (it looks
> > like it requires some sort of special adapter).
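From the capturing side either works: a USB-to-serial adapter typically appears as `/dev/ttyUSB0` instead of `/dev/ttyS0`, and two machines' serial ports are normally joined with a null-modem cable. The server being debugged needs its kernel console on its serial port, usually via kernel parameters such as `console=tty0 console=ttyS0,115200n8`. On the other machine, minicom with logging enabled is enough; as a stdlib-only alternative, the Python sketch below puts the port into raw 115200 8N1 mode and appends timestamped lines to a file. The baud rate, device paths, and function names are assumptions; the speed must match the server's `console=` setting.

```python
import os
import termios
import time
import tty


def configure_raw(fd, baud=termios.B115200):
    """Put a tty into raw 8N1 mode at `baud` (115200 assumed here)."""
    tty.setraw(fd)                    # no echo/line editing, 8 data bits, no parity
    attrs = termios.tcgetattr(fd)
    attrs[2] &= ~termios.CSTOPB       # cflag: one stop bit
    attrs[2] |= termios.CLOCAL | termios.CREAD  # ignore modem lines, enable receiver
    attrs[4] = attrs[5] = baud        # input and output speed
    termios.tcsetattr(fd, termios.TCSANOW, attrs)


def capture(fd, out, max_lines=None):
    """Copy lines from the port to `out`, each prefixed with a local timestamp.

    Runs until EOF, or until max_lines lines are written if that is set.
    Returns the number of lines written.
    """
    pending = b""
    written = 0
    while max_lines is None or written < max_lines:
        chunk = os.read(fd, 4096)
        if not chunk:
            break
        pending += chunk
        while b"\n" in pending and (max_lines is None or written < max_lines):
            line, pending = pending.split(b"\n", 1)
            text = line.rstrip(b"\r").decode("utf-8", errors="replace")
            out.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {text}\n")
            out.flush()               # keep the log current in case this box dies too
            written += 1
    return written
```

To use it, open the port with `os.open("/dev/ttyS0", os.O_RDONLY | os.O_NOCTTY | os.O_NONBLOCK)` so the open does not wait for a carrier signal, call `configure_raw()` on the descriptor, switch it back with `os.set_blocking(fd, True)`, and pass it to `capture()` together with a file opened in append mode.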
> > 
> > Netconsole also looks interesting; I will set it up right now.
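The netconsole module takes a single `netconsole=` parameter whose format, per the kernel's netconsole documentation, is `[+][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]`; every field except the target IP may be left empty to fall back to its default. The small Python helper below is hypothetical (not part of any tool) and only assembles that string so the pieces are easy to see.

```python
def netconsole_param(tgt_ip, tgt_port=None, dev=None, src_ip=None,
                     src_port=None, tgt_mac=None, extended=False):
    """Build the value for the netconsole module/kernel parameter.

    Empty fields use the kernel defaults documented for netconsole:
    source port 6665, target port 6666, interface eth0, the interface's
    own address as source, and broadcast as the target MAC.
    """
    def opt(value):
        return "" if value is None else str(value)

    source = f"{'+' if extended else ''}{opt(src_port)}@{opt(src_ip)}/{opt(dev)}"
    target = f"{opt(tgt_port)}@{tgt_ip}/{opt(tgt_mac)}"
    return f"netconsole={source},{target}"
```

The result can be passed as `modprobe netconsole <result>` or put on the kernel command line if netconsole is built in. Since the default interface is `eth0`, name the interface explicitly if yours is called something else. Raising the console log level (for example with `dmesg -n 8`) makes sure lower-priority kernel messages are sent as well.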
I have caught another hang, but now, with netconsole, I have a bit more
information. I still do not understand why it hangs.

[64051.691781] cleanupd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null),  order=0, oom_score_adj=0
[64051.693087] cleanupd cpuset=/ mems_allowed=0
[64051.694287] CPU: 2 PID: 1090 Comm: cleanupd Tainted: G        W       4.11.8 #5
[64051.695471] Hardware name: System manufacturer System Product Name/M4A77TD, BIOS 2104    06/28/2010
[64051.696647] Call Trace:
[64051.697802]  ? dump_stack+0x46/0x6d
[64051.698885]  ? dump_header+0x8f/0x1ff
[64051.700002]  ? mem_cgroup_scan_tasks+0x9b/0xc0
[64051.701006]  ? oom_kill_process+0x217/0x400
[64051.701972]  ? out_of_memory+0xf0/0x280
[64051.702900]  ? mem_cgroup_out_of_memory+0x36/0x60
[64051.703800]  ? mem_cgroup_oom_synchronize+0x2e3/0x300
[64051.704665]  ? __mem_cgroup_insert_exceeded+0x80/0x80
[64051.705499]  ? pagefault_out_of_memory+0x1c/0x60
[64051.706301]  ? do_page_fault+0x1b/0x60
[64051.707083]  ? do_syscall_64+0x5e/0x180
[64051.707839]  ? __context_tracking_exit+0x8/0x20
[64051.708572]  ? page_fault+0x22/0x30
[64051.709334] Task in /system.slice/smbd.service killed as a result of limit of /system.slice/smbd.service
[64051.710134] memory: usage 524288kB, limit 524288kB, failcnt 186142
[64051.710937] memory+swap: usage 528164kB, limit 9007199254740988kB, failcnt 0
[64051.711753] kmem: usage 7908kB, limit 9007199254740988kB, failcnt 0
[64051.712594] Memory cgroup stats for /system.slice/smbd.service: cache:516380KB rss:0KB rss_huge:0KB mapped_file:0KB dirty:515544KB writeback:0KB swap:3876KB inactive_anon:0KB active_anon:0KB inactive_file:258164KB active_file:258088KB unevictable:0KB
[64051.714413] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[64051.715447] [ 1088]     0  1088    62115      699     121       3      484             0 smbd
[64051.716431] [ 1089]     0  1089    60375      260     116       3      458             0 smbd-notifyd
[64051.717387] [ 1090]     0  1090    60379        0     114       3      458             0 cleanupd
[64051.718372] [25466]  1000 25466    71519      600     128       3      629             0 smbd
[64051.719351] Memory cgroup out of memory: Kill process 25466 (smbd) score 0 or sacrifice child
[64051.720330] Killed process 25466 (smbd) total-vm:286076kB, anon-rss:0kB, file-rss:2400kB, shmem-rss:0kB
[64051.726998] oom_reaper: reaped process 25466 (smbd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[64261.584682] ------------[ cut here ]------------
[64261.588370] WARNING: CPU: 0 PID: 1033 at net/ipv4/tcp_input.c:2820 tcp_fastretrans_alert+0x8db/0xac0
[64261.592190] Modules linked in: algif_skcipher af_alg netconsole veth
ccm sch_pie sctp act_mirred ifb sch_ingress nf_conntrack_netlink
nfnetlink cls_u32 sch_sfq sch_htb sit tunnel4 ip_tunnel
nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6table_raw
ip6_tables iptable_raw ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_nat
iptable_nat nf_nat_ipv4 nf_nat xt_TCPMSS xt_sctp ipt_REJECT
nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_recent
xt_conntrack nf_conntrack xt_multiport iptable_filter iptable_mangle
ip_tables x_tables radeon ath9k led_class ath9k_common ath9k_hw
i2c_algo_bit ttm drm_kms_helper mac80211 sch_fq_codel ath cfbfillrect
cfg80211 syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops
cfbcopyarea drm br_netfilter bridge snd_hda_codec_via
snd_hda_codec_generic snd_hda_intel snd_hda_codec
[64261.622008]  stp backlight llc rfkill r8169 xhci_pci snd_usb_audio
snd_hwdep snd_usbmidi_lib snd_hda_core snd_rawmidi ohci_pci xhci_hcd
snd_seq_device ohci_hcd parport_pc vhost_net snd_pcm i2c_piix4
snd_timer tun mii btrfs acpi_cpufreq button asus_atk0110 snd soundcore
vhost processor tap xor kvm_amd kvm irqbypass gspca_zc3xx gspca_main
v4l2_common k10temp hwmon raid6_pq uvcvideo videobuf2_vmalloc
videobuf2_memops videobuf2_v4l2 videodev videobuf2_core i2c_core
parport fbcon nfsd auth_rpcgss oid_registry nfs_acl bitblit lockd
softcursor fb grace sunrpc fbdev font ipv6 autofs4
[64261.647404] CPU: 0 PID: 1033 Comm: ml1 Tainted: G        W       4.11.8 #5
[64261.652695] Hardware name: System manufacturer System Product Name/M4A77TD, BIOS 2104    06/28/2010
[64261.657986] Call Trace:
[64261.663153]  <IRQ>
[64261.668267]  ? dump_stack+0x46/0x6d
[64261.673321]  ? __warn+0xb4/0xe0
[64261.678290]  ? tcp_fastretrans_alert+0x8db/0xac0
[64261.683298]  ? tcp_ack+0xd58/0x1300
[64261.688302]  ? tcp_rcv_established+0xf4/0x6a0
[64261.693313]  ? tcp_v4_do_rcv+0x115/0x200
[64261.698314]  ? tcp_v4_rcv+0xaef/0xb80
[64261.703302]  ? nf_nat_ipv4_fn+0x53/0x1a0 [nf_nat_ipv4]
[64261.708324]  ? ip_local_deliver_finish+0x87/0x1e0
[64261.713352]  ? ip_local_deliver+0x3d/0xc0
[64261.718349]  ? inet_del_offload+0x40/0x40
[64261.723318]  ? ip_rcv+0x281/0x380
[64261.728270]  ? ip_local_deliver_finish+0x1e0/0x1e0
[64261.733261]  ? __netif_receive_skb_core+0x49b/0x9c0
[64261.738259]  ? netif_receive_skb_internal+0x1a/0x80
[64261.743261]  ? ifb_ri_tasklet+0x16a/0x240 [ifb]
[64261.748188]  ? tasklet_action+0x8c/0xa0
[64261.753022]  ? __do_softirq+0xd4/0x200
[64261.757817]  ? irq_exit+0xe7/0x100
[64261.762552]  ? do_IRQ+0x45/0xc0
[64261.767163]  ? common_interrupt+0x89/0x89
[64261.771654]  </IRQ>
[64261.776269] ---[ end trace fa5d523840b633e2 ]---
[64552.218363] systemd[1]: systemd-journald.service: State 'stop-sigabrt' timed out. Terminating.

So what happened here?
If I understand correctly, Samba started using more memory than allowed
(>512 MB) and was killed by the kernel's memory cgroup OOM killer for
exceeding the limit on /system.slice/smbd.service (that should be OK).
But after this we get

[64552.218363] systemd[1]: systemd-journald.service: State 'stop-sigabrt' timed out. Terminating.

and a hung system.

The "WARNING: CPU: 0 PID: 1033 at net/ipv4/tcp_input.c:2820
tcp_fastretrans_alert+0x8db/0xac0" is an upstream regression which I
think is unrelated.

Any suggestions? Does this look like a systemd problem?
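The cgroup numbers in the log can be checked with a little arithmetic. The Python sketch below parses the "Memory cgroup stats" line (copied from the log above) and shows that almost all of the 512 MB charged to smbd.service was page cache, nearly all of it dirty, while rss was 0; cache plus the 7908 kB of kmem adds up exactly to the 524288 kB limit. One possible reading, still to be verified, is that the limit was filled by file data that had been written but not yet flushed to disk rather than by Samba's own process memory. The helper names are illustrative.

```python
import re

# "Memory cgroup stats" fields from the log above (values in kB).
STATS_LINE = ("cache:516380KB rss:0KB rss_huge:0KB mapped_file:0KB "
              "dirty:515544KB writeback:0KB swap:3876KB inactive_anon:0KB "
              "active_anon:0KB inactive_file:258164KB active_file:258088KB "
              "unevictable:0KB")


def parse_cgroup_stats(line):
    """Turn 'name:123KB name2:0KB ...' into {'name': 123, ...} (kB)."""
    return {name: int(value) for name, value in re.findall(r"(\w+):(\d+)KB", line)}


def summarize(stats, limit_kb, kmem_kb):
    """Relate the cgroup's page cache, dirty pages and rss to its limit."""
    return {
        "limit_mb": limit_kb / 1024,
        "cache_plus_kmem_kb": stats["cache"] + kmem_kb,
        "cache_pct_of_limit": 100 * stats["cache"] / limit_kb,
        "dirty_pct_of_cache": 100 * stats["dirty"] / stats["cache"],
        "rss_kb": stats["rss"],
    }


if __name__ == "__main__":
    summary = summarize(parse_cgroup_stats(STATS_LINE), limit_kb=524288, kmem_kb=7908)
    for key, value in summary.items():
        print(f"{key}: {value:.2f}" if isinstance(value, float) else f"{key}: {value}")
```

Run as a script, it reports a 512.00 MB limit, page cache at about 98.5% of that limit, and dirty pages at about 99.8% of the cache.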