Re: 5.4.20 - high load - lots of incoming data - small data read.

Jeff Layton <jlayton@xxxxxxxxxx> · Mon, 06 Apr 2020 07:58:25 -0400

On Mon, 2020-04-06 at 12:23 +0200, Jesper Krogh wrote:
> It also emits this one - quite a bit later (not sure thats relevant)
> 
> [43364.349446] kworker/3:6: page allocation failure: order:0,
> mode:0xa20(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0

It failed to allocate a single page (order:0) to handle a receive. This
is a GFP_ATOMIC allocation which means that it's not allowed to sleep
(and thus can't do any active reclaim).

> [43364.349459] CPU: 3 PID: 23433 Comm: kworker/3:6 Not tainted 5.4.30 #5
> [43364.349460] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [43364.349515] Workqueue: ceph-msgr ceph_con_workfn [libceph]
> [43364.349516] Call Trace:
> [43364.349523]  <IRQ>
> [43364.349594]  dump_stack+0x6d/0x95
> [43364.349614]  warn_alloc+0x10c/0x170
> [43364.349616]  __alloc_pages_slowpath+0xe6c/0xef0
> [43364.349634]  ? ip_local_deliver_finish+0x48/0x50
> [43364.349635]  ? ip_local_deliver+0x6f/0xe0
> [43364.349640]  ? tcp_v4_early_demux+0x11c/0x150
> [43364.349641]  __alloc_pages_nodemask+0x2f3/0x360
> [43364.349643]  page_frag_alloc+0x129/0x150
> [43364.349654]  __napi_alloc_skb+0x86/0xd0
> [43364.349670]  page_to_skb+0x67/0x350 [virtio_net]
> [43364.349672]  receive_buf+0xe47/0x16c0 [virtio_net]
> [43364.349675]  virtnet_poll+0xf2/0x364 [virtio_net]
> [43364.349679]  net_rx_action+0x265/0x3e0
> [43364.349691]  __do_softirq+0xf9/0x2aa
> [43364.349701]  irq_exit+0xae/0xb0
> [43364.349705]  do_IRQ+0x59/0xe0
> [43364.349706]  common_interrupt+0xf/0xf
> [43364.349707]  </IRQ>
> [43364.349713] RIP: 0010:kvm_clock_get_cycles+0xc/0x20
> [43364.349715] Code: c3 48 c1 e1 06 31 c0 48 81 c1 00 10 60 84 49 89
> 0c 10 eb be 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 65 48 8b 3d e4
> 8f 7a 7d <e8> ff 0f 00 00 5d c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00
> 55 48
> [43364.349716] RSP: 0018:ffffc20941f4bab8 EFLAGS: 00000246 ORIG_RAX:
> ffffffffffffffdb
> [43364.349717] RAX: ffffffff8286e040 RBX: 0000277062955ccd RCX: 000000000000056a
> [43364.349735] RDX: 0000000000000000 RSI: 0000000000001000 RDI: ffffffff846010c0
> [43364.349750] RBP: ffffc20941f4bab8 R08: 0000000000008f3e R09: 000000000000097a
> [43364.349751] R10: 0000000000000000 R11: 0000000000001000 R12: 0000000000000000
> [43364.349751] R13: 00000000004a63a2 R14: ffff9c1ac83665b4 R15: 0000000000000000
> [43364.349753]  ? kvmclock_setup_percpu+0x80/0x80
> [43364.349760]  ktime_get+0x3e/0xa0
> [43364.349763]  tcp_mstamp_refresh+0x12/0x40
> [43364.349764]  tcp_rcv_space_adjust+0x22/0x1d0
> [43364.349766]  tcp_recvmsg+0x28b/0xbc0
> [43364.349777]  ? aa_sk_perm+0x43/0x190
> [43364.349781]  inet_recvmsg+0x64/0xf0
> [43364.349786]  sock_recvmsg+0x66/0x70
> [43364.349791]  ceph_tcp_recvpage+0x79/0xb0 [libceph]
> [43364.349796]  read_partial_message+0x3c3/0x7c0 [libceph]
> [43364.349801]  ceph_con_workfn+0xa6a/0x23d0 [libceph]
> [43364.349809]  process_one_work+0x167/0x400
> [43364.349810]  worker_thread+0x4d/0x460
> [43364.349814]  kthread+0x105/0x140
> [43364.349815]  ? rescuer_thread+0x370/0x370
> [43364.349816]  ? kthread_destroy_worker+0x50/0x50
> [43364.349817]  ret_from_fork+0x35/0x40
> [43364.349845] Mem-Info:
> [43364.349849] active_anon:10675 inactive_anon:12484 isolated_anon:0
>                 active_file:14532 inactive_file:3900471 isolated_file:8
>                 unevictable:0 dirty:1 writeback:0 unstable:0
>                 slab_reclaimable:68339 slab_unreclaimable:36220
>                 mapped:11089 shmem:608 pagetables:1093 bounce:0
>                 free:25551 free_pcp:3443 free_cma:0
> [43364.349851] Node 0 active_anon:42700kB inactive_anon:49936kB
> active_file:58128kB inactive_file:15601884kB unevictable:0kB
> isolated(anon):0kB isolated(file):32kB mapped:44356kB dirty:4kB
> writeback:0kB shmem:2432kB shmem_thp: 0kB shmem_pmdmapped: 0kB
> anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> [43364.349852] Node 0 DMA free:15908kB min:64kB low:80kB high:96kB
> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
> unevictable:0kB writepending:0kB present:15992kB managed:15908kB
> mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB
> local_pcp:0kB free_cma:0kB
> [43364.349857] lowmem_reserve[]: 0 3444 15930 15930 15930
> [43364.349859] Node 0 DMA32 free:55240kB min:14596kB low:18244kB
> high:21892kB active_anon:1556kB inactive_anon:188kB active_file:48kB
> inactive_file:3430152kB unevictable:0kB writepending:0kB
> present:3653608kB managed:3588072kB mlocked:0kB kernel_stack:0kB
> pagetables:0kB bounce:0kB free_pcp:3568kB local_pcp:248kB free_cma:0kB
> [43364.349861] lowmem_reserve[]: 0 0 12485 12485 12485
> [43364.349862] Node 0 Normal free:31056kB min:83636kB low:96864kB
> high:110092kB active_anon:41144kB inactive_anon:49748kB
> active_file:58080kB inactive_file:12171732kB unevictable:0kB
> writepending:4kB present:13107200kB managed:12793540kB mlocked:0kB
> kernel_stack:7472kB pagetables:4372kB bounce:0kB free_pcp:10204kB
> local_pcp:1456kB free_cma:0kB
> [43364.349864] lowmem_reserve[]: 0 0 0 0 0
> [43364.349865] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB
> (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB
> (M) = 15908kB
> [43364.349877] Node 0 DMA32: 22*4kB (EH) 18*8kB (UMEH) 0*16kB 13*32kB
> (UME) 17*64kB (UM) 18*128kB (UMEH) 14*256kB (UEH) 5*512kB (UE)
> 24*1024kB (UM) 10*2048kB (M) 0*4096kB = 55240kB
> [43364.349894] Node 0 Normal: 94*4kB (UMEH) 47*8kB (MEH) 60*16kB
> (UMEH) 38*32kB (UMEH) 30*64kB (MEH) 21*128kB (MEH) 37*256kB (UM)
> 28*512kB (M) 0*1024kB 0*2048kB 0*4096kB = 31344kB

Here though, it looks like there is quite a bit of free memory, so I'm
not sure why the allocation failed. It's possible that at the time that
the allocation failed there was nothing free, but just afterward quite a
bit of memory did become free (before the kernel could print out the
summary).

It may also be indicative of something else being wrong. If this
happened much later, then it may just be fallout from whatever the
original problem was.

> [43364.349920] Node 0 hugepages_total=0 hugepages_free=0
> hugepages_surp=0 hugepages_size=1048576kB
> [43364.349927] Node 0 hugepages_total=0 hugepages_free=0
> hugepages_surp=0 hugepages_size=2048kB
> [43364.349928] 3915698 total pagecache pages
> [43364.349930] 79 pages in swap cache
> [43364.349931] Swap cache stats: add 393, delete 314, find 38/80
> [43364.349932] Free swap  = 4036080kB
> [43364.349932] Total swap = 4038652kB
> [43364.349933] 4194200 pages RAM
> [43364.349933] 0 pages HighMem/MovableOnly
> [43364.349934] 94820 pages reserved
> [43364.349934] 0 pages cma reserved
> [43364.349934] 0 pages hwpoisoned
> 
> On Mon, Apr 6, 2020 at 11:18 AM Jesper Krogh <jesper.krogh@xxxxxxxxx> wrote:
> > And it looks like I can reproduce this pattern - and when it has been
> > stuck for "sufficient" amount of time - then it gets blacklisted by
> > the MDS/OSD and (a related) issue is that I cannot get the mountpoint
> > back without a reboot on the kernel-client side.
> > 
> > On Mon, Apr 6, 2020 at 11:09 AM Jesper Krogh <jesper.krogh@xxxxxxxxx> wrote:
> > > home-directory style - median is about 3KB - but varies greatly.
> > > 
> > > I also get this:
> > > [41204.865818] INFO: task kworker/u16:102:21903 blocked for more than
> > > 120 seconds.
> > > [41204.865955]       Not tainted 5.4.30 #5
> > > [41204.866006] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [41204.866056] kworker/u16:102 D    0 21903      2 0x80004000
> > > [41204.866119] Workqueue: ceph-inode ceph_inode_work [ceph]
> > > [41204.866120] Call Trace:
> > > [41204.866156]  __schedule+0x45f/0x710
> > > [41204.866162]  ? xas_store+0x391/0x5f0
> > > [41204.866164]  schedule+0x3e/0xa0
> > > [41204.866166]  io_schedule+0x16/0x40
> > > [41204.866180]  __lock_page+0x12a/0x1d0
> > > [41204.866182]  ? file_fdatawait_range+0x30/0x30
> > > [41204.866187]  truncate_inode_pages_range+0x52c/0x980
> > > [41204.866191]  ? syscall_return_via_sysret+0x12/0x7f
> > > [41204.866197]  ? drop_inode_snap_realm+0x98/0xa0 [ceph]
> > > [41204.866207]  ? fsnotify_grab_connector+0x4d/0x90
> > > [41204.866209]  truncate_inode_pages_final+0x4c/0x60
> > > [41204.866214]  ceph_evict_inode+0x2d/0x210 [ceph]
> > > [41204.866219]  evict+0xca/0x1a0
> > > [41204.866221]  iput+0x1ba/0x210
> > > [41204.866225]  ceph_inode_work+0x40/0x270 [ceph]
> > > [41204.866232]  process_one_work+0x167/0x400
> > > [41204.866233]  worker_thread+0x4d/0x460
> > > [41204.866236]  kthread+0x105/0x140
> > > [41204.866237]  ? rescuer_thread+0x370/0x370
> > > [41204.866239]  ? kthread_destroy_worker+0x50/0x50
> > > [41204.866240]  ret_from_fork+0x35/0x40
> > > 
> > > On Mon, Apr 6, 2020 at 10:53 AM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> > > > On Mon, Apr 6, 2020 at 4:06 PM Jesper Krogh <jesper.krogh@xxxxxxxxx> wrote:
> > > > > This is a CephFS client - its only purpose is to run the "filedaemon" of bacula
> > > > > and transport data to the tape-library from CephFS - below 2 threads is
> > > > > essentially doing something equivalent to
> > > > > 
> > > > > find /cephfs/ -type f | xargs cat | nc server
> > > > > 
> > > > > 2 threads only, load exploding and the "net read vs net write" has
> > > > > more than 100x difference.
> > > > > 
> > > > > Can anyone explain this as "normal" behaviour?
> > > > > Server is a  VM with 16 "vCPU" and 16GB memory running libvirt/qemu
> > > > > 
> > > > > jk@wombat:~$ w
> > > > >  07:50:33 up 11:25,  1 user,  load average: 206.43, 76.23, 50.58
> > > > > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > > > > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > > > > jk@wombat:~$ dstat -ar
> > > > > --total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system-- --io/total-
> > > > > usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw | read  writ
> > > > >   0   0  98   1   0|  14k   34k|   0     0 |   3B   27B| 481   294 |0.55  0.73
> > > > >   1   1   0  98   0|   0     0 |  60M  220k|   0     0 |6402  6182 |   0     0
> > > > >   0   1   0  98   0|   0     0 |  69M  255k|   0     0 |7305  4339 |   0     0
> > > > >   1   2   0  98   0|   0     0 |  76M  282k|   0     0 |7914  4886 |   0     0
> > > > >   1   1   0  99   0|   0     0 |  70M  260k|   0     0 |7293  4444 |   0     0
> > > > >   1   1   0  98   0|   0     0 |  80M  278k|   0     0 |8018  4931 |   0     0
> > > > >   0   1   0  98   0|   0     0 |  60M  221k|   0     0 |6435  5951 |   0     0
> > > > >   0   1   0  99   0|   0     0 |  59M  211k|   0     0 |6163  3584 |   0     0
> > > > >   0   1   0  98   0|   0     0 |  64M  323k|   0     0 |6653  3881 |   0     0
> > > > >   1   0   0  99   0|   0     0 |  61M  243k|   0     0 |6822  4401 |   0     0
> > > > >   0   1   0  99   0|   0     0 |  55M  205k|   0     0 |5975  3518 |   0     0
> > > > >   1   1   0  98   0|   0     0 |  68M  242k|   0     0 |7094  6544 |   0     0
> > > > >   0   1   0  99   0|   0     0 |  58M  230k|   0     0 |6639  4178 |   0     0
> > > > >   1   2   0  98   0|   0     0 |  61M  243k|   0     0 |7117  4477 |   0     0
> > > > >   0   1   0  99   0|   0     0 |  61M  228k|   0     0 |6500  4078 |   0     0
> > > > >   0   1   0  99   0|   0     0 |  65M  234k|   0     0 |6595  3914 |   0     0
> > > > >   0   1   0  98   0|   0     0 |  64M  219k|   0     0 |6507  5755 |   0     0
> > > > >   1   1   0  99   0|   0     0 |  64M  233k|   0     0 |6869  4153 |   0     0
> > > > >   1   2   0  98   0|   0     0 |  63M  232k|   0     0 |6632  3907 |
> > > > > 0     0 ^C
> > > > > jk@wombat:~$ w
> > > > >  07:50:56 up 11:25,  1 user,  load average: 221.35, 88.07, 55.02
> > > > > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > > > > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > > > > jk@wombat:~$
> > > > > 
> > > > > Thanks.
> > > > 
> > > > how small these files are?

-- 
Jeff Layton <jlayton@xxxxxxxxxx>