On 9/3/20 1:58 PM, Veronika Kabatova wrote:
>
>
> ----- Original Message -----
>> From: "Rachel Sibley" <rasibley@xxxxxxxxxx>
>> To: "Jens Axboe" <axboe@xxxxxxxxx>, "CKI Project" <cki-project@xxxxxxxxxx>, linux-block@xxxxxxxxxxxxxxx
>> Cc: "Changhui Zhong" <czhong@xxxxxxxxxx>
>> Sent: Thursday, September 3, 2020 8:59:48 PM
>> Subject: Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
>>
>>
>>
>> On 9/3/20 1:46 PM, Jens Axboe wrote:
>>> On 9/3/20 11:10 AM, Rachel Sibley wrote:
>>>>
>>>> On 9/3/20 1:07 PM, CKI Project wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> We ran automated tests on a recent commit from this kernel tree:
>>>>>
>>>>>   Kernel repo: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git
>>>>>        Commit: 020ad0333b03 - Merge branch 'for-5.10/block' into for-next
>>>>>
>>>>> The results of these automated tests are provided below.
>>>>>
>>>>>   Overall result: FAILED (see details below)
>>>>>            Merge: OK
>>>>>          Compile: OK
>>>>>            Tests: PANICKED
>>>>>
>>>>> All kernel binaries, config files, and logs are available for download here:
>>>>>
>>>>>   https://cki-artifacts.s3.us-east-2.amazonaws.com/index.html?prefix=datawarehouse/2020/09/02/613166
>>>>>
>>>>> One or more kernel tests failed:
>>>>>
>>>>>   ppc64le:
>>>>>     💥 storage: software RAID testing
>>>>>
>>>>>   aarch64:
>>>>>     💥 storage: software RAID testing
>>>>>
>>>>>   x86_64:
>>>>>     💥 storage: software RAID testing
>>>>
>>>> Hello,
>>>>
>>>> We're seeing a panic on all non-s390x arches, triggered by the swraid
>>>> test. It seems to be reproducible in all subsequent pipelines after
>>>> this one, and we haven't yet seen it in mainline or in yesterday's
>>>> block tree results.
>>>>
>>>> Thank you,
>>>> Rachel
>>>>
>>>> https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/09/02/613166/build_aarch64_redhat%3A968098/tests/8757835_aarch64_3_console.log
>>>>
>>>> [ 8394.609219] Internal error: Oops: 96000004 [#1] SMP
>>>> [ 8394.614070] Modules linked in: raid0 loop raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx dm_log_writes dm_flakey rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rfkill sunrpc vfat fat xgene_hwmon xgene_enet at803x mdio_xgene xgene_rng xgene_edac mailbox_xgene_slimpro drm ip_tables xfs sdhci_of_arasan sdhci_pltfm i2c_xgene_slimpro crct10dif_ce sdhci gpio_dwapb cqhci xhci_plat_hcd gpio_xgene_sb gpio_keys aes_neon_bs
>>>> [ 8394.654298] CPU: 3 PID: 471427 Comm: kworker/3:2 Kdump: loaded Not tainted 5.9.0-rc3-020ad03.cki #1
>>>> [ 8394.663299] Hardware name: AppliedMicro X-Gene Mustang Board/X-Gene Mustang Board, BIOS 3.06.25 Oct 17 2016
>>>> [ 8394.672999] Workqueue: md_misc mddev_delayed_delete
>>>> [ 8394.677853] pstate: 40400085 (nZcv daIf +PAN -UAO BTYPE=--)
>>>> [ 8394.683399] pc : percpu_ref_exit+0x5c/0xc8
>>>> [ 8394.687473] lr : percpu_ref_exit+0x20/0xc8
>>>> [ 8394.691547] sp : ffff800019f33d00
>>>> [ 8394.694843] x29: ffff800019f33d00 x28: 0000000000000000
>>>> [ 8394.700129] x27: ffff0003c63ae000 x26: ffff8000120b6228
>>>> [ 8394.705414] x25: 0000000000000001 x24: ffff0003d8322a80
>>>> [ 8394.710698] x23: 0000000000000000 x22: 0000000000000000
>>>> [ 8394.715983] x21: 0000000000000000 x20: ffff8000121d2000
>>>> [ 8394.721266] x19: ffff0003d8322af0 x18: 0000000000000000
>>>> [ 8394.726550] x17: 0000000000000000 x16: 0000000000000000
>>>> [ 8394.731834] x15: 0000000000000007 x14: 0000000000000003
>>>> [ 8394.737119] x13: 0000000000000000 x12: ffff0003888a1978
>>>> [ 8394.742403] x11: ffff0003888a1918 x10: 0000000000000001
>>>> [ 8394.747688] x9 : 0000000000000000 x8 : 0000000000000000
>>>> [ 8394.752972] x7 : 0000000000000400 x6 : 0000000000000001
>>>> [ 8394.758257] x5 : ffff800010423030 x4 : ffff8000121d2e40
>>>> [ 8394.763540] x3 : 0000000000000000 x2 : 0000000000000000
>>>> [ 8394.768825] x1 : 0000000000000000 x0 : 0000000000000000
>>>> [ 8394.774110] Call trace:
>>>> [ 8394.776544]  percpu_ref_exit+0x5c/0xc8
>>>> [ 8394.780273]  md_free+0x64/0xa0
>>>> [ 8394.783311]  kobject_put+0x7c/0x218
>>>> [ 8394.786781]  mddev_delayed_delete+0x3c/0x50
>>>> [ 8394.790944]  process_one_work+0x1c4/0x450
>>>> [ 8394.794932]  worker_thread+0x164/0x4a8
>>>> [ 8394.798662]  kthread+0xf4/0x120
>>>> [ 8394.801787]  ret_from_fork+0x10/0x18
>>>> [ 8394.805344] Code: 2a0403e0 350002c0 a9400262 52800001 (f9400000)
>>>> [ 8394.811407] ---[ end trace 481cab6e1ad73da1 ]---
>>>
>>> Ming, I wonder if this is:
>>>
>>> commit d0c567d60f3730b97050347ea806e1ee06445c78
>>> Author: Ming Lei <ming.lei@xxxxxxxxxx>
>>> Date:   Wed Sep 2 20:26:42 2020 +0800
>>>
>>>     percpu_ref: reduce memory footprint of percpu_ref in fast path
>>>
>>> Rachel, any chance you can do a run with that commit reverted?
>>
>> Hi Jens, yes, we're working on it and will share our findings as soon
>> as the job finishes.
>>
>
> Hi Jens, we can confirm that there are no panics and the test passes
> with the patch reverted.
>
>
> We also realized that this patch is a likely cause of serious problems
> on ppc64le during LTP testing as well, specifically msgstress04. Both
> issues started occurring at the same time; we just didn't notice, as
> the test was crashing.
>
>
> [ 5682.999169] msgstress04 invoked oom-killer: gfp_mask=0x40cc0(GFP_KERNEL|__GFP_COMP), order=0, oom_score_adj=0
> [ 5682.999981] CPU: 1 PID: 170909 Comm: msgstress04 Kdump: loaded Not tainted 5.9.0-rc3-020ad03.cki #1
> [ 5683.000048] Call Trace:
> [ 5683.000098] [c00000023de972e0] [c000000000927e00] dump_stack+0xc4/0x114 (unreliable)
> [ 5683.000161] [c00000023de97330] [c000000000386958] dump_header+0x64/0x274
> [ 5683.000205] [c00000023de973c0] [c000000000385534] oom_kill_process+0x284/0x290
> [ 5683.000259] [c00000023de97400] [c0000000003862b0] out_of_memory+0x220/0x790
> [ 5683.000307] [c00000023de974a0] [c000000000408890] __alloc_pages_slowpath.constprop.0+0xd60/0xeb0
> [ 5683.000370] [c00000023de97670] [c000000000408d20] __alloc_pages_nodemask+0x340/0x400
> [ 5683.000426] [c00000023de97700] [c000000000434dec] alloc_pages_current+0xac/0x130
> [ 5683.000479] [c00000023de97750] [c000000000442fc4] allocate_slab+0x584/0x810
> [ 5683.000525] [c00000023de977c0] [c000000000447e7c] ___slab_alloc+0x44c/0xa30
> [ 5683.000571] [c00000023de978b0] [c000000000448494] __slab_alloc+0x34/0x60
> [ 5683.000615] [c00000023de978e0] [c000000000448b48] kmem_cache_alloc+0x688/0x700
> [ 5683.000671] [c00000023de97940] [c0000000003d9c80] __pud_alloc+0x70/0x1e0
> [ 5683.000717] [c00000023de97990] [c0000000003ddbb4] copy_page_range+0x1204/0x1490
> [ 5683.000779] [c00000023de97b20] [c00000000013b7c0] dup_mm+0x370/0x6e0
> [ 5683.000826] [c00000023de97bd0] [c00000000013ce10] copy_process+0xd20/0x1950
> [ 5683.000870] [c00000023de97c90] [c00000000013dc64] _do_fork+0xa4/0x560
> [ 5683.000915] [c00000023de97d00] [c00000000013e24c] __do_sys_clone+0x7c/0xa0
> [ 5683.000965] [c00000023de97dc0] [c00000000002f9a4] system_call_exception+0xe4/0x1c0
> [ 5683.001019] [c00000023de97e20] [c00000000000d140] system_call_common+0xf0/0x27c
>
> The test then manages to fill the console log with a good 4 GB of
> dumps... this is actually visible in the ppc64le console log from the
> linked artifacts (warning, it's a huge file!):
>
> https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/09/02/613166/build_ppc64le_redhat%3A968099/tests/8757368_ppc64le_3_console.log
>
>
> There are also more ppc64le traces in the other log (of a reasonable size):
> https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/09/02/613166/build_ppc64le_redhat%3A968099/tests/8757337_ppc64le_2_console.log

I'll revert this change for now.

-- 
Jens Axboe
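
For anyone trying to model the failure outside the kernel: the faulting
opcode in the aarch64 Code: line above, f9400000, decodes to
ldr x0, [x0], and the register dump shows x0 == 0, so percpu_ref_exit()
appears to be loading through a NULL pointer during md teardown. Below
is a minimal userspace sketch of one plausible reading, inferred from
the commit subject ("reduce memory footprint of percpu_ref in fast
path"): if most of the ref's state moves behind a pointer that only
init allocates, then exiting a zeroed, never-initialized ref stops
being harmless. All struct names and fields here are illustrative
stand-ins, not the kernel's actual definitions.

/* ref-model.c -- build with: cc -Wall -o ref-model ref-model.c */
#include <stdio.h>
#include <stdlib.h>

struct ref_data {                       /* stand-in for out-of-line ref state */
	unsigned long count;
};

struct ref {
	unsigned long percpu_count_ptr; /* fast-path word kept inline */
	struct ref_data *data;          /* everything else allocated by init */
};

static int ref_init(struct ref *r)
{
	r->data = calloc(1, sizeof(*r->data));
	if (!r->data)
		return -1;
	r->data->count = 1;
	return 0;
}

static void ref_exit(struct ref *r)
{
	/*
	 * The kernel's exit path is presumed to read through r->data
	 * unconditionally; this demo checks instead, so it reports the
	 * hazard rather than segfaulting like the aarch64 oops above.
	 */
	if (!r->data) {
		fprintf(stderr, "ref_exit() on never-initialized ref: "
			"the kernel analogue would oops on a NULL load\n");
		return;
	}
	free(r->data);
	r->data = NULL;
}

int main(void)
{
	struct ref r = { 0 };   /* zeroed container, ref_init() never ran, */
	ref_exit(&r);           /* yet teardown still calls exit, as md does */
	return 0;
}

Run as-is it just prints the warning; drop the NULL check in ref_exit()
and the same call sequence segfaults, which is the behavior change an
out-of-line allocation would introduce for objects torn down without
ever having been initialized. That would fit md_free() appearing in the
call trace, but it is a sketch of the suspected mechanism, not a
confirmed analysis of the patch.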