> On Jun 11, 2019, at 8:41 AM, Mike Rapoport <rppt@xxxxxxxxxxxxx> wrote:
>
> On Tue, Jun 11, 2019 at 11:03:49AM +0100, Mark Rutland wrote:
>> On Mon, Jun 10, 2019 at 01:26:15PM -0400, Qian Cai wrote:
>>> On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote:
>>>> On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
>>>>> On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
>>>>>> The commit "arm64: switch to generic version of pte allocation"
>>>>>> introduced endless failures during boot like,
>>>>>>
>>>>>> kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
>>>>>> -2 parent: cgroup)
>>>>>>
>>>>>> It turns out __GFP_ACCOUNT is passed to kernel page table allocations
>>>>>> and then later memcg finds out those don't belong to any cgroup.
>>>>>
>>>>> Mike, I understood from [1] that this wasn't expected to be a problem,
>>>>> as the accounting should bypass kernel threads.
>>>>>
>>>>> Was that assumption wrong, or is something different happening here?
>>>>>
>>>>>>
>>>>>> backtrace:
>>>>>> kobject_add_internal
>>>>>> kobject_init_and_add
>>>>>> sysfs_slab_add+0x1a8
>>>>>> __kmem_cache_create
>>>>>> create_cache
>>>>>> memcg_create_kmem_cache
>>>>>> memcg_kmem_cache_create_func
>>>>>> process_one_work
>>>>>> worker_thread
>>>>>> kthread
>>>>>>
>>>>>> Signed-off-by: Qian Cai <cai@xxxxxx>
>>>>>> ---
>>>>>> arch/arm64/mm/pgd.c | 2 +-
>>>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
>>>>>> index 769516cb6677..53c48f5c8765 100644
>>>>>> --- a/arch/arm64/mm/pgd.c
>>>>>> +++ b/arch/arm64/mm/pgd.c
>>>>>> @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
>>>>>> if (PGD_SIZE == PAGE_SIZE)
>>>>>> return (pgd_t *)__get_free_page(gfp);
>>>>>> else
>>>>>> - return kmem_cache_alloc(pgd_cache, gfp);
>>>>>> + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
>>>>>
>>>>> This is used to allocate PGDs for both user and kernel pagetables (e.g.
>>>>> for the efi runtime services), so while this may fix the regression, I'm
>>>>> not sure it's the right fix.
>>>>>
>>>>> Do we need a separate pgd_alloc_kernel()?
>>>>
>>>> So can I take the above for -rc5, or is somebody else working on a different
>>>> fix to implement pgd_alloc_kernel()?
>>>
>>> The offensive commit "arm64: switch to generic version of pte allocation" is not
>>> yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt
>>> Andrew will push this out any time sooner given it is broken.
>>
>> I'd assumed that Mike would respin these patches to implement and use
>> pgd_alloc_kernel() (or take gfp flags) and the updated patches would
>> replace these in akpm's tree.
>>
>> Mike, could you confirm what your plan is? I'm happy to review/test
>> updated patches for arm64.
>
> Sorry for the delay, I'm mostly offline these days.
>
> I wanted to understand first what is the reason for the failure. I've tried
> to reproduce it with qemu, but I failed to find a bootable configuration
> that will have PGD_SIZE != PAGE_SIZE :(
>
> Qian Cai, can you share what is your environment and the kernel config?

https://raw.githubusercontent.com/cailca/linux-mm/master/arm64.config

# lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 4
Core(s) per socket: 32
Socket(s): 2
NUMA node(s): 2
Vendor ID: Cavium
Model: 1
Model name: ThunderX2 99xx
Stepping: 0x1
BogoMIPS: 400.00
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 32768K
NUMA node0 CPU(s): 0-127
NUMA node1 CPU(s): 128-255
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm

# dmidecode
Handle 0x0001, DMI type 1, 27 bytes
System Information
Manufacturer: HPE
Product Name: Apollo 70
Version: X1
Wake-up Type: Power Switch
Family: CN99XX