> On Jun 19, 2019, at 6:00 AM, Bjorn Helgaas <bhelgaas@xxxxxxxxxx> wrote:
>
> On Tue, Jun 18, 2019 at 12:40 AM Nadav Amit <namit@xxxxxxxxxx> wrote:
>>> On Jun 17, 2019, at 10:33 PM, Nadav Amit <namit@xxxxxxxxxx> wrote:
>>>
>>>> On Jun 17, 2019, at 9:57 PM, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>>>>
>>>> On Wed, 12 Jun 2019 21:59:03 -0700 Nadav Amit <namit@xxxxxxxxxx> wrote:
>>>>
>>>>> For efficient search of resources, as needed to determine the memory
>>>>> type for dax page-faults, introduce a cache of the most recently used
>>>>> top-level resource. Caching the top-level should be safe as ranges in
>>>>> that level do not overlap (unlike those of lower levels).
>>>>>
>>>>> Keep the cache per-cpu to avoid possible contention. Whenever a resource
>>>>> is added, removed or changed, invalidate all the resources. The
>>>>> invalidation takes place when the resource_lock is taken for write,
>>>>> preventing possible races.
>>>>>
>>>>> This patch provides relatively small performance improvements over the
>>>>> previous patch (~0.5% on sysbench), but can benefit systems with many
>>>>> resources.
>>>>
>>>>> --- a/kernel/resource.c
>>>>> +++ b/kernel/resource.c
>>>>> @@ -53,6 +53,12 @@ struct resource_constraint {
>>>>>
>>>>>  static DEFINE_RWLOCK(resource_lock);
>>>>>
>>>>> +/*
>>>>> + * Cache of the top-level resource that was most recently used by
>>>>> + * find_next_iomem_res().
>>>>> + */
>>>>> +static DEFINE_PER_CPU(struct resource *, resource_cache);
>>>>
>>>> A per-cpu cache which is accessed under a kernel-wide read_lock looks a
>>>> bit odd - the latency getting at that rwlock will swamp the benefit of
>>>> isolating the CPUs from each other when accessing resource_cache.
>>>>
>>>> On the other hand, if we have multiple CPUs running
>>>> find_next_iomem_res() concurrently then yes, I see the benefit. Has
>>>> the benefit of using a per-cpu cache (rather than a kernel-wide one)
>>>> been quantified?
>>>
>>> No. I am not sure how easy it would be to measure it. On the other hand
>>> the lock is not supposed to be contended (in most cases). At the time I saw
>>> numbers that showed that stores to “exclusive” cache lines can be as
>>> expensive as atomic operations [1]. I am not sure how up to date these
>>> numbers are though. In the benchmark I ran, multiple CPUs ran
>>> find_next_iomem_res() concurrently.
>>>
>>> [1] http://sigops.org/s/conferences/sosp/2013/papers/p33-david.pdf
>>
>> Just to clarify - the main motivation behind the per-cpu variable is not
>> about contention, but about the fact that different processes/threads that
>> run concurrently might use different resources.
>
> IIUC, the underlying problem is that dax relies heavily on ioremap(),
> and ioremap() on x86 takes too long because it relies on
> find_next_iomem_res() via the __ioremap_caller() ->
> __ioremap_check_mem() -> walk_mem_res() path.

I don’t know much about this path and whether it is painful. The path I was
referring to is taken during page-fault handling:

- handle_mm_fault
  - __handle_mm_fault
    - do_wp_page
      - ext4_dax_fault
        - ext4_dax_huge_fault
          - dax_iomap_fault
            - dax_iomap_pte_fault
              - vmf_insert_mixed_mkwrite
                - __vm_insert_mixed
                  - track_pfn_insert
                    - lookup_memtype
                      - pat_pagerange_is_ram

But indeed track_pfn_insert() is x86 specific.
I guess the difference is that on x86 the page tables control cacheability
(via PAT), but I don’t know much about other architectures and whether they
have similar cacheability controls in their page tables.
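
To make the cache idea above a bit more concrete, here is a rough sketch of
how the per-CPU lookup and the invalidation could fit together. This is only
an illustration, not the actual patch: the helpers find_top_level_resource()
and invalidate_resource_cache() are made-up names; only resource_lock,
iomem_resource, find_next_iomem_res() and the DEFINE_PER_CPU() variable come
from the discussion above.

    #include <linux/cpumask.h>
    #include <linux/ioport.h>
    #include <linux/percpu.h>
    #include <linux/spinlock.h>

    /* Both already exist in kernel/resource.c; repeated so the sketch is self-contained. */
    static DEFINE_RWLOCK(resource_lock);
    static DEFINE_PER_CPU(struct resource *, resource_cache);

    /* Writers call this while holding resource_lock for write, before changing the tree. */
    static void invalidate_resource_cache(void)
    {
            int cpu;

            for_each_possible_cpu(cpu)
                    per_cpu(resource_cache, cpu) = NULL;
    }

    /* Find the top-level resource covering @start, preferring this CPU's cached entry. */
    static struct resource *find_top_level_resource(resource_size_t start)
    {
            struct resource *res;

            read_lock(&resource_lock);

            /* Top-level ranges do not overlap, so a cached hit can be used as-is. */
            res = this_cpu_read(resource_cache);
            if (res && res->start <= start && start <= res->end)
                    goto out;

            /* Miss: walk the top-level children of iomem_resource and refill the cache. */
            for (res = iomem_resource.child; res; res = res->sibling) {
                    if (res->start <= start && start <= res->end) {
                            this_cpu_write(resource_cache, res);
                            break;
                    }
            }
    out:
            read_unlock(&resource_lock);
            return res;
    }

Because writers clear every CPU's slot while holding the lock for write, a
reader holding the read lock can never dereference a freed resource, and each
CPU only ever stores to its own slot, which is the point of keeping the cache
per-CPU rather than as a single shared pointer.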