On Wed, May 18, 2016 at 10:26:57AM -0700, Dan Williams wrote:
> On Wed, May 18, 2016 at 10:12 AM, Paul E. McKenney
> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> > On Wed, May 18, 2016 at 11:15:11AM +0200, Hannes Reinecke wrote:
> >> On 05/18/2016 11:10 AM, Paul Mackerras wrote:
> >> > On Wed, May 18, 2016 at 10:07:19AM +0200, Hannes Reinecke wrote:
> >> >> On 05/18/2016 12:19 AM, Dan Williams wrote:
> >> >>> On Tue, May 17, 2016 at 3:57 AM, Johannes Thumshirn <jthumshirn@xxxxxxx> wrote:
> >> >>>> On Sat, May 14, 2016 at 11:26:29PM -0700, Dan Williams wrote:
> >> >>>>> The "Device DAX" core enables dax mappings of performance / feature
> >> >>>>> differentiated memory. An open mapping or file handle keeps the backing
> >> >>>>> struct device live, but new mappings are only possible while the device
> >> >>>>> is enabled. Faults are handled under rcu_read_lock to synchronize
> >> >>>>> with the enabled state of the device.
> >> >>>>>
> >> >>>>> Similar to the filesystem-dax case the backing memory may optionally
> >> >>>>> have struct page entries. However, unlike fs-dax there is no support
> >> >>>>> for private mappings, or mappings that are not backed by media (see
> >> >>>>> use of zero-page in fs-dax).
> >> >>>>>
> >> >>>>> Mappings are always guaranteed to match the alignment of the dax_region.
> >> >>>>> If the dax_region is configured to have a 2MB alignment, all mappings
> >> >>>>> are guaranteed to be backed by a pmd entry. Contrast this determinism
> >> >>>>> with the fs-dax case where pmd mappings are opportunistic. If userspace
> >> >>>>> attempts to force a misaligned mapping, the driver will fail the mmap
> >> >>>>> attempt. See dax_dev_check_vma() for other scenarios that are rejected,
> >> >>>>> like MAP_PRIVATE mappings.
> >> >>>>>
> >> >>>>> Cc: Jeff Moyer <jmoyer@xxxxxxxxxx>
> >> >>>>> Cc: Christoph Hellwig <hch@xxxxxx>
> >> >>>>> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> >> >>>>> Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
> >> >>>>> Cc: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
> >> >>>>> Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
> >> >>>>> ---
> >> >>>>>  drivers/dax/Kconfig |    1
> >> >>>>>  drivers/dax/dax.c   |  316 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >> >>>>>  mm/huge_memory.c    |    1
> >> >>>>>  mm/hugetlb.c        |    1
> >> >>>>>  4 files changed, 319 insertions(+)
> >> >>>>>
> >> >>>>> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> >> >>>>> index 86ffbaa891ad..cedab7572de3 100644
> >> >>>>> --- a/drivers/dax/Kconfig
> >> >>>>> +++ b/drivers/dax/Kconfig
> >> >>>>> @@ -1,6 +1,7 @@
> >> >>>>>  menuconfig DEV_DAX
> >> >>>>>  	tristate "DAX: direct access to differentiated memory"
> >> >>>>>  	default m if NVDIMM_DAX
> >> >>>>> +	depends on TRANSPARENT_HUGEPAGE
> >> >>>>>  	help
> >> >>>>>  	  Support raw access to differentiated (persistence, bandwidth,
> >> >>>>>  	  latency...) memory via an mmap(2) capable character
> >> >>>>> diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
> >> >>>>> index 8207fb33a992..b2fe8a0ce866 100644
> >> >>>>> --- a/drivers/dax/dax.c
> >> >>>>> +++ b/drivers/dax/dax.c
> >> >>>>> @@ -49,6 +49,7 @@ struct dax_region {
> >> >>>>>   * @region - parent region
> >> >>>>>   * @dev - device backing the character device
> >> >>>>>   * @kref - enable this data to be tracked in filp->private_data
> >> >>>>> + * @alive - !alive + rcu grace period == no new mappings can be established
> >> >>>>>   * @id - child id in the region
> >> >>>>>   * @num_resources - number of physical address extents in this device
> >> >>>>>   * @res - array of physical address ranges
> >> >>>>> @@ -57,6 +58,7 @@ struct dax_dev {
> >> >>>>>  	struct dax_region *region;
> >> >>>>>  	struct device *dev;
> >> >>>>>  	struct kref kref;
> >> >>>>> +	bool alive;
> >> >>>>>  	int id;
> >> >>>>>  	int num_resources;
> >> >>>>>  	struct resource res[0];
> >> >>>>> @@ -150,6 +152,10 @@ static void destroy_dax_dev(void *_dev)
> >> >>>>>
> >> >>>>>  	dev_dbg(dev, "%s\n", __func__);
> >> >>>>>
> >> >>>>> +	/* disable and flush fault handlers, TODO unmap inodes */
> >> >>>>> +	dax_dev->alive = false;
> >> >>>>> +	synchronize_rcu();
> >
> > If you need to wait for fault handlers, you need synchronize_sched()
> > instead of synchronize_rcu(). Please note that synchronize_rcu() is
> > guaranteed to wait only for tasks that have done rcu_read_lock() to reach
> > the corresponding rcu_read_unlock(). In contrast, synchronize_sched()
> > is guaranteed to wait for any non-idle preempt-disable region of code
> > to complete, regardless of exactly what is disabling preemption.
> >
> > And the "non-idle" is not an idle qualifier. If you need to wait on fault
> > handlers that somehow occur from an idle hardware thread, you will need
> > those fault handlers to do rcu_irq_enter() on entry and rcu_irq_exit()
> > on exit. (My guess is that you cannot take faults in the idle loop,
> > but I have learned not to trust such guesses all that far.)
> >
> > And last, but definitely not least, synchronize_sched() waits only
> > for pre-existing preempt-disable regions of code. So if you do
> > synchronize_sched(), and immediately after a fault handler starts,
> > synchronize_sched() won't necessarily wait on it. However, you -are-
> > guaranteed that synchronize_sched() -will- wait for any fault handler
> > that might possibly see dax_dev->alive with a non-false value.
> >
> > Are these the guarantees you are looking for? (Yes, I did recently
> > watch "A New Hope". Why do you ask?)
> Spoken like a true rcu-Jedi. ;-)
> So in this case the fault handlers are indeed running under
> rcu_read_lock(), and I can't fathom how these faults would trigger
> from an idle thread...

OK, given all fault handlers use rcu_read_lock() and either:

1.	As you say, faults never trigger from idle, or

2.	The fault handlers call rcu_irq_enter() on entry and
	rcu_irq_exit() on exit

Then, yes, you can keep on using synchronize_rcu().

> >> >>>>> +
> >> >>>>
> >> >>>> IIRC RCU is only protecting a pointer, not the content of the pointer, so this
> >> >>>> looks wrong to me.
> >
> > RCU can be, and usually is, used to protect pointers, but it can be and
> > sometimes is used for other things as well. At its core, RCU is about
> > waiting for pre-existing RCU readers to complete.
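
To make the pattern under discussion concrete, here is a minimal sketch
(illustrative only -- not the actual drivers/dax/dax.c code; the type and
function names below are made up for the example). The fault path is the
RCU reader, and the teardown path is the updater that waits for it:

#include <linux/rcupdate.h>
#include <linux/mm.h>

struct dax_dev_sketch {
	bool alive;
	/* ... kref, resources, etc. ... */
};

/* Reader side: runs in the page-fault path. */
static int dax_dev_fault_sketch(struct dax_dev_sketch *dax_dev,
		struct vm_fault *vmf)
{
	int rc = VM_FAULT_SIGBUS;

	rcu_read_lock();
	if (dax_dev->alive) {
		/*
		 * Teardown cannot get past synchronize_rcu() until this
		 * reader calls rcu_read_unlock(), so it is safe to
		 * establish the mapping here.
		 */
		rc = 0;		/* ... insert the pfn mapping ... */
	}
	rcu_read_unlock();
	return rc;
}

/* Updater side: runs at device teardown. */
static void dax_dev_teardown_sketch(struct dax_dev_sketch *dax_dev)
{
	dax_dev->alive = false;
	/*
	 * Wait for every fault handler that might have observed
	 * ->alive == true; handlers that start after this point see
	 * false and abort.
	 */
	synchronize_rcu();
}

In other words, RCU is not protecting the dax_dev object itself here (the
kref does that); it is only being used to flush out readers that may have
sampled ->alive before it was cleared.
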
> >
> >> >>> The driver is using RCU to guarantee that all currently running fault
> >> >>> handlers have either completed or will see the new state of ->alive
> >> >>> when they start. Reference counts are protecting the actual dax_dev
> >> >>> object.
> >> >>>
> >> >> Hmm.
> >> >> This is the same 'creative' RCU usage Mike Snitzer has been trying
> >> >> when trying to improve device-mapper performance.
> >
> > To repeat, unless all your fault handlers begin with rcu_read_lock()
> > and end with rcu_read_unlock(), and as long as you don't care about not
> > waiting for fault handlers that are currently executing just before
> > the rcu_read_lock() and just after the rcu_read_unlock(), you need
> > synchronize_sched() rather than synchronize_rcu() for this job.
> >
> >> >> From my understanding RCU is protecting the _pointer_, not the
> >> >> values of the structure pointed to.
> >> >> IOW we are guaranteed to have a valid pointer at any time.
> >> >> But at the same time _no_ guarantee is made about the _contents_ of
> >> >> the structure.
> >> >> It might well be that using 'synchronize_rcu' gives you similar
> >> >> results (as synchronize_rcu() is essentially waiting for an SMP grace
> >> >> period, after which all CPUs should be seeing the update).
> >> >> However, I haven't been able to find that this is a guaranteed
> >> >> behaviour.
> >> >> So from my understanding you have to use locking primitives
> >> >> protecting the contents of the structure or exchange the _entire_
> >> >> structure if you want to rely on RCU here.
> >> >>
> >> >> Can we get some clarification here?
> >
> > Maybe... What exactly is your synchronization design needing here?
> >
> >> >> Paul?
> >> >
> >> > I think you want the other Paul, Paul McKenney.
> >> >
> >> I think you are in fact right.
> >> Sorry for the Paul-confusion :-)
> >
> > Did I keep my end of the confusion up? ;-)
> Yes, I think we're good, but please double check I am not mistaken in
> the following clarification comment:
>
> @@ -150,6 +152,16 @@ static void destroy_dax_dev(void *_dev)
>
>         dev_dbg(dev, "%s\n", __func__);
>
> +       /*
> +        * Note, rcu is not protecting the liveness of dax_dev, rcu is
> +        * ensuring that any fault handlers that might have seen
> +        * dax_dev->alive == true, have completed. Any fault handlers
> +        * that start after synchronize_rcu() has started will abort
> +        * upon seeing dax_dev->alive == false.
> +        */
> +       dax_dev->alive = false;
> +       synchronize_rcu();
> +
>         get_device(dev);
>         device_unregister(dev);
>         ida_simple_remove(&dax_region->ida, dax_dev->id);
> @@ -173,6 +185,7 @@ int devm_create_dax_dev(struct dax_region
> *dax_region, struct resource *re

Given your comment and your statement that fault handlers never happen
on idle CPUs and are always protected by rcu_read_lock(), from an RCU
point of view:

Acked-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>

							Thanx, Paul
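
For reference, a minimal sketch of the contrast drawn earlier in the thread
(illustrative only; it reuses the made-up dax_dev_sketch type and the
includes from the sketch above, and is not code from the patch): if the
fault path ran as a preempt-disable region rather than under
rcu_read_lock(), the teardown side would need synchronize_sched() instead
of synchronize_rcu():

/* Reader side as a preempt-disable region (no rcu_read_lock()). */
static int dax_dev_fault_sched_sketch(struct dax_dev_sketch *dax_dev)
{
	int rc = VM_FAULT_SIGBUS;

	preempt_disable();
	if (dax_dev->alive)
		rc = 0;		/* ... establish the mapping ... */
	preempt_enable();
	return rc;
}

static void dax_dev_teardown_sched_sketch(struct dax_dev_sketch *dax_dev)
{
	dax_dev->alive = false;
	/*
	 * synchronize_rcu() only waits for rcu_read_lock() readers;
	 * waiting for pre-existing preempt-disable regions requires
	 * synchronize_sched().
	 */
	synchronize_sched();
}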