On Mon, Oct 22, 2018 at 6:11 PM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> On Mon, Oct 22, 2018 at 6:05 PM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> >
> > On Mon, Oct 22, 2018 at 1:18 PM Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx> wrote:
> > >
> > > Persistent memory is cool. But, currently, you have to rewrite
> > > your applications to use it. Wouldn't it be cool if you could
> > > just have it show up in your system like normal RAM and get to
> > > it like a slow blob of memory? Well... have I got the patch
> > > series for you!
> > >
> > > This series adds a new "driver" to which pmem devices can be
> > > attached. Once attached, the memory "owned" by the device is
> > > hot-added to the kernel and managed like any other memory. On
> > > systems with an HMAT (a new ACPI table), each socket (roughly)
> > > will have a separate NUMA node for its persistent memory so
> > > this newly-added memory can be selected by its unique NUMA
> > > node.
> > >
> > > This is highly RFC, and I really want the feedback from the
> > > nvdimm/pmem folks about whether this is a viable long-term
> > > perversion of their code and device mode. It's insufficiently
> > > documented and probably not bisectable either.
> > >
> > > Todo:
> > > 1. The device re-binding hacks are ham-fisted at best. We
> > >    need a better way of doing this, especially so the kmem
> > >    driver does not get in the way of normal pmem devices.
> > > 2. When the device has no proper node, we default it to
> > >    NUMA node 0. Is that OK?
> > > 3. We muck with the 'struct resource' code quite a bit. It
> > >    definitely needs a once-over from folks more familiar
> > >    with it than I.
> > > 4. Is there a better way to do this than starting with a
> > >    copy of pmem.c?
> >
> > So I don't think we want to do patch 2, 3, or 5. Just jump to patch 7
> > and remove all the devm_memremap_pages() infrastructure and dax_region
> > infrastructure.
> >
> > The driver should be a dead simple turn around to call add_memory()
> > for the passed in range. The hard part is, as you say, arranging for
> > the kmem driver to not stand in the way of typical range / device
> > claims by the dax_pmem device.
> >
> > To me this looks like teaching the nvdimm-bus and this dax_kmem driver
> > to require explicit matching based on 'id'. The attachment scheme
> > would look like this:
> >
> >     modprobe dax_kmem
> >     echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/new_id
> >     echo dax0.0 > /sys/bus/nd/drivers/dax_pmem/unbind
> >     echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/bind
> >
> > At step 1 the dax_kmem driver will match no devices and stay out of
> > the way of dax_pmem. It learns about devices it cares about by being
> > explicitly told about them. Then unbind from the typical dax_pmem
> > driver and attach to dax_kmem to perform the one way hotplug.
> >
> > I expect udev can automate this by setting up a rule to watch for
> > device-dax instances by UUID and call a script to do the detach /
> > reattach dance.
>
> The next question is how to support this for ranges that don't
> originate from the pmem sub-system. I expect we want dax_kmem to
> register a generic platform device representing the range and have a
> generic platform driver that turns around and does the add_memory().

I forgot I have some old patches that do something along these lines
and make device-dax its own bus. I'll dust those off so we can discern
what's left.
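
For reference, a minimal sketch of how the detach / reattach dance quoted
above could be wrapped in a script that a udev rule might call. The sysfs
paths are the ones proposed in this thread and are not guaranteed to exist
on any given kernel; SYSFS_ROOT and rebind_to_kmem are hypothetical names
introduced only for this illustration. SYSFS_ROOT lets the sequence be
dry-run against a scratch directory instead of the real /sys.

```shell
#!/bin/sh
# Sketch only: the paths follow the scheme proposed in this thread.
# SYSFS_ROOT and rebind_to_kmem are hypothetical names for illustration.

SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

rebind_to_kmem() {
    dev="$1"
    drivers="$SYSFS_ROOT/bus/nd/drivers"

    if [ -z "$dev" ]; then
        echo "usage: rebind_to_kmem <device>, e.g. dax0.0" >&2
        return 1
    fi

    # Tell dax_kmem which device it is allowed to claim.
    echo "$dev" > "$drivers/dax_kmem/new_id" || return 1
    # Release the device from the default dax_pmem driver.
    echo "$dev" > "$drivers/dax_pmem/unbind" || return 1
    # Bind it to dax_kmem, which performs the one way hotplug.
    echo "$dev" > "$drivers/dax_kmem/bind" || return 1
}
```

On a real system the caller would first run "modprobe dax_kmem" and then
"rebind_to_kmem dax0.0" as root. The function stops at the first write
that fails, so a device is never unbound from dax_pmem unless dax_kmem
has already accepted its id.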