On Mon, 3 Jun 2024 18:28:51 +0100
James Morse <james.morse@xxxxxxx> wrote:

> Hi guys,
>
> On 03/06/2024 13:48, Jonathan Cameron wrote:
> > On Fri, 31 May 2024 20:22:42 -0700
> > Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> >> Jonathan Cameron wrote:
> >>> On Thu, 30 May 2024 14:59:38 +0800
> >>> Dongsheng Yang <dongsheng.yang@xxxxxxxxxxxx> wrote:
> >>>> On 2024/5/29 Wednesday 11:25 PM, Gregory Price wrote:
> >>>>> It's not just a CXL spec issue, though that is part of it. I think the
> >>>>> CXL spec would have to expose some form of puncturing flush, and this
> >>>>> makes the assumption that such a flush doesn't cause some kind of
> >>>>> race/deadlock issue. Certainly this needs to be discussed.
> >>>>>
> >>>>> However, consider that the upstream processor actually has to generate
> >>>>> this flush. This means adding the flush to existing coherence protocols,
> >>>>> or at the very least a new instruction to generate the flush explicitly.
> >>>>> The latter seems more likely than the former.
> >>>>>
> >>>>> This flush would need to ensure the data is forced out of the local WPQ
> >>>>> AND all WPQs south of the PCIe complex - because what you really want to
> >>>>> know is that the data has actually made it back to a place where remote
> >>>>> viewers are capable of perceiving the change.
> >>>>>
> >>>>> So this means:
> >>>>> 1) Spec revision with puncturing flush
> >>>>> 2) Buy-in from CPU vendors to generate such a flush
> >>>>> 3) A new instruction added to the architecture.
> >>>>>
> >>>>> Call me in a decade or so.
> >>>>>
> >>>>> But really, I think it likely we see hardware coherence well before this.
> >>>>> For this reason, I have become skeptical of all but a few memory sharing
> >>>>> use cases that depend on software-controlled cache coherency.
> >>>>
> >>>> Hi Gregory,
> >>>>
> >>>> From my understanding, we actually have the same idea here. What I am
> >>>> saying is that we need the spec to consider this issue, meaning we need to
> >>>> describe how the entire software-coherency mechanism operates, which
> >>>> includes the necessary hardware support. Additionally, I agree that if
> >>>> software coherency also requires hardware support, it seems that
> >>>> hardware coherency is the better path.
> >>>>>
> >>>>> There are some (FAMFS, for example). The coherence state of these
> >>>>> systems tends to be less volatile (e.g. mappings are read-only), or
> >>>>> they have inherent design limitations (cacheline-sized message passing
> >>>>> via write-ahead logging only).
> >>>>
> >>>> Can you explain more about this? I understand that if the reader in the
> >>>> writer-reader model is using a read-only mapping, the interaction will be
> >>>> much simpler. However, after the writer writes data, if we don't have a
> >>>> mechanism to flush and invalidate, puncturing all caches, how can the
> >>>> read-only reader access the new data?
> >>>
> >>> There is a mechanism for doing coarse-grained flushing that is known to
> >>> work on some architectures. Look at cpu_cache_invalidate_memregion().
> >>> On Intel/x86 it's wbinvd_on_all_cpus().
> >>
> >> There is no guarantee on x86 that after cpu_cache_invalidate_memregion()
> >> that a remote shared memory consumer can be assured to see the writes
> >> from that event.
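
(Side note for anyone reading along without the tree to hand: the gating we
have today is roughly the shape below. This is a from-memory sketch rather
than a verbatim copy of drivers/cxl/core/region.c, and
example_invalidate_for_remap() is just an illustrative name.)

#include <linux/device.h>
#include <linux/errno.h>
#include <linux/ioport.h>
#include <linux/memregion.h>

/*
 * Illustrative only: gate a DPA:HPA remap on the arch helper.
 * cpu_cache_has_invalidate_memregion() reports whether the architecture
 * can do the big-hammer flush at all (on x86 it is unavailable when
 * running as a guest); the invalidate itself currently boils down to
 * wbinvd_on_all_cpus() on x86.
 */
static int example_invalidate_for_remap(struct device *dev)
{
        if (!cpu_cache_has_invalidate_memregion()) {
                dev_err(dev, "CPU cache invalidation required but not supported\n");
                return -ENXIO;
        }

        return cpu_cache_invalidate_memregion(IORES_DESC_CXL);
}

As Dan says, success there only tells you about the local CPUs' view, not
what a consumer on another host can observe.
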
> > I was wondering about that after I wrote this... I guess it guarantees
> > we won't get a late landing write or is that not even true?
> >
> > So if we remove memory, then add fresh memory again quickly enough,
> > can we get a left over write showing up? I guess that doesn't matter as
> > the kernel will chase it with a memset(0) anyway and that will be ordered
> > with respect to the same address.
> >
> > However, we won't be able to elide that zeroing even if we know the device
> > did it, which makes some operations the device might support rather
> > pointless :(
>
> >>> on arm64 it's a PSCI firmware call CLEAN_INV_MEMREGION (there is a
> >>> public alpha specification for PSCI 1.3 with that defined but we
> >>> don't yet have kernel code.)
>
> I have an RFC for that - but I haven't had time to update and re-test it.
> If it's useful, I might either be able to find time to take that forwards
> (or get someone else to do it).

Let me know if that would be helpful; I'd love to add this to the list of
things I can forget about because it just works for the kernel (and hence is
a problem for the firmware and uarch folk).

> If you need this, and have a platform where it can be implemented, please
> get in touch with the people that look after the specs to move it along
> from alpha.
>
> >> That punches visibility through CXL shared memory devices?
>
> > It's a draft spec and Mark + James in +CC can hopefully confirm.
> > It does say
> > "Cleans and invalidates all caches, including system caches".
> > which I'd read as meaning it should, but good to confirm.
>
> It's intended to remove any cached entries - including lines in what the
> Arm ARM calls "invisible" system caches, which typically only platform
> firmware can touch. The next access should have to go all the way to the
> media. (I don't know enough about CXL to say what a remote shared memory
> consumer observes.)

If it's out of the host bridge buffers (and known to have succeeded in the
write back), which I think the host should know, then I believe what happens
next is a device implementer problem. Hopefully anyone designing a device
that does memory sharing has built that part right.
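
On the kernel side I'd guess the eventual plumbing ends up shaped something
like the below. To be clear, this is a made-up sketch against the alpha
spec: the function ID, the (base, size) argument layout and the return
handling are placeholders I've invented for illustration - James's RFC and
the published spec are the real reference.

#include <linux/arm-smccc.h>
#include <linux/errno.h>
#include <linux/types.h>

/*
 * Hypothetical wrapper for the PSCI 1.3 (alpha) CLEAN_INV_MEMREGION call.
 * CLEAN_INV_MEMREGION_FN_ID and the argument order are placeholders for
 * this sketch only, not taken from the spec.
 */
#define CLEAN_INV_MEMREGION_FN_ID      0x0     /* placeholder, not the real ID */

static int clean_inv_memregion(phys_addr_t base, size_t size)
{
        struct arm_smccc_res res;

        arm_smccc_1_1_invoke(CLEAN_INV_MEMREGION_FN_ID, base, size, 0, &res);
        if ((long)res.a0 < 0)   /* PSCI error codes are negative */
                return -EOPNOTSUPP;

        return 0;
}

The caller just hands the firmware a range; everything interesting has to
happen on the platform side underneath.
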
> Without it, all we have are the by-VA operations which are painfully slow
> for large regions, and insufficient for system caches.
>
> As with all those firmware interfaces - it's for the platform implementer
> to wire up whatever is necessary to remove cached content for the
> specified range. Just because there is an (alpha!) spec doesn't mean it
> can be supported efficiently by a particular platform.
>
> >>> These are very big hammers and so unsuited for anything fine grained.
>
> You forgot really ugly too!

I was being polite :)

> >>> In the extreme end of possible implementations they briefly stop all
> >>> CPUs and clean and invalidate all caches of all types. So not suited
> >>> to anything fine grained, but may be acceptable for a rare setup event,
> >>> particularly if the main job of the writing host is to fill that memory
> >>> for lots of other hosts to use.
> >>>
> >>> At least the ARM one takes a range so allows for a less painful
> >>> implementation.
>
> That is to allow some ranges to fail. (e.g. you can do this to the CXL
> windows, but not the regular DRAM).
>
> On the less painful implementation, Arm's interconnect has a gadget that
> does "Address based flush" which could be used here. I'd hope platforms
> with that don't need to interrupt all CPUs - but it depends on what else
> needs to be done.
>
> >>> I'm assuming we'll see new architecture over time
> >>> but this is a different (and potentially easier) problem space
> >>> to what you need.
> >>
> >> cpu_cache_invalidate_memregion() is only about making sure the local CPU
> >> sees new contents after a DPA:HPA remap event. I hope CPUs are able to
> >> get away from that responsibility long term when/if future memory
> >> expanders just issue back-invalidate automatically when the HDM decoder
> >> configuration changes.
> >
> > I would love that to be the way things go, but I fear the overheads of
> > doing that on the protocol mean people will want the option of the
> > painful approach.
>
> Thanks,
>
> James

Thanks for the info,

Jonathan