Jonathan Cameron wrote:
> On Thu, 30 May 2024 14:59:38 +0800
> Dongsheng Yang <dongsheng.yang@xxxxxxxxxxxx> wrote:
>
> > On Wednesday, 2024/5/29 at 11:25 PM, Gregory Price wrote:
> > > On Wed, May 22, 2024 at 02:17:38PM +0800, Dongsheng Yang wrote:
> > >>
> > >>
> > >> On Wednesday, 2024/5/22 at 2:41 AM, Dan Williams wrote:
> > >>> Dongsheng Yang wrote:
> > >>>
> > >>> What guarantees this property? How does the reader know that its local
> > >>> cache invalidation is sufficient for reading data that has only reached
> > >>> global visibility on the remote peer? As far as I can see, there is
> > >>> nothing that guarantees that local global visibility translates to
> > >>> remote visibility. In fact, the GPF feature is counter-evidence of the
> > >>> fact that writes can be pending in buffers that are only flushed on a
> > >>> GPF event.
> > >>
> > >> Sounds correct. From what I learned about GPF, ADR, and eADR, there
> > >> would still be data in the WPQ even though we perform a CPU cache line
> > >> flush in the OS.
> > >>
> > >> This means we don't have an explicit method to make data puncture all
> > >> caches and land in the media after writing. Also, it seems there isn't
> > >> an explicit method to invalidate all caches along the entire path.
> > >>
> > >>>
> > >>> I remain skeptical that a software-managed inter-host cache-coherency
> > >>> scheme can be made reliable with current CXL-defined mechanisms.
> > >>
> > >> I got your point now: according to the current CXL spec, it seems
> > >> software-managed cache-coherency for inter-host shared memory does not
> > >> work. Will the next version of the CXL spec consider it?
> > >>>
> > >
> > > Sorry for missing the conversation, have been out of office for a bit.
> > >
> > > It's not just a CXL spec issue, though that is part of it. I think the
> > > CXL spec would have to expose some form of puncturing flush, and this
> > > makes the assumption that such a flush doesn't cause some kind of
> > > race/deadlock issue. Certainly this needs to be discussed.
> > >
> > > However, consider that the upstream processor actually has to generate
> > > this flush. This means adding the flush to existing coherence protocols,
> > > or at the very least a new instruction to generate the flush explicitly.
> > > The latter seems more likely than the former.
> > >
> > > This flush would need to ensure the data is forced out of the local WPQ
> > > AND all WPQs south of the PCIe complex - because what you really want to
> > > know is that the data has actually made it back to a place where remote
> > > viewers are capable of perceiving the change.
> > >
> > > So this means:
> > > 1) Spec revision with puncturing flush
> > > 2) Buy-in from CPU vendors to generate such a flush
> > > 3) A new instruction added to the architecture.
> > >
> > > Call me in a decade or so.
> > >
> > > But really, I think it likely we'll see hardware coherence well before
> > > this. For this reason, I have become skeptical of all but a few memory
> > > sharing use cases that depend on software-controlled cache-coherency.
> >
> > Hi Gregory,
> >
> > From my understanding, we actually have the same idea here. What I am
> > saying is that we need the spec to consider this issue, meaning we need
> > to describe how the entire software-coherency mechanism operates, which
> > includes the necessary hardware support. Additionally, I agree that if
> > software-coherency also requires hardware support, it seems that
> > hardware-coherency is the better path.
> > >
> > > There are some (FAMFS, for example). The coherence state of these
> > > systems tends to be less volatile (e.g. mappings are read-only), or
> > > they have inherent design limitations (cacheline-sized message passing
> > > via write-ahead logging only).
> >
> > Can you explain more about this? I understand that if the reader in the
> > writer-reader model is using a read-only mapping, the interaction will
> > be much simpler. However, after the writer writes data, if we don't have
> > a mechanism to flush and invalidate that punctures all caches, how can
> > the read-only reader access the new data?
>
> There is a mechanism for doing coarse-grained flushing that is known to
> work on some architectures. Look at cpu_cache_invalidate_memregion().
> On intel/x86 it's wbinvd_on_all_cpus().

There is no guarantee on x86 that, after cpu_cache_invalidate_memregion(),
a remote shared-memory consumer can be assured to see the writes from that
event.

> On arm64 it's a PSCI firmware call, CLEAN_INV_MEMREGION (there is a
> public alpha specification for PSCI 1.3 with that defined, but we
> don't yet have kernel code).

Does that punch visibility through CXL shared memory devices?

> These are very big hammers. In the extreme end of possible
> implementations they briefly stop all CPUs and clean and invalidate all
> caches of all types, so they are unsuited to anything fine grained, but
> may be acceptable for a rare setup event, particularly if the main job
> of the writing host is to fill that memory for lots of other hosts to
> use.
>
> At least the ARM one takes a range, so it allows for a less painful
> implementation. I'm assuming we'll see new architecture over time, but
> this is a different (and potentially easier) problem space from the one
> you need.

cpu_cache_invalidate_memregion() is only about making sure the local CPU
sees new contents after a DPA:HPA remap event. I hope CPUs are able to get
away from that responsibility long term, when/if future memory expanders
just issue back-invalidate automatically when the HDM decoder configuration
changes.
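For anyone wanting to see the call pattern being discussed, here is a
minimal sketch of how that coarse-grained invalidation is used from kernel
code today. cpu_cache_has_invalidate_memregion(),
cpu_cache_invalidate_memregion() and IORES_DESC_CXL are existing kernel
interfaces; the wrapper function itself is hypothetical and only
illustrates the pattern, and per the above it says nothing about what a
remote host sharing the device will observe.

#include <linux/errno.h>
#include <linux/ioport.h>
#include <linux/memregion.h>

/*
 * Hypothetical helper: after a writer host has finished filling a shared
 * region (a rare setup event), write back and invalidate all local CPU
 * caches for memory described as IORES_DESC_CXL.  On x86 this currently
 * resolves to wbinvd_on_all_cpus(); it only guarantees that local CPUs
 * see the new contents, not that a remote consumer does.
 */
static int example_invalidate_cxl_region(void)
{
	if (!cpu_cache_has_invalidate_memregion())
		return -ENXIO;	/* e.g. WBINVD not usable in a virtualized guest */

	return cpu_cache_invalidate_memregion(IORES_DESC_CXL);
}

Anything finer grained than that, or anything that has to guarantee
visibility to another host, is exactly the gap being discussed above.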