On Fri, Apr 3, 2020 at 11:48 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Wed, Apr 01, 2020 at 09:22:24AM +0800, Yafang Shao wrote:
> > On Tue, Mar 31, 2020 at 11:11 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > > On Fri, Mar 27, 2020 at 09:17:59AM +0800, Yafang Shao wrote:
> > > > On Thu, Mar 26, 2020 at 10:31 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > With the newly added facility, we can know when these events occur
> > and how long each event takes.
> > Then we can use this data to generate a Latency Heat Map[1] and
> > compare whether these latencies match the application latencies
> > recorded in its log - for example, the slow query log in MySQL. If
> > the refault latencies match the slow query log, then the slow
> > queries are caused by the workingset refaults. If the refault
> > latencies don't match the slow query log, IOW they are much smaller
> > than the slow query latencies, then the slow queries are not caused
> > by workingset refaults.
>
> Okay, you want to use it much finer-grained to understand individual
> end-to-end latencies for your services, rather than "the system is
> melting down and I want to know why." That sounds valid to me.
>

Right, individual end-to-end latencies are very important for
latency-sensitive services.

> > > > > Can you elaborate a bit how you are using this information? It's not
> > > > > quite clear to me from the example in patch #2.
> > > >
> > > > From the traced data in patch #2, we can find that the high latencies
> > > > of user tasks are always type 7 of memstall, which is
> > > > MEMSTALL_WORKINGSET_THRASHING, and then we should look into the
> > > > details of the workingset of the user tasks and think about how
> > > > to improve it - for example, by reducing the workingset.
> > >
> > > That's an analysis we run frequently as well: we see high pressure,
> > > and then correlate it with the events.
> > >
> > > High rate of refaults? The workingset is too big.
> > >
> > > High rate of compaction work? Somebody is asking for higher order
> > > pages under load; check THP events next.
> > >
> > > etc.
> > >
> > > This works fairly reliably. I'm curious what the extra per-event
> > > latency breakdown would add and where it would be helpful.
> > >
> > > I'm not really opposed to your patches if it is, I just don't see
> > > the usecase right now.
> >
> > As I explained above, the rate of these events can't give us a full
> > view of the memory pressure. High memory pressure may not be caused
> > by a high rate of vmstat events; it can equally be caused by a low
> > rate of events that each have a high latency. Latency is what the
> > application really cares about, and that's why PSI is very useful
> > for us.
> >
> > Furthermore, there are some events that are not recorded in vmstat
> > at all, e.g.
> >
> > type of memstall                vmstat event
> >
> > MEMSTALL_KSWAPD                 pageoutrun, {pgscan,pgsteal}_kswapd
> > MEMSTALL_RECLAIM_DIRECT         {pgscan,pgsteal}_direct
> > MEMSTALL_RECLAIM_MEMCG          /* no event */
> > MEMSTALL_RECLAIM_HIGH           /* no event */
> > MEMSTALL_KCOMPACTD              compact_daemon_wake
> > MEMSTALL_COMPACT                compact_{stall,fail,success}
> > MEMSTALL_WORKINGSET_REFAULT     workingset_refault
> > MEMSTALL_WORKINGSET_THRASH      /* no event */
> > MEMSTALL_MEMDELAY               /* no event */
> > MEMSTALL_SWAPIO                 pswpin
> >
> > After we add these types of memstall, we don't need to add the
> > missed events one by one.
>
> I'm a bit concerned about the maintainability of these things. It
> makes moving code around harder, and it forces somebody who has no
> interest in this debugging facility to think about the categories.
>
> And naming them is hard even for somebody who does care. I'm not a fan
> of MEMSTALL_MEMDELAY, for example, because it's way too
> nondescript. The distinction between MEMSTALL_WORKINGSET_REFAULT and
> MEMSTALL_WORKINGSET_THRASH is dubious as well.

I agree with you that the naming is not good.

> These are recipes for bit rot and making the code harder to hack on.
>
> I see two options to do this better: One is to use stack traces as
> identifiers instead of a made-up type. The other is to use the calling
> function as the id (see how kmalloc_track_caller() utilizes _RET_IP_).
>
> bpftrace can use the stack as a map key. So this should already work
> without any kernel modifications, using @start[tid, kstack]?

If we don't make any kernel modifications, it is not easy to tell
whether a psi_memstall_{enter, leave} pair is nested inside another
one or not, and nested calls would add noise to the measurement. The
nesting check happens inside the function itself - see the excerpt
below.
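(For illustration, a simplified excerpt of the current upstream
psi_memstall_enter() in kernel/sched/psi.c, trimmed and without the
"type" argument this series adds. A kprobe fires at function entry,
before the *flags check, so from outside the kernel a nested no-op
call is indistinguishable from a real one:)

void psi_memstall_enter(unsigned long *flags)
{
        struct rq_flags rf;
        struct rq *rq;

        if (static_branch_likely(&psi_disabled))
                return;

        /* Nested call: the outer section is already being accounted. */
        *flags = current->flags & PF_MEMSTALL;
        if (*flags)
                return;

        rq = this_rq_lock_irq(&rf);

        current->flags |= PF_MEMSTALL;
        psi_task_change(current, 0, TSK_MEMSTALL);

        rq_unlock_irq(rq, &rf);
}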
Seems the second option is better; with _RET_IP_ we can also get the
caller. So how about adding tracepoints for psi_memstall_{enter,
leave} as below?

@@ -904,7 +904,7 @@ void psi_memstall_enter(unsigned long *flags, enum memstall_types type)

        if (*flags)
                return;
+       trace_psi_memstall_enter(_RET_IP_);

@@ -944,7 +943,7 @@ void psi_memstall_leave(unsigned long *flags, enum memstall_types type)

        if (*flags)
                return;
+       trace_psi_memstall_leave(_RET_IP_);
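The event definitions these calls would need could be something along
these lines - just a rough sketch to show the shape, not a tested
patch, with the placement (e.g. include/trace/events/sched.h) up for
discussion:

DECLARE_EVENT_CLASS(psi_memstall_template,

        TP_PROTO(unsigned long ip),

        TP_ARGS(ip),

        TP_STRUCT__entry(
                __field(unsigned long, ip)
        ),

        TP_fast_assign(
                __entry->ip = ip;
        ),

        /* %pS turns the raw caller address into a symbol name */
        TP_printk("caller=%pS", (void *)__entry->ip)
);

DEFINE_EVENT(psi_memstall_template, psi_memstall_enter,
        TP_PROTO(unsigned long ip),
        TP_ARGS(ip));

DEFINE_EVENT(psi_memstall_template, psi_memstall_leave,
        TP_PROTO(unsigned long ip),
        TP_ARGS(ip));

Thanks
Yafang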