On Sat, Jan 11, 2025 at 6:32 AM David Wang <00107082@xxxxxxx> wrote:
>
> Hi,

Hi David,
Sorry for the delay. I'm not ignoring your input; I've just been a bit
busy and haven't had time to reply properly to your questions.

> I have been using this feature for a long while, and I believe this
> memory alloc profiling feature is quite powerful.
>
> But I have been wondering how to use this data, specifically:
> how could an anomaly be detected, and what pattern should be defined
> as an anomaly?
>
> So far, I have tools collecting this data (via prometheus) and doing
> basic analysis, i.e. top-k, group-by or rate. Those analyses help me
> understand my system, but I cannot tell whether it is abnormal or not.
>
> And sometimes I just read through /proc/allocinfo, trying to pick
> something up. (Sometimes I get lucky; actually only once, when I
> found the underflow problem weeks ago.)
>
> A tool would be more helpful if it could identify anomalies, and we
> could add more patterns as development goes along.

You are absolutely correct. Automatic detection of problematic
patterns would be the ultimate goal. We are analyzing the data we
collect and trying to come up with strategies for identifying such
patterns. A simple and obvious pattern for a leak would be constant
growth, but there might be others, like a sawtooth pattern or spikes,
which could point to opportunities to optimize the usage by employing
object pools/caches (a toy sketch of the constant-growth check is
attached at the end of this mail). Categorizing allocations into
hierarchical groups and measuring per-group consumption might be
another useful technique we are considering. All of this is still at
quite an early stage, so ideas and suggestions from people using this
API would be very valuable.

> A pattern may be hard to define, especially when it involves context.
> For example, I happened to notice the following strange thing recently:
>
>   896    14 kernel/sched/topology.c:2275 func:__sdt_alloc 1025
>   896    14 kernel/sched/topology.c:2266 func:__sdt_alloc 1025
>    96     6 kernel/sched/topology.c:2259 func:__sdt_alloc 1025
> 12288    24 kernel/sched/topology.c:2252 func:__sdt_alloc 1025   <----- B
>     0     0 kernel/sched/topology.c:2242 func:__sdt_alloc 210
>     0     0 kernel/sched/topology.c:2238 func:__sdt_alloc 210
>     0     0 kernel/sched/topology.c:2234 func:__sdt_alloc 210
>     0     0 kernel/sched/topology.c:2230 func:__sdt_alloc 210    <----- A
>
> Code A
> 2230         sdd->sd = alloc_percpu(struct sched_domain *);
> 2231         if (!sdd->sd)
> 2232                 return -ENOMEM;
>
> Code B
> 2246         for_each_cpu(j, cpu_map) {
> ...
> 2251
> 2252                 sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(),
> 2253                                   GFP_KERNEL, cpu_to_node(j));
> 2254                 if (!sd)
> 2255                         return -ENOMEM;
> 2256
> 2257                 *per_cpu_ptr(sdd->sd, j) = sd;
>
> The address of the memory allocated by 'Code B' is stored in the
> memory allocated by 'Code A', yet the allocation counter for
> 'Code A' is *0* while the one for 'Code B' is not. Something odd is
> happening here: either it is expected and some ownership change
> happened somewhere, or it is a leak, or it is an accounting problem.
>
> If a tool could help identify this kind of pattern, that would be
> great!~

Hmm. I don't see an easy way to identify such code dependencies from
allocinfo data alone. I think that would involve some sophisticated
code-analysis tooling.

>
> Any suggestions about how to proceed with the memory problem of
> kernel/sched/topology.c mentioned above? Or is it a problem at all?
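
For what it's worth, below is a rough userspace sketch of the
constant-growth check mentioned above. Treat it as an illustration
only, not an existing tool: the file name, the interval/round
defaults, and the assumed "<bytes> <calls> <file:line func:name>"
line shape are my assumptions and may need adjusting for your kernel.
It snapshots /proc/allocinfo a few times and reports call sites whose
byte counter grew in every interval:

/*
 * allocgrowth.c - toy detector for the "constant growth" leak pattern.
 * Takes periodic snapshots of /proc/allocinfo and reports call sites
 * whose byte counter increased in every interval. Names, sizes and
 * defaults are illustrative only.
 *
 * Build: gcc -O2 -o allocgrowth allocgrowth.c
 * Usage: ./allocgrowth <interval-seconds> <rounds>
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_SITES 8192
#define TAG_LEN   256

struct site {
	char tag[TAG_LEN];        /* "file:line func:name" call-site key */
	unsigned long long bytes; /* bytes from the previous snapshot    */
	int grew_every_round;     /* stays 1 while growth is monotonic   */
};

static struct site sites[MAX_SITES];
static int nsites;

static struct site *find_or_add(const char *tag)
{
	for (int i = 0; i < nsites; i++)
		if (!strcmp(sites[i].tag, tag))
			return &sites[i];
	if (nsites >= MAX_SITES)
		return NULL;
	strncpy(sites[nsites].tag, tag, TAG_LEN - 1);
	sites[nsites].grew_every_round = 1;
	return &sites[nsites++];
}

/*
 * Parse one snapshot; on rounds after the first, clear the flag of
 * any call site whose byte count did not grow since the last round.
 */
static void snapshot(int round)
{
	FILE *f = fopen("/proc/allocinfo", "r");
	char line[512];

	if (!f) {
		perror("/proc/allocinfo");
		exit(1);
	}
	while (fgets(line, sizeof(line), f)) {
		unsigned long long bytes, calls;
		char tag[TAG_LEN];
		struct site *s;

		/* Assumed shape: "<bytes> <calls> <file:line func:name>";
		 * header lines fail the scan and are skipped. */
		if (sscanf(line, "%llu %llu %255[^\n]", &bytes, &calls, tag) != 3)
			continue;
		s = find_or_add(tag);
		if (!s)
			continue;
		if (round > 0 && bytes <= s->bytes)
			s->grew_every_round = 0;
		s->bytes = bytes;
	}
	fclose(f);
}

int main(int argc, char **argv)
{
	int interval = argc > 1 ? atoi(argv[1]) : 10;
	int rounds   = argc > 2 ? atoi(argv[2]) : 6;

	for (int r = 0; r < rounds; r++) {
		snapshot(r);
		if (r < rounds - 1)
			sleep(interval);
	}
	printf("call sites that grew in every %d-second interval:\n", interval);
	for (int i = 0; i < nsites; i++)
		if (sites[i].grew_every_round && sites[i].bytes)
			printf("%12llu %s\n", sites[i].bytes, sites[i].tag);
	return 0;
}

A real detector would probably fit a slope over many samples rather
than require strictly monotonic growth, to tolerate noise; and since
you already export these counters to prometheus, the same heuristic
could likely be expressed as an alerting rule there instead.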