Hi all, I'm trying to reduce stack usage in my BPF program. I switched from `bpf_probe_read()` to `bpf_core_read()`, and the program now appears to exceed the 512-byte stack limit. Are there any profiling tools or compiler flags I can use to find out exactly what is using the most stack space? Also, can anyone point me to good examples of storing structures in per-CPU maps or local storage instead of on the stack? Thanks so much! Grant