On 11/2/18 12:49 PM, Michal Hocko wrote:
> On Fri 02-11-18 12:31:09, Marinko Catovic wrote:
>> On Fri, 2 Nov 2018 at 09:05, Michal Hocko <mhocko@xxxxxxxx> wrote:
>>>
>>> On Thu 01-11-18 23:46:27, Marinko Catovic wrote:
>>>> On Thu, 1 Nov 2018 at 14:23, Michal Hocko <mhocko@xxxxxxxx> wrote:
>>>>>
>>>>> On Wed 31-10-18 20:21:42, Marinko Catovic wrote:
>>>>>> On Wed, 31 Oct 2018 at 18:01, Michal Hocko <mhocko@xxxxxxxx> wrote:
>>>>>>>
>>>>>>> On Wed 31-10-18 15:53:44, Marinko Catovic wrote:
>>>>>>> [...]
>>>>>>>> Well, caching of any operations with find/du is not necessary imho
>>>>>>>> anyway, since walking over all these millions of files in that time
>>>>>>>> period is really not worth caching at all - if there is a way you
>>>>>>>> mentioned to limit the commands there, that would be great.
>>>>>>>
>>>>>>> One possible way would be to run this find/du workload inside a memory
>>>>>>> cgroup with a memory limit set to something reasonable (that will
>>>>>>> likely require some tuning). I am not 100% sure how that will behave
>>>>>>> for a metadata-mostly workload with almost no page cache to reclaim,
>>>>>>> so it might turn out that this results in other issues. But it is
>>>>>>> definitely worth trying.
>>>>>>
>>>>>> hm, how would that be possible..? every user has their own UID, the
>>>>>> group can also not be a factor, since this memory restriction would
>>>>>> apply to all users then; find/du are running as UID 0 to have access
>>>>>> to everyone's data.
>>>>>
>>>>> I thought you have dedicated script(s) to do all the stats. All you
>>>>> need is to run those particular script(s) within a memory cgroup.
>>>>
>>>> yes, that is the case - the scripts are running as root, since as
>>>> mentioned all users have their own UIDs and specific groups, so to have
>>>> access one would need root privileges.
>>>> My question was how to limit this using cgroups, since afaik limits
>>>> there apply to given UIDs/GIDs
>>>
>>> No. Limits apply to a specific memory cgroup and all tasks which are
>>> associated with it. There are many tutorials on how to configure/use
>>> memory cgroups or cgroups in general. If I were you I would simply do
>>> this:
>>>
>>> mount -t cgroup -o memory none $SOME_MOUNTPOINT
>>> mkdir $SOME_MOUNTPOINT/A
>>> echo 500M > $SOME_MOUNTPOINT/A/memory.limit_in_bytes
>>>
>>> Your script then just does:
>>> echo $$ > $SOME_MOUNTPOINT/A/tasks
>>> # rest of your script
>>> echo 1 > $SOME_MOUNTPOINT/A/memory.force_empty
>>>
>>> That should drop the memory cached on behalf of the memcg A, including
>>> the metadata.
>>
>> well, that's an interesting approach. I did not know that it was
>> possible to assign cgroups to PIDs without additionally explicitly
>> defining a UID/GID. This way memory.force_empty basically acts like
>> echo 3 > drop_caches, but only for the memory touched by the PIDs and
>> their children/forks from the A/tasks list, true?
>
> Yup
>
>> I'll give it a try with the nightly du/find jobs, thank you!
>
> I am still a bit curious how that will work out on a metadata-mostly
> workload, because we usually have quite a lot of memory on the normal
> LRUs to reclaim (page cache, anonymous memory) and slab reclaim is just
> there to balance kmem. But let's see. Watch for memcg OOM killer
> invocations if the reclaim is not sufficient.
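
FWIW, once the job runs inside the memcg, you can also check afterwards
whether the limit was actually hit and whether the memcg OOM killer had
to act. An untested sketch, using the same $SOME_MOUNTPOINT/A as in the
example above (cgroup v1 interface):

# how many times the limit was hit and reclaim was forced
cat $SOME_MOUNTPOINT/A/memory.failcnt
# peak memory usage of the group
cat $SOME_MOUNTPOINT/A/memory.max_usage_in_bytes
# memcg OOM kills show up in the kernel log
dmesg | grep -i "memory cgroup out of memory"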
>
>>> [...]
>>>>>> As I understand it, everyone would have this issue when extensive
>>>>>> walking over files is performed; basically any `cloud`, shared
>>>>>> hosting or storage system should experience it, true?
>>>>>
>>>>> Not really. You also need a high demand for high-order allocations,
>>>>> which require contiguous physical memory. Maybe there is something in
>>>>> your workload triggering this particular pattern.
>>>>
>>>> I would not even know what triggers it, nor what it has to do with
>>>> high order; I'm just running find/du, nothing special I'd say.
>>>
>>> Please note that find/du is mostly a fragmentation generator. It
>>> seems there is other system activity which requires those high-order
>>> allocations.
>>
>> any idea how to find out what that might be? I'd really have no idea,
>> and I also wonder why this never was an issue with 3.x.
>> find uses regex patterns, that's the only thing that may be unusual.
>
> The allocation tracepoint has the stack trace so that might help. This
> is quite a lot of work to pinpoint and find a pattern though. This is
> way out of the time scope I can devote to this, unfortunately. This
> might be some driver asking for more, or even the core kernel being
> more high-order memory hungry.

Well, we already checked the mm_page_alloc traces and it seemed that
only THP allocations could be the culprit. But apparently defrag=defer
made no difference. I would still recommend it so we can see its effect
on the traces. And enabling the
compaction/mm_compaction_try_to_compact_pages and
compaction/mm_compaction_suitable tracepoints, as I suggested, should
show which high-order allocations actually invoke compaction.
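
In case it is useful, this is roughly what that setup looks like; an
untested sketch, assuming debugfs is mounted at /sys/kernel/debug and
THP is enabled:

# switch THP to deferred defragmentation; takes effect immediately
echo defer > /sys/kernel/mm/transparent_hugepage/defrag

cd /sys/kernel/debug/tracing
# trace high-order page allocations, with stack traces to see the callers
echo 'order > 0' > events/kmem/mm_page_alloc/filter
echo 1 > events/kmem/mm_page_alloc/enable
echo stacktrace > trace_options
# and see which of those allocations actually end up trying compaction
echo 1 > events/compaction/mm_compaction_try_to_compact_pages/enable
echo 1 > events/compaction/mm_compaction_suitable/enable
# collect while the stalls reproduce; the output file is just an example
cat trace_pipe > /tmp/frag-trace.log

The 'order > 0' filter keeps the log limited to the allocations that
actually need physically contiguous pages.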