On 2019/12/25 20:54, Chris Down wrote: > Amir Goldstein writes: >>> The slab i_ino recycling approach works somewhat, but is unfortunately neutered >>> quite a lot by the fact that slab recycling is per-memcg. That is, replacing >>> with recycle_or_get_next_ino(old_ino)[0] for shmfs and a few other trivial >>> callsites only leads to about 10% slab reuse, which doesn't really stem the >>> bleeding of 32-bit inums on an affected workload: >>> >>> # tail -5000 /sys/kernel/debug/tracing/trace | grep -o 'recycle_or_get_next_ino:.*' | sort | uniq -c >>> 4454 recycle_or_get_next_ino: not recycled >>> 546 recycle_or_get_next_ino: recycled >>> >> >> Too bad.. >> Maybe recycled ino should be implemented all the same because it is simple >> and may improve workloads that are not so MEMCG intensive. > > Yeah, I agree. I'll send the full patch over separately (ie. not as v2 for this) since it's not a total solution for the problem, but still helps somewhat and we all seem to agree that it's overall an uncontroversial improvement. Please cc me as well, thanks. > >>> Roman (who I've just added to cc) tells me that currently we only have >>> per-memcg slab reuse instead of global when using CONFIG_MEMCG. This >>> contributes fairly significantly here since there are multiple tasks across >>> multiple cgroups which are contributing to the get_next_ino() thrash. >>> >>> I think this is a good start, but we need something of a different magnitude in >>> order to actually solve this problem with the current slab infrastructure. How >>> about something like the following? >>> >>> 1. Add get_next_ino_full, which uses whatever the full width of ino_t is >>> 2. Use get_next_ino_full in tmpfs (et al.) >> >> I would prefer that filesystems making heavy use of get_next_ino, be converted >> to use a private ino pool per sb: >> >> ino_pool_create() >> ino_pool_get_next() >> >> flags to ino_pool_create() can determine the desired ino range. >> Does the Facebook use case involve a single large tmpfs or many >> small ones? I would guess the latter and therefore we are trying to solve >> a problem that nobody really needs to solve (i.e. global efficient ino pool). > > Unfortunately in the case under discussion, it's all in one large tmpfs in /dev/shm. I can empathise with that -- application owners often prefer to use the mounts provided to them rather than having to set up their own. For this one case we can change that, but I think it seems reasonable to support this case since using a single tmpfs can be a reasonable decision as an application developer, especially if you only have unprivileged access to the system. > >>> 3. Add a mount option to tmpfs (et al.), say `32bit-inums`, which people can >>> pass if they want the 32-bit inode numbers back. This would still allow >>> people who want to make this tradeoff to use xino. >> >> inode32|inode64 (see man xfs(5)). > > Ah great, thanks! I'll reuse precedent from those. > >>> 4. (If you like) Also add a CONFIG option to disable this at compile time. >>> >> >> I Don't know about disable, but the default mode for tmpfs (inode32|inode64) >> might me best determined by CONFIG option, so distro builders could decide >> if they want to take the risk of breaking applications on tmpfs. > > Sounds good. > >> But if you implement per sb ino pool, maybe inode64 will no longer be >> required for your use case? > > In this case I think per-sb ino pool will help a bit, but unfortunately not by an order of magnitude. As with the recycling patch this will help reduce thrash a bit but not conclusively prevent the problem from happening long-term. To fix that, I think we really do need the option to use ino_t-sized get_next_ino_full (or per-sb equivalent). > > Happy holidays, and thanks for your feedback! > > Chris > > . >