Amir Goldstein writes:
The slab i_ino recycling approach works somewhat, but is unfortunately neutered
quite a lot by the fact that slab recycling is per-memcg. That is, replacing
with recycle_or_get_next_ino(old_ino)[0] for shmfs and a few other trivial
callsites only leads to about 10% slab reuse, which doesn't really stem the
bleeding of 32-bit inums on an affected workload:
# tail -5000 /sys/kernel/debug/tracing/trace | grep -o 'recycle_or_get_next_ino:.*' | sort | uniq -c
4454 recycle_or_get_next_ino: not recycled
546 recycle_or_get_next_ino: recycled
Too bad..
Maybe recycled ino should be implemented all the same because it is simple
and may improve workloads that are not so MEMCG intensive.
Yeah, I agree. I'll send the full patch over separately (ie. not as v2 for
this) since it's not a total solution for the problem, but still helps somewhat
and we all seem to agree that it's overall an uncontroversial improvement.
Roman (who I've just added to cc) tells me that currently we only have
per-memcg slab reuse instead of global when using CONFIG_MEMCG. This
contributes fairly significantly here since there are multiple tasks across
multiple cgroups which are contributing to the get_next_ino() thrash.
I think this is a good start, but we need something of a different magnitude in
order to actually solve this problem with the current slab infrastructure. How
about something like the following?
1. Add get_next_ino_full, which uses whatever the full width of ino_t is
2. Use get_next_ino_full in tmpfs (et al.)
I would prefer that filesystems making heavy use of get_next_ino, be converted
to use a private ino pool per sb:
ino_pool_create()
ino_pool_get_next()
flags to ino_pool_create() can determine the desired ino range.
Does the Facebook use case involve a single large tmpfs or many
small ones? I would guess the latter and therefore we are trying to solve
a problem that nobody really needs to solve (i.e. global efficient ino pool).
Unfortunately in the case under discussion, it's all in one large tmpfs in
/dev/shm. I can empathise with that -- application owners often prefer to use
the mounts provided to them rather than having to set up their own. For this
one case we can change that, but I think it seems reasonable to support this
case since using a single tmpfs can be a reasonable decision as an application
developer, especially if you only have unprivileged access to the system.
3. Add a mount option to tmpfs (et al.), say `32bit-inums`, which people can
pass if they want the 32-bit inode numbers back. This would still allow
people who want to make this tradeoff to use xino.
inode32|inode64 (see man xfs(5)).
Ah great, thanks! I'll reuse precedent from those.
4. (If you like) Also add a CONFIG option to disable this at compile time.
I Don't know about disable, but the default mode for tmpfs (inode32|inode64)
might me best determined by CONFIG option, so distro builders could decide
if they want to take the risk of breaking applications on tmpfs.
Sounds good.
But if you implement per sb ino pool, maybe inode64 will no longer be
required for your use case?
In this case I think per-sb ino pool will help a bit, but unfortunately not by
an order of magnitude. As with the recycling patch this will help reduce thrash
a bit but not conclusively prevent the problem from happening long-term. To fix
that, I think we really do need the option to use ino_t-sized get_next_ino_full
(or per-sb equivalent).
Happy holidays, and thanks for your feedback!
Chris