Hi,
recently we were faced with a bug that caused an infinite loop in a
container. It was from a package (node.js) retaining a scriptlet to
replace a directory as recommended in Fedora Guidelines [0].
There the errno and reason returned [1] is effectively ignored. Only
whether the operation succeed is checked.
There is no differentiation whether the rename failed because the
directory exists which will correctly
cause a rename with a numerical index appended to the directory name. Or
if the rename failed due to any of the other reasons
we might consider as a fail where jumping out of a scriptlet is
preferable to even attempting a rename.
The next numerical index is found with a trial-and-error that runs in a
loop until some index succeeds in the rename, therein lies the potential
for infinite loop.
Conclusion, before I get to the long longer "why". I propose we change
the form from:
~~~
path = "/path/to/dir"
st = posix.stat(path)
if st and st.type == "directory" then
status = os.rename(path, path .. ".rpmmoved")
if not status then
suffix = 0
while not status do
suffix = suffix + 1
status = os.rename(path .. ".rpmmoved", path .. ".rpmmoved." ..
suffix)
end
os.rename(path, path .. ".rpmmoved")
end
end
~~~
to something along the lines of:
~~~
path = "/path/to/dir"
st = posix.stat(path)
reason = nil
if st and st.type == "directory" then
status, reason, errno = os.rename(path, path .. ".rpmmoved")
if not status and errno == 17 then
suffix = 0
while not status and errno == 17 do
suffix = suffix + 1
status, reason, errno = os.rename(path .. ".rpmmoved", path ..
".rpmmoved." .. suffix)
if errno ~= 17 then
print("Failed to rename a directory. " .. reason)
break
end
end
os.rename(path, path .. ".rpmmoved")
else
print("Failed to rename a directory. " .. reason)
end
end
~~~
Not the best I have ever written but it adds checks that it failed for
an expected reason where we intend to continue.
A bit of Lua & overlayFS background of why this is happening.
The scriptlet uses `os.rename`, which maps quite closely, from what I
can tell from the behavior, to the C function `rename`.
OverlayFS does not like to rename directories from lower layers (which
could be a layer created by let's say `RUN mkdir /opt/foo`
directive in a Containerfile) to upper layers without some specific
mount options, like `metacopy` or `redirect_dir` enabled that
enables the rename operation to succeed [2]. If there is not metacopy or
redirect_dir, the operation will fail with errno 18 EXDEV.
From what I understand, when one `podman run`s a container, there is a
new upper layer for separating what happens in the running container,
giving the temporary nature to changes in a running container. Or this
"new" layer atop lower layers can also be observed when attempting to
build a container.
So this is a concern for both the situation of container build and the
situation of container running.
In both of those cases, if we attempt to rename a directory with that
call it will fail without the mentioned mount options. However thanks
to the loop that ignores the exact errno and just blindly tries to find
an index until the `os.rename` returns `true` instead of `nil`, we get
into a
situation where the scriptlet, on package upgrade, goes into an infinite
loop.
A simple reproducer for this issue:
~~~
podman run -it --rm quay.io/fedora/fedora:41 bash -c 'dnf install -y lua
>/dev/null && lua -e "status, str, code = os.rename(\"/var/games\",
\"/var/games.moved\"); print(status, str, code)"; cat /proc/mounts |
grep overlay'
~~~
This will attempt the rename close to what we'd expect in a scriptlet,
and print the reason and specific errno code. Additionally prints how
the overlay is mounted, to inspect options.
This operation will succeed if there is either `redirect_dir=on` or the
`metacopy=on` present in the mount options on the last line printed from
running the container.
`/var/games` was chosen due to no particular reason other than the
directory being present on lower layers and not strictly necessary for
runtime, allowing for experimentation.
Specifically with node.js, it happened simply because there was already
an existing container, similar to quay.io/fedora/nodejs-22 and an
attempt to update via build was made in style of:
~~~
FROM quay.io/fedora/nodejs-22
USER 0
RUN dnf update -y
USER default
~~~
This was enough to trigger the problem. This is very similar to the
reproducer: `os.rename` was called on an already existing directory.
However the directory was in a lower layer, making the fail related to
the errno 18 EXDEV
instead of errno 17 EEXIST which we would regard as a valid errno where
we can continue.
Thoughts? Feedback?
Regards,
Jarek Prokop
[0]
https://docs.fedoraproject.org/en-US/packaging-guidelines/Directory_Replacement/#_scriptlet_to_replace_a_directory
[1] https://www.lua.org/manual/5.3/manual.html#pdf-os.rename
[2]
https://www.kernel.org/doc/html/latest/filesystems/overlayfs.html#renaming-directories
--
_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx
Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue