Re: [PATCH] mkfs: acquire flock before modifying the device superblock

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Fri, 14 Oct 2022 08:38:18 -0700

On Fri, Oct 14, 2022 at 04:41:35PM +0800, Wu Guanghao wrote:
> We noticed that systemd has an issue about symlink unreliable caused by
> formatting filesystem and systemd operating on same device.
> Issue Link: https://github.com/systemd/systemd/issues/23746
> 
> According to systemd doc, a BSD flock needs to be acquired before
> formatting the device.
> Related Link: https://systemd.io/BLOCK_DEVICE_LOCKING/

TLDR: udevd wants fs utilities to use advisory file locking to
coordinate (re)writes to block devices to avoid collisions between mkfs
and all the udev magic.

Critically, udev calls flock(LOCK_SH | LOCK_NB) to trylock the device in
shared mode to avoid blocking on fs utilities; if the trylock fails,
they'll move on and try again later.  The old O_EXCL-on-blockdevs trick
will not work for that usecase (I guess) because it's not a shared
reader lock.  It's also not the file locking API.

> So we acquire flock after opening the device but before
> writing superblock.

xfs_db and xfs_repair can write to the filesystem too; shouldn't this
locking apply to them as well?

> Signed-off-by: wuguanghao <wuguanghao3@xxxxxxxxxx>
> ---
>  mkfs/xfs_mkfs.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> index 9dd0e79c..b83cb043 100644
> --- a/mkfs/xfs_mkfs.c
> +++ b/mkfs/xfs_mkfs.c
> @@ -13,6 +13,7 @@
>  #include "libfrog/crc32cselftest.h"
>  #include "proto.h"
>  #include <ini.h>
> +#include <sys/file.h>
> 
>  #define TERABYTES(count, blog) ((uint64_t)(count) << (40 - (blog)))
>  #define GIGABYTES(count, blog) ((uint64_t)(count) << (30 - (blog)))
> @@ -2758,6 +2759,30 @@ _("log stripe unit (%d bytes) is too large (maximum is 256KiB)\n"
> 
>  }
> 
> +static void
> +lock_device(dev_t dev, int flag, char *name)
> +{
> +       int fd = libxfs_device_to_fd(dev);
> +       int readonly = flag & LIBXFS_ISREADONLY;
> +
> +       if (!readonly && fd > 0)
> +               if (flock(fd, LOCK_EX) != 0) {
> +                       fprintf(stderr, "%s: failed to get lock.\n", name);
> +                       exit(1);
> +               }

So yes, this belongs in libxfs_device_open.

If we're opening the bdevs in readonly mode, shouldn't we take LOCK_SH
to prevent mkfs from colliding with (say) xfs_metadump?

Bonus question: Shouldn't the /kernel/ also effectively be taking
LOCK_SH when it opens the bdevs to mount the filesystem?

--D

> +}
> +
> +static void
> +lock_devices(struct libxfs_xinit *xi)
> +{
> +       if (!xi->disfile)
> +               lock_device(xi->ddev, xi->dcreat, xi->dname);
> +       if (xi->logdev && !xi->lisfile)
> +               lock_device(xi->logdev, xi->lcreat, xi->logname);
> +       if (xi->rtdev && !xi->risfile)
> +               lock_device(xi->rtdev, xi->rcreat, xi->rtname);
> +}
> +
>  static void
>  open_devices(
>         struct mkfs_params      *cfg,
> @@ -4208,6 +4233,7 @@ main(
>          * Open and validate the device configurations
>          */
>         open_devices(&cfg, &xi);
> +       lock_devices(&xi);
>         validate_overwrite(dfile, force_overwrite);
>         validate_datadev(&cfg, &cli);
>         validate_logdev(&cfg, &cli, &logfile);
> --
> 2.27.0