On Mon, Apr 13, 2020 at 1:28 AM Keno Fischer <keno@xxxxxxxxxxxxxxxxxx> wrote:
>
> > You did not specify your use case.
>
> My use case is recording (https://rr-project.org/) executions

Cool! I should try that ;-)

> of containers (which often make heavy use of bind mounts on
> the same file system, thus me running into this restriction).
> In essence, at relevant read or mmap operations,
> rr needs to checkpoint the file that was opened,
> in case it later gets deleted or modified.
> It always tries to FICLONE the file first,
> before deciding heuristically whether to
> instead create a copy (if it decides there is a low
> likelihood the file will get changed - e.g. because
> it's a system file - it may decide to take the chance and
> not copy it at the risk of creating a broken recording).
> That's often a decent trade-off, but of course it's not
> 100% perfect.
>
> > The question is: do you *really* need cross mount clone?
> > Can you use copy_file_range() instead?
>
> Good question. copy_file_range doesn't quite work
> for that initial clone, because we do want it to fail if
> cloning doesn't work (so that we can apply the
> heuristics). However, you make a good point that
> the copy fallback should probably use copy_file_range.
> At least that way, if it does decide to copy, the
> performance will be better.
>
> It would still be nice for FICLONE to ease this restriction,
> since it reduces the chance of the heuristics getting
> it wrong and preventing the copy, even if such
> a copy would have been cheap.
>

You make it sound like the heuristic decision must be made
*after* trying to clone, but it can be made before, passing
flags to the kernel that say whether or not to fall back to copy.

copy_file_range(2) has an unused flags argument.
Adding support for flags like:
COPY_FILE_RANGE_BY_FS
COPY_FILE_RANGE_BY_KERNEL
or any other names elected after bike shedding
can be used to control whether the user intends to use
filesystem internal clone/copy methods and/or
to fall back to kernel copy.

I think this functionality will be useful to many.

> > Across which filesystems mounts are you trying to clone?
>
> This functionality was written with btrfs in mind, so that's
> what I was testing with. The mounts themselves are just
> different bindmounts into the same filesystem.
>

I can also suggest a workaround for you.
If your only problem is bind mounts and if the recorder is a privileged
process (CAP_DAC_READ_SEARCH) then you can use a "master" bind mount
to perform all clone operations on.
Use name_to_handle_at(2) to get sb file handle of source file.
Use open_by_handle_at(2) to get an open file descriptor of the source file
under the "master" bind mount.

Thanks,
Amir.
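
For illustration, the workaround described above might look roughly like
the C sketch below. It is not taken from the original message: the helper
names reopen_under_master() and clone_via_master() are hypothetical, and
creating the destination file relative to the "master" mount is an
assumption about how a recorder would lay out its checkpoints. The sketch
also assumes that master_fd is a directory descriptor (opened with
O_RDONLY | O_DIRECTORY) on the "master" bind mount and that the caller
holds CAP_DAC_READ_SEARCH, which open_by_handle_at(2) requires.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* FICLONE */

/*
 * Hypothetical helper: reopen `path` through the mount that `master_fd`
 * refers to. The handle identifies the file within its filesystem, so the
 * returned descriptor belongs to the "master" mount rather than to the
 * bind mount that `path` was resolved through.
 * Note: with flags == 0 a trailing symlink in `path` is not followed.
 * Returns a new O_RDONLY descriptor, or -1 with errno set.
 */
static int reopen_under_master(int master_fd, const char *path)
{
    struct file_handle *fh;
    int mount_id, fd, saved;

    fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
    if (!fh)
        return -1;
    fh->handle_bytes = MAX_HANDLE_SZ;

    if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) < 0) {
        saved = errno;
        free(fh);
        errno = saved;
        return -1;
    }

    /* Needs CAP_DAC_READ_SEARCH; fails with EPERM otherwise. */
    fd = open_by_handle_at(master_fd, fh, O_RDONLY);
    saved = errno;
    free(fh);
    errno = saved;
    return fd;
}

/*
 * Hypothetical helper: clone `src_path` into a new file `dst_name`
 * created relative to `master_fd`, so that both descriptors passed to
 * FICLONE come from the same mount.
 * Returns 0 on success, -1 with errno set on failure. On failure the
 * caller can still apply its own heuristics (copy or skip).
 */
static int clone_via_master(int master_fd, const char *src_path,
                            const char *dst_name)
{
    int src, dst, ret, saved;

    src = reopen_under_master(master_fd, src_path);
    if (src < 0)
        return -1;

    dst = openat(master_fd, dst_name, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (dst < 0) {
        saved = errno;
        close(src);
        errno = saved;
        return -1;
    }

    ret = ioctl(dst, FICLONE, src);
    saved = errno;
    close(dst);
    close(src);
    if (ret < 0)
        unlinkat(master_fd, dst_name, 0);   /* drop the empty destination */
    errno = saved;
    return ret < 0 ? -1 : 0;
}
```

A caller would open the "master" mount once, keep master_fd around, and
call clone_via_master() for each file it wants to checkpoint. If the call
fails, errno tells the caller whether cloning is unsupported or permission
is missing, and the copy-or-skip decision can then be made as before.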