Re: cephfs kernel client blocks when removing large files

> On Mon, Oct 22, 2018 at 7:47 PM Dylan McCulloch <dmc at unimelb.edu.au> wrote:
> >
> > > On Mon, Oct 22, 2018 at 2:37 PM Dylan McCulloch <dmc at unimelb.edu.au> wrote:
> > > >
> > > > >
> > > > > On Mon, Oct 22, 2018 at 9:46 AM Dylan McCulloch <dmc at unimelb.edu.au> wrote:
> > > > > >
> > > > > > On Mon, Oct 8, 2018 at 2:57 PM Dylan McCulloch <dmc at unimelb.edu.au> wrote:
> > > > > > >>
> > > > > > >> Hi all,
> > > > > > >>
> > > > > > >>
> > > > > > >> We have identified some unexpected blocking behaviour by the ceph-fs kernel client.
> > > > > > >>
> > > > > > >>
> > > > > > >> When performing 'rm' on large files (100+GB), there appears to be a significant delay of 10 seconds or more, before a 'stat' operation can be performed on the same directory on the filesystem.
> > > > > > >>
> > > > > > >>
> > > > > > >> Looking at the kernel client's mds inflight-ops, we observe that there are pending
> > > > > > >>
> > > > > > >> UNLINK operations corresponding to the deleted files.
> > > > > > >>
> > > > > > >>
> > > > > > >> We have noted some correlation between files being in the client page cache and the blocking behaviour. For example, if the cache is dropped or the filesystem remounted the blocking will not occur.
> > > > > > >>
> > > > > > >>
> > > > > > >> Test scenario below:
> > > > > > >>
> > > > > > >>
> > > > > > >> /mnt/cephfs_mountpoint type ceph (rw,relatime,name=ceph_filesystem,secret=<hidden>,noshare,acl,wsize=16777216,rasize=268439552,caps_wanted_delay_min=1,caps_wanted_delay_max=1)
> > > > > > >>
> > > > > > >>
> > > > > > >> Test1:
> > > > > > >>
> > > > > > >> 1) unmount & remount:
> > > > > > >>
> > > > > > >>
> > > > > > >> 2) Add 10 x 100GB files to a directory:
> > > > > > >>
> > > > > > >>
> > > > > > >> for i in {1..10}; do dd if=/dev/zero of=/mnt/cephfs_mountpoint/file$i.txt count=102400 bs=1048576; done
> > > > > > >>
> > > > > > >>
> > > > > > >> 3) Delete all files in directory:
> > > > > > >>
> > > > > > >>
> > > > > > >> for i in {1..10};do rm -f /mnt/cephfs_mountpoint/file$i.txt; done
> > > > > > >>
> > > > > > >>
> > > > > > >> 4) Immediately perform ls on directory:
> > > > > > >>
> > > > > > >>
> > > > > > >> time ls /mnt/cephfs_mountpoint/test1
> > > > > > >>
> > > > > > >>
> > > > > > >> Result: delay ~16 seconds
> > > > > > >>
> > > > > > >> real    0m16.818s
> > > > > > >>
> > > > > > >> user    0m0.000s
> > > > > > >>
> > > > > > >> sys     0m0.002s
> > > > > > >>
> > > > > > >>
> > > > > >
> > > > > > > Are the cephfs metadata pool and data pool on the same set of
> > > > > > > OSDs? Is it possible that heavy data IO slowed down metadata IO?
> > > > > >
> > > > > > Test results are from a new pre-production cluster that does not have any significant data IO. We've also confirmed the same behaviour on another cluster with similar configuration. Both clusters have separate device-class/crush rule for metadata pool using NVME OSDs, while the data pool uses HDD OSDs.
> > > > > > Most metadata operations are unaffected. It appears that it is only metadata operations on files that exist in client page cache prior to rm that are delayed.
> > > > > >
> > > > >
> > > > > Ok. Please enable kernel debug when running 'ls' and send kernel log to us.
> > > > >
> > > > > echo module ceph +p > /sys/kernel/debug/dynamic_debug/control;
> > > > > time ls /mnt/cephfs_mountpoint/test1
> > > > > echo module ceph -p > /sys/kernel/debug/dynamic_debug/control;
> > > > >
> > > > > Yan, Zheng
> > > >
> > > > Thanks Yan, Zheng
> > > > I've attached two logfiles as I ran the test twice.
> > > > The first time as previously described Test1 - cephfs_kern.log
> > > > The second time I dropped caches prior to rm as in previous Test2 - cephfs_drop_caches_kern.log
> > > >
> > >
> > > The log shows that the client waited 16 seconds for the readdir reply.
> > > Please try again with debug mds/ms enabled and send both the kernel
> > > log and the mds log to us.
> > >
> > > before writing data to files, enable debug_mds and debug_ms (On the
> > > machine where mds.0 runs, run 'ceph daemon mds.x config set debug_mds
> > > 10; ceph daemon mds.x config set debug_ms 1')
> > > ...
> > > echo module ceph +p > /sys/kernel/debug/dynamic_debug/control
> > > time ls /mnt/cephfs_mountpoint/test1
> > > echo module ceph -p > /sys/kernel/debug/dynamic_debug/control
> > > disable debug_mds and debug_ms
> > >
> > > Yan, Zheng
> >
> > tarball of kernel log and mds debug log uploaded:
> > https://swift.rc.nectar.org.au:8888/v1/AUTH_42/cephfs_issue/mds_debug_kern_logs_20181022_2141.tar.gz?temp_url_sig=51f74f07c77346138a164ed229dc8a92f18bed8d&temp_url_expires=1545046086
> >
> > Thanks,
> > Dylan
> >
>
> The log shows that the mds sent its reply immediately after receiving the
> readdir request, but the reply message was delayed for 16 seconds. (The
> mds sent 5 messages at 2018-10-22 21:39:12, the last one being the readdir
> reply. The kclient received the first message at 18739.612013 and the
> readdir reply at 18755.894441.) The delay pattern is: the kclient received
> a message, nothing happened for 4 seconds, it received another one or two
> messages, nothing happened for another 4 seconds, and so on.
>
> Could you please try using the simple messenger. If that does not help,
> try disabling the NIC's offload functions.
>
> Regards
> Yan, Zheng

Sorry for the delay in following this up. I tried simple messenger but that did not make a difference. However, during further testing I concluded that the issue is not related to Ceph at all, so my apologies for that initial assumption. Just responding to the list now in case others encounter similar behaviour.

I believe the delays are simply a result of pagecache reclaim on large memory systems. The delay is only really noticeable when client system memory is >100GB and when files that are currently in pagecache are deleted. Our client hosts have ~900GB RAM. I tested booting those hosts using 'mem=8G' kernel boot option to limit the available RAM and no delay occurs after deletions (since the pagecache is so small).
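A rough way to watch this effect (a sketch only; the 1 GB size and /tmp path are placeholders, nothing like the 100 GB files in the original tests) is to sample the Cached field of /proc/meminfo around the write and the delete:

```shell
# Sketch: watch Cached in /proc/meminfo grow while writing a file and
# shrink after deleting it. Size and path are illustrative placeholders.
cached() { awk '/^Cached:/ {print $2}' /proc/meminfo; }

f=/tmp/pagecache_test.dat
before=$(cached)
dd if=/dev/zero of="$f" bs=1M count=1024 2>/dev/null
after_write=$(cached)
echo "Cached grew by $((after_write - before)) kB"

time rm -f "$f"       # on a local fs, the reclaim cost shows up here
after_rm=$(cached)
echo "Cached shrank by $((after_write - after_rm)) kB"
```

With ~900 GB of RAM mostly occupied by page cache, the same reclaim work is proportionally larger, which is consistent with the delays we measured.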

Further, a very similar issue occurs on the same hosts while running the previously mentioned tests on a local RAID mount (i.e. completely unrelated to Ceph). The same aggregate delay occurs but during the rm command. That is, it appears that the rm command on the local RAID mount does not complete until the pagecache reclaim is complete. Whereas when deleting from the Ceph mount the rm returns almost immediately but a delay is incurred for subsequent stat commands (presumably while pagecache is being reclaimed). The total/aggregate delays for file deletions and subsequent commands on the local RAID mount and the Ceph mount were roughly the same.
So, if I've interpreted that correctly, it appears that there are certain circumstances where pagecache reclaim can be surprisingly expensive on large memory systems.
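The comparison described above can be reproduced with something along these lines (the mount points are placeholders for our hosts' paths, not the exact commands we ran):

```shell
# Sketch: on a local filesystem rm itself absorbs the page-cache reclaim
# cost; on the CephFS mount rm returns quickly and the following stat
# pays it instead. Mount points and size are illustrative.
for mnt in /mnt/local_raid /mnt/cephfs_mountpoint; do
    dd if=/dev/zero of="$mnt/bigfile" bs=1M count=102400
    time rm -f "$mnt/bigfile"   # slow on the local fs, near-instant on cephfs
    time stat "$mnt"            # fast on the local fs, delayed on cephfs
done
```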
If there's a method to 100% confirm the above hypothesis I'd be more than happy to run some more tests. Otherwise, I don't think the delay will impact our production use case. It just manifests in testing due to frequently ingesting and immediately deleting data.
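For what it's worth, one way that might pin the time down (a sketch assuming ftrace is available, debugfs is mounted, and that the relevant symbol is truncate_inode_pages_range, which may differ by kernel version) is to trace the truncate path during the delete:

```shell
# Sketch: use the function_graph tracer to see time spent truncating
# page-cache pages on delete. Requires root; the function name and
# paths are assumptions for ~4.15-era kernels.
cd /sys/kernel/debug/tracing
echo truncate_inode_pages_range > set_graph_function
echo function_graph > current_tracer
echo 1 > tracing_on
rm -f /mnt/cephfs_mountpoint/file1.txt
echo 0 > tracing_on
tail -n 40 trace        # the duration column shows time in the truncate path
echo nop > current_tracer
```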
Thanks again for your help, it's much appreciated.

> > > >
> > > > > > >>
> > > > > > >> Test2:
> > > > > > >>
> > > > > > >>
> > > > > > >> 1) unmount & remount
> > > > > > >>
> > > > > > >>
> > > > > > >> 2) Add 10 x 100GB files to a directory
> > > > > > >>
> > > > > > >> for i in {1..10}; do dd if=/dev/zero of=/mnt/cephfs_mountpoint/file$i.txt count=102400 bs=1048576; done
> > > > > > >>
> > > > > > >>
> > > > > > >> 3) Either a) unmount & remount; or b) drop caches
> > > > > > >>
> > > > > > >>
> > > > > > >> echo 3 > /proc/sys/vm/drop_caches
> > > > > > >>
> > > > > > >>
> > > > > > >> 4) Delete files in directory:
> > > > > > >>
> > > > > > >>
> > > > > > >> for i in {1..10};do rm -f /mnt/cephfs_mountpoint/file$i.txt; done
> > > > > > >>
> > > > > > >>
> > > > > > >> 5) Immediately perform ls on directory:
> > > > > > >>
> > > > > > >>
> > > > > > >> time ls /mnt/cephfs_mountpoint/test1
> > > > > > >>
> > > > > > >>
> > > > > > >> Result: no delay
> > > > > > >>
> > > > > > >> real    0m0.010s
> > > > > > >>
> > > > > > >> user    0m0.000s
> > > > > > >>
> > > > > > >> sys     0m0.001s
> > > > > > >>
> > > > > > >>
> > > > > > >> Our understanding of ceph-fs’ file deletion mechanism is that there should be no blocking observed on the client: http://docs.ceph.com/docs/mimic/dev/delayed-delete/
> > > > > > >>
> > > > > > >> It appears that if files are cached on the client, either by being created or accessed recently, the kernel client will block for reasons we have not identified.
> > > > > > >>
> > > > > > >>
> > > > > > >> Is this a known issue, are there any ways to mitigate this behaviour?
> > > > > > >>
> > > > > > >> Our production system relies on our client’s processes having concurrent access to the file system, and access contention must be avoided.
> > > > > > >>
> > > > > > >>
> > > > > > >> An old mailing list post that discusses changes to client’s page cache behaviour may be relevant.
> > > > > > >>
> > > > > > >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005692.html
> > > > > > >>
> > > > > > >>
> > > > > > >> Client System:
> > > > > > >>
> > > > > > >>
> > > > > > >> OS: RHEL7
> > > > > > >>
> > > > > > >> Kernel: 4.15.15-1
> > > > > > >>
> > > > > > >>
> > > > > > >> Cluster: Ceph: Luminous 12.2.8
> > > > > > >>
> > > > > > >>
> > > > > >
> > > > > >
> > > > > >
> > > > > > >> Thanks,
> > > > > > >>
> > > > > > >> Dylan
> > > > > > >>
> > > > > > >>
> > > > > > >> _______________________________________________
> > > > > > >> ceph-users mailing list
> > > > > > >> ceph-users at lists.ceph.com
> > > > > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
