Re: ceph-fuse and its memory usage

Hi all...

Thank you for the feedback, and I am sorry for my delay in replying.

1./ Just to recall the problem, I was testing cephfs using fio in two ceph-fuse clients:
- Client A is in the same data center as all OSDs connected at 1 GbE
- Client B is in a different data center (in another city), also connected at 1 GbE. However, I've seen that the connection is problematic and, at times, the network performance is well below the theoretical 1 Gbps limit.
- Client A has 24 GB of RAM + 98 GB of swap and Client B has 48 GB of RAM + 98 GB of swap.
    I was seeing that Client B gave much better fio throughput because it was hitting the cache much more than Client A.

--- * ---

2./ I was suspecting that Client B was hitting the cache because it had bad connectivity to the Ceph cluster. I tried to sort that out and was able to track the problem down to a faulty switch. However, after fixing that, I still see the same behaviour, which I can reproduce systematically.

--- * ---

3./ In a new round of tests on Client B, I applied the following procedure:

    3.1/ These are the network statistics right before starting my fio test:
* Printing network statistics:
* /sys/class/net/eth0/statistics/collisions: 0
* /sys/class/net/eth0/statistics/multicast: 453650
* /sys/class/net/eth0/statistics/rx_bytes: 437704562785
* /sys/class/net/eth0/statistics/rx_compressed: 0
* /sys/class/net/eth0/statistics/rx_crc_errors: 0
* /sys/class/net/eth0/statistics/rx_dropped: 0
* /sys/class/net/eth0/statistics/rx_errors: 0
* /sys/class/net/eth0/statistics/rx_fifo_errors: 0
* /sys/class/net/eth0/statistics/rx_frame_errors: 0
* /sys/class/net/eth0/statistics/rx_length_errors: 0
* /sys/class/net/eth0/statistics/rx_missed_errors: 0
* /sys/class/net/eth0/statistics/rx_over_errors: 0
* /sys/class/net/eth0/statistics/rx_packets: 387690140
* /sys/class/net/eth0/statistics/tx_aborted_errors: 0
* /sys/class/net/eth0/statistics/tx_bytes: 149206610455
* /sys/class/net/eth0/statistics/tx_carrier_errors: 0
* /sys/class/net/eth0/statistics/tx_compressed: 0
* /sys/class/net/eth0/statistics/tx_dropped: 0
* /sys/class/net/eth0/statistics/tx_errors: 0
* /sys/class/net/eth0/statistics/tx_fifo_errors: 0
* /sys/class/net/eth0/statistics/tx_heartbeat_errors: 0
* /sys/class/net/eth0/statistics/tx_packets: 241698327
* /sys/class/net/eth0/statistics/tx_window_errors: 0
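    (For reference, a snapshot like the one above can be produced with something as simple as the loop below; this is just a minimal sketch, any equivalent command works.)

for f in /sys/class/net/eth0/statistics/*; do
    echo "* $f: $(cat $f)"        # one line per counter, same format as above
done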
    3.2/ I then launched my fio test. Please note that I drop caches before starting the test (sync; echo 3 > /proc/sys/vm/drop_caches); the full sequence is sketched right after the job file below. My current fio test has nothing fancy. Here are the options:
# cat fio128write_ioenginelibaio_iodepth64_direct1_bs512K_20151013041036.in
[fio128write_ioenginelibaio_iodepth64_direct1_bs512K_20151013041036]
ioengine=libaio
iodepth=64
rw=write
bs=512K
direct=1
size=8192m
numjobs=128
filename=fio128write_ioenginelibaio_iodepth64_direct1_bs512K_20151013041036.data
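    For completeness, the sequence described above looks roughly like the sketch below (the cd path and the output file name are illustrative; any equivalent invocation works):

# drop page cache, dentries and inodes so the test starts cold
sync; echo 3 > /proc/sys/vm/drop_caches
# run the job file from the CephFS directory and keep fio's output
cd /cephfs/sydney
fio fio128write_ioenginelibaio_iodepth64_direct1_bs512K_20151013041036.in \
    > fio128write_ioenginelibaio_iodepth64_direct1_bs512K_20151013041036.out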
    I am not sure if it matters, but the layout of my directory is the following:
# getfattr -n ceph.dir.layout /cephfs/sydney
getfattr: Removing leading '/' from absolute path names
# file: cephfs/sydney
ceph.dir.layout="stripe_unit=524288 stripe_count=8 object_size=4194304 pool=cephfs_dt"
    3.3/ fio reported the following aggregated bandwidth. If I translate that number into a line rate, I get over 3 Gbps, which is impossible on a 1 GbE link (a quick conversion is shown right after the fio output):
# grep aggrb fio128write_ioenginelibaio_iodepth64_direct1_bs512K_20151013041036.out
  WRITE: io=1024.0GB, aggrb=403101KB/s, minb=3149KB/s, maxb=3154KB/s, mint=2659304msec, maxt=2663699msec
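    Making the conversion explicit (assuming fio's KB/s means KiB/s, but the conclusion is the same either way):

echo "403101 * 1024 * 8 / 10^9" | bc -l     # ~3.3 Gbit/s on a link that tops out at 1 Gbit/s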
    3.4/ These are the network statistics immediately after the test:
* Printing network statistics:
* /sys/class/net/eth0/statistics/collisions: 0
* /sys/class/net/eth0/statistics/multicast: 454539
* /sys/class/net/eth0/statistics/rx_bytes: 440300506875
* /sys/class/net/eth0/statistics/rx_compressed: 0
* /sys/class/net/eth0/statistics/rx_crc_errors: 0
* /sys/class/net/eth0/statistics/rx_dropped: 0
* /sys/class/net/eth0/statistics/rx_errors: 0
* /sys/class/net/eth0/statistics/rx_fifo_errors: 0
* /sys/class/net/eth0/statistics/rx_frame_errors: 0
* /sys/class/net/eth0/statistics/rx_length_errors: 0
* /sys/class/net/eth0/statistics/rx_missed_errors: 0
* /sys/class/net/eth0/statistics/rx_over_errors: 0
* /sys/class/net/eth0/statistics/rx_packets: 423468075
* /sys/class/net/eth0/statistics/tx_aborted_errors: 0
* /sys/class/net/eth0/statistics/tx_bytes: 425580907716
* /sys/class/net/eth0/statistics/tx_carrier_errors: 0
* /sys/class/net/eth0/statistics/tx_compressed: 0
* /sys/class/net/eth0/statistics/tx_dropped: 0
* /sys/class/net/eth0/statistics/tx_errors: 0
* /sys/class/net/eth0/statistics/tx_fifo_errors: 0
* /sys/class/net/eth0/statistics/tx_heartbeat_errors: 0
* /sys/class/net/eth0/statistics/tx_packets: 423973681
* /sys/class/net/eth0/statistics/tx_window_errors: 0
    If I just compare tx_bytes before and after the fio test, I get (425580907716 − 149206610455) ≈ 257 GiB (see the quick calculation below). The whole test is supposed to have 128 jobs writing 8 GiB each, giving a total of 1024 GiB. I do not understand how those numbers can differ by roughly a factor of 4, and I also do not understand how caching could compensate for that difference.
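    The arithmetic, for reference:

echo "(425580907716 - 149206610455) / 1024^3" | bc -l    # ~257 GiB actually sent on the wire
echo "128 * 8192 / 1024" | bc -l                         # 1024 GiB that fio claims to have written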

--- * ---

4./ During the whole process I have been monitoring ceph-fuse memory usage, and this is what I get at the beginning and at the end of the test:

START: 4577 root      20   0 5861m  54m 4380 S 97.5  0.1   0:05.93 ceph-fuse     
END: 4577 root      20   0 10.1g 4.5g 4412 S  0.0  9.5  30:48.27 ceph-fuse
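    (The two lines above are top samples. For tracking this over time, a loop along the following lines is enough; 4577 is the ceph-fuse pid from the top output, and the sampling interval is arbitrary.)

while sleep 30; do
    # VSZ and RSS are reported in KiB
    echo "$(date '+%F %T') $(ps -o vsz=,rss= -p 4577)"
done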
--- * ---

5./ I've tried to manipulate the ceph-fuse behaviour via client_cache_size and client_oc_size (by the way, are these values given in bytes?). The defaults are
client cache size = 16384
client oc size = 209715200
and I've decreased both by a factor of 4, but I kept seeing the same behaviour.
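    Concretely, what I changed looks like the sketch below (assuming the options belong in the [client] section of ceph.conf; the units in the comments are my reading of the defaults, and confirming them is part of my question):

[client]
# default 16384; apparently a count of metadata cache entries, not bytes
client cache size = 4096
# default 209715200; apparently a size in bytes (~200 MB reduced to ~50 MB)
client oc size = 52428800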

At this point, I do not have a clear idea why this is happening.

Cheers
Goncalo


On 10/03/2015 04:03 AM, Gregory Farnum wrote:
On Fri, Oct 2, 2015 at 1:57 AM, John Spray <jspray@xxxxxxxxxx> wrote:
On Fri, Oct 2, 2015 at 2:42 AM, Goncalo Borges
<goncalo@xxxxxxxxxxxxxxxxxxx> wrote:
Dear CephFS Gurus...

I have a question regarding ceph-fuse and its memory usage.

1./ My Ceph and CephFS setups are the following:

Ceph:
a. ceph 9.0.3
b. 32 OSDs distributed in 4 servers (8 OSD per server).
c. 'osd pool default size = 3' and 'osd pool default min size = 2'
d. All servers running Centos6.7

CephFS:
e. a single mds
f. dedicated pools for data and metadata
g. clients in different locations / sites mounting CephFS via ceph-fuse
h. All servers and clients running Centos6.7

2./ I have been running fio tests in two CephFS clients:
    - Client A is in the same data center as all OSDs connected at 1 GbE
    - Client B is in a different data center (in another city) also
connected at 1 GbE. However, I've seen that the connection is problematic,
and sometimes, the network performance is well below the theoretical 1 Gbps
limit.
    - Client A has 24 GB RAM + 98 GB of SWAP and client B has 48 GB of RAM +
98 GB of SWAP

3./ I have been running some fio write tests (with 128 threads) in both
clients, and surprisingly, the results show that the aggregated throughput
is better for client B than client A.

CLIENT A results:
# grep agg
fio128threadsALL/fio128write_ioenginelibaio_iodepth64_direct1_bs512K_20151001015558.out
WRITE: io=1024.0GB, aggrb=114878KB/s, minb=897KB/s, maxb=1785KB/s,
mint=4697347msec, maxt=9346754msec

CLIENT B results:
#  grep agg
fio128threadsALL/fio128write_ioenginelibaio_iodepth64_direct1_bs512K_20151001015555.out
WRITE: io=1024.0GB, aggrb=483254KB/s, minb=3775KB/s, maxb=3782KB/s,
mint=2217808msec, maxt=2221896msec

4./ If I actually monitor the memory usage of ceph-fuse during the I/O
tests, I see that

CLIENT A: ceph-fuse does not seem to go beyond 7 GB of VMEM and 1 GB of RMEM.
CLIENT B: ceph-fuse uses 11 GB of VMEM and 7 GB of RMEM.

5./ These results make me think that caching is playing a critical role here.

My questions are the following:

a./ Why does CLIENT B use more memory than CLIENT A? My guess is that there is a
network bottleneck between CLIENT B and the Ceph Cluster, and memory is used
more heavily because of that.
This is weird, and I don't have an explanation.  I would be surprised
if network latency was influencing timing enough to create such a
dramatic difference in caching behaviour.

Are both clients running the same version of ceph-fuse and the same
version of the kernel?
Yeah at 483254KB/s (~480MB/s!) you're clearly exceeding how much the
network can actually support and are just writing into RAM. Something
in the stack is not actually forcing the data to go out to the OSDs
before being acknowledged. Check that all your fio settings are the
same and that directIO is actually doing what you expect, that you
haven't disabled any of that stuff in the kernel, etc.
-Greg
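(A rough way to sanity-check that last point: run a single direct write on the CephFS mount and watch whether the page cache grows anyway. Sketch only, with illustrative paths and sizes.)

dd if=/dev/zero of=/cephfs/sydney/directio_probe bs=512K count=2048 oflag=direct &
while kill -0 $! 2>/dev/null; do
    grep -E '^(Cached|Dirty):' /proc/meminfo     # should stay roughly flat if O_DIRECT is honoured
    sleep 5
done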


b./ Is the better fio write performance on CLIENT B a consequence of the fact
that it is using more memory than CLIENT A?
Seems a reasonable inference, but it's still all very weird!

c./ Is there a parameter we can set for the CephFS clients to limit the
amount of memory they can use?
You can limit the caching inside ceph-fuse by setting
client_cache_size (metadata cache entries) and client_oc_size (max
data cache).  However, there'll also be some caching inside the kernel
(which you can probably control somehow but I don't know off the top
of my head).
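(For what it's worth, the kernel-side knobs John alludes to are presumably the generic writeback tunables; this is only a sketch with illustrative values, and whether they apply to a FUSE mount at all is unverified.)

sysctl -w vm.dirty_background_bytes=268435456    # start background writeback once 256 MiB is dirty
sysctl -w vm.dirty_bytes=536870912               # block writers once 512 MiB is dirty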

Cheers,
John

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
