Re: Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

The MDS has to write to its local journal when clients open files, in
case of certain kinds of failures.
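
If you want to watch that happening, the MDS perf counters should show
the journal growing while rsync churns through the tree. Something like
this (run on the active MDS host; counter and section names are from
memory for 12.2, so double-check against your version):

  # journal event/segment counters -- evadd and segadd should keep
  # climbing while rsync opens files
  ceph daemon mds.<name> perf dump mds_log

  # the trim backlog behind the MDS_TRIM warning
  ceph health detail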

I guess it doesn't distinguish between read-only opens (when it could
*probably* avoid writing them down, although that's not as simple as it
sounds) and writable file opens. So every file you're opening requires
the MDS to commit to disk, and it has apparently filled up its allowable
MDS log size, so now you're stuck behind that inter-DC link. A temporary
workaround might be to just keep turning up the MDS log sizes, but I'm
somewhat surprised it was absorbing stuff at a useful rate before, so I
don't know whether changing those will help or not.
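
If you do try that, the knobs are roughly the following (set at runtime
with injectargs, or persisted in ceph.conf on 12.2; option names are
from memory and the numbers are arbitrary, so treat this as a sketch):

  # raise the journal trim thresholds on the running MDSs;
  # mds_log_max_segments defaults to 30, which is where the "64/30"
  # trim warning comes from
  ceph tell mds.* injectargs '--mds_log_max_segments=200 --mds_log_max_expiring=100'

  # or persist the same thing in ceph.conf on the MDS hosts:
  [mds]
      mds_log_max_segments = 200
      mds_log_max_expiring = 100

That only gives the journal more room to absorb bursts, though;
everything still has to get flushed across that 200Mbps link eventually.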
-Greg

On Mon, Mar 19, 2018 at 5:01 PM, Nicolas Huillard <nhuillard@xxxxxxxxxxx> wrote:
> Hi all,
>
> I'm experimenting with a new little storage cluster. I wanted to take
> advantage of the week-end to copy all data (1TB, 10M objects) from the
> cluster to a single SATA disk. I expected to saturate the SATA disk
> while writing to it, but the storage cluster actually saturates its
> network links, while barely writing to the destination disk (63GB
> written in 20h, that's less than 1MBps).
>
> Setup: 2 datacenters × 3 storage servers × 2 OSD disks each, Luminous
> 12.2.4 on Debian stretch, 1Gbps shared network, 200Mbps fibre link
> between datacenters (12ms latency). 4 clients using a single cephfs
> storing data + metadata on the same spinning disks with bluestore.
>
> Test: I'm using a single rsync on one of the client servers (the other
> 3 are just sitting there). rsync is local to the client, copying from
> the cephfs mount (kernel client on 4.14 from stretch-backports, just to
> use a potentially more recent cephfs client than the stock 4.9 one) to the
> SATA disk. The rsync'ed tree consists of lots of tiny files (1-3kB) in
> deep directory branches, along with some large files (10-100MB) in a
> few directories. There is no other activity on the cluster.
>
> Observations: I initially saw write performance on the destination
> disk ranging from a few hundred kB/s (while exploring branches full of
> tiny files) to a few tens of MB/s (while copying large files), with
> file names essentially scrolling by at a relatively fixed rate,
> unrelated to their individual size.
> After 5 hours, the fibre link started to saturate at 200Mbps, while
> writes to the destination disk dropped to a few tens of kB/s.
>
> Using the dashboard, I see lots of metadata writes, at a ~30MB/s rate
> on the metadata pool, which correlates with the 200Mbps link rate.
> It also shows regular "Health check failed: 1 MDSs behind on trimming
> (MDS_TRIM)" / "MDS health message (mds.2): Behind on trimming (64/30)".
>
> I wonder why cephfs would write anything to the metadata pool (I'm
> mounting on the clients with "noatime") while I'm just reading data
> from it... What could I tune to reduce that write-load-while-reading-only?
>
> --
> Nicolas Huillard
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com