Re: Network design issues

By the way, thanks for reminding me of bmon! Of course. I have a decent collection of live monitoring tools installed and bmon was one of the first. How could I forget?

Another tool I became good friends with is atop. It gives a really good overview of the entire system, including network, disks, swap paging, you name it. I forgot about that too.

Have a good weekend.
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder
Sent: 12 February 2021 18:52:05
To: Stefan Kooman
Cc: ceph-users@xxxxxxx
Subject: Re:  Network design issues

Hi Stefan,

OK, I added the ceph-users again :)

Thanks for your reply, it contains a lot of useful pointers. Yes, these are Dell EMC switches running OS9 and I believe they support per-VLAN bandwidth reservations. That would be the easiest thing to configure and test. At the moment, I always see the slow ping times on both the front and back interfaces at the same time, on exactly the same OSD pairs. If I reserve bandwidth for the replication VLAN and the slow ping times on the back interface disappear, that would be a really strong clue.

I will go through everything after the weekend.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Stefan Kooman <stefan@xxxxxx>
Sent: 12 February 2021 18:18
To: Frank Schilder
Subject: Re:  Network design issues

On 2/12/21 5:27 PM, Frank Schilder wrote:
> Hi Stefan,
>
> do you want to keep this out of the ceph-users list or was it a click-and-miss?

^^ This. I recently switched to Thunderbird because of a mail migration
(from Mutt) ... and I'm not used to it yet. I *tried* to reply to all
(incl. the list) but might have screwed up.

I would consider this to be of general interest.
>
> Thanks for your detailed reply. I take it that I need to provide more info and will try to make a few sketches of the architecture. I think that will help explain the problem. Some quick replies:
>
>> I'm curious what you changed. Want to share it?
>
> # ceph config set mds mds_max_caps_per_client 65536
>
> Thread "cephfs: massive drop in MDS requests per second with increasing number of caps"
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/B7K6B5VXM3I7TODM4GRF3N7S254O5ETY/#WSHVKZIX6QUKJ7XYD45B62VAO6U4UOEE

Ah yes, I've read that thread. Interesting. I haven't tested it out
yet, but will do so.

>
> There are a number of config values with significantly too large defaults; this is one of them. Another one is mon_sync_max_payload_size.

Quite a few people have run into issues with that setting. We haven't had
any issues with it yet, but perhaps I should downscale it as well.
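
For what it's worth, lowering it is just another config one-liner,
analogous to the mds setting you quoted; the 4096 below is only an
assumption on my part (a figure often mentioned on the list), not
something I have verified:

# ceph config set mon mon_sync_max_payload_size 4096
# ceph config get mon mon_sync_max_payload_size    # check the active value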

>
>> Do you know what causes the slow ops?
>
> I don't care about slow ops under high load; those are to be expected. I worry about "slow ping times". These are not expected and are almost certainly caused by congestion of a link.

Yeah sure, I would suspect that as well. Or "discards" from a switch
because of errors, but those are less likely.

>
>> I don't quite get the 10G bottleneck. Sure, a client can saturate a 10
>> Gb/s link, but how does this affect storage <-> storage (replication)
>> traffic and / or other clients?
>
> Because it all happens on the same physical link. We don't have a dedicated replication network; it's all mixed on the same hardware. If a 10G link is saturated, nothing moves any more through that particular link, and the clients are so superior in capacity that they can easily starve parts of the internal Ceph traffic this way.
>
> Basically, we started out with a dedicated replication VLAN and decided to merge this with the access VLAN for simplicity of the set-up. Our networking is currently equivalent to having a single network only. Here the interfaces:
>
> ceph0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
>
> ceph0.81@ceph0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
>
> ceph0.82@ceph0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
>
> $ ethtool ceph0
> Settings for ceph0:
>          Supported ports: [ ]
>          Supported link modes:   Not reported
>          Supported pause frame use: No
>          Supports auto-negotiation: No
>          Supported FEC modes: Not reported
>          Advertised link modes:  Not reported
>          Advertised pause frame use: No
>          Advertised auto-negotiation: No
>          Advertised FEC modes: Not reported
>          Speed: 60000Mb/s
>          Duplex: Full
>          Port: Other
>          PHYAD: 0
>          Transceiver: internal
>          Auto-negotiation: off
> Cannot get wake-on-lan settings: Operation not permitted
>          Link detected: yes
>
> The bond is 6x10G active-active; VLAN 81 is the access VLAN and 82 is the replication network. It all goes over the same lines. This config is very convenient for maintenance, but seems to suffer from not physically reserving bandwidth for VLAN 82. Maybe such a bandwidth-reservation QoS definition could already help?

Are these Dell / EMC switches? You might be able to give priority on a
VLAN level, or "shape" bandwidth based on VLANs. I know that the Aristas
we use have support for that in newish firmware. You might also want to
enable "pause" frames (Ethernet flow control), as that might help during
congestion (a back-off protocol), see:
https://en.wikipedia.org/wiki/Ethernet_flow_control
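
On the Linux side that is usually just an ethtool toggle. A sketch only,
with "em1" as a placeholder for one of your physical bond members (ceph0
itself is a bond, so pause has to be set on the slaves, and the switch
ports need to be configured to match):

$ ethtool -a em1                 # show current pause (flow control) settings
$ ethtool -A em1 rx on tx on     # enable RX/TX pause frames on that port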

Just to note: we don't have a separate replication network / interfaces.
We only have one network. Wido (den Hollander) and I don't see any added
benefit of a separate network either. You only waste bandwidth if you
split them up, and it makes debugging more complex in certain failure
scenarios. Do you know what hashing is in use for the LACP port-channel?
You want to hash on MAC, IP and port (5-tuple). We use Open vSwitch (OVS)
a lot, and with OVS you can balance the load between the LACP links (by
default it evaluates every 10 seconds whether it should move flows around).
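
If ceph0 is a plain Linux kernel bond rather than OVS, you can check the
hash policy like this (the bonding options below are a generic sketch,
adjust to your distro / setup):

$ grep "Transmit Hash Policy" /proc/net/bonding/ceph0
$ # typical options for L3/L4 hashing on an LACP bond (sketch):
$ #   mode=802.3ad miimon=100 xmit_hash_policy=layer3+4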

I doubt there is a silver bullet, but hey, you never know. Do change one
thing at a time, otherwise it will be hard to know what the effect is of
each of the changes (they might even cancel each other out).

>
> I will provide a sketch of the set-up; I think that will make things clearer. I don't think we have an aggregate bandwidth problem. I believe what we have is a load distribution/priority problem across the physical link members of the aggregation group "ceph0" on the storage servers.

Yes, your issue makes more sense to me now. Do you have any metrics on
the load of the individual links? Even bmon might be a useful tool. You
might want to capture metrics (every second or so) to detect "bursts" of
traffic that might cause issues, just to make sure you are on the right
track. We use telegraf as the metric-collecting agent, sending the data
to InfluxDB, but there are many more options.
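
If you want something quick and dirty before a full metrics stack, you
can sample the per-port byte counters from sysfs once a second; a rough
sketch, with "em1 em2" standing in for whatever physical ports make up
ceph0:

$ while sleep 1; do
>   for dev in em1 em2; do
>     echo "$(date +%s) $dev rx=$(cat /sys/class/net/$dev/statistics/rx_bytes) tx=$(cat /sys/class/net/$dev/statistics/tx_bytes)"
>   done
> done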

And then there are other things to tune: TCP checksum offload et al. You
might also hit IRQ balancing issues, and there are ways to overcome those
as well. Are these single-CPU systems? And / or AMD? NUMA might be a
thing too; ideally you have the Ceph OSD daemons pinned to the CPU that
the network / storage adapters are connected to.
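
A quick way to see whether NUMA locality is even in play (again a sketch,
with "em1" as a hypothetical physical NIC; the bond itself has no PCI
device):

$ cat /sys/class/net/em1/device/numa_node    # -1 means no NUMA node reported
$ lscpu | grep -i numa                       # number of NUMA nodes in the box
$ numactl --hardware                         # per-node CPU / memory layout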

Finally, this might be of use:
http://www.brendangregg.com/usemethod.html ;-).

Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



