Just as a weird update to this: I accidentally left the scrub cron script disabled after the testing described in the previous message. Even with *no* deep scrubs running, the "REQUEST_SLOW" problem is still occurring every few minutes. It seems something is seriously wrong with this cluster.

In an attempt to understand this further, I've turned up some of the OSD debugging settings and am trying to figure out what the results mean. Most of the slow requests I see are in the state "queued_for_pg." According to the Troubleshooting OSDs page, "queued_for_pg" means that "the op has been put into the queue for processing by its PG." So if these ops are just waiting in a queue, they are probably victims of the slowness rather than its cause.

The next thing I tried was to capture all of the in-flight ops during a period of slowness via "ceph daemon osd.# ops", run as close to simultaneously as possible on every OSD, to try to find requests that are taking a long time but are not in queue. That hasn't proved very successful. The ops command itself sometimes seems delayed, and by the time the results come back the status is already reporting healthy again; none of the requests captured are >32 seconds old (usually only about 20 seconds), and they are all either in queued_for_pg, or saw long delays in queued_for_pg but now appear to be completing normally and rapidly.

One possibility is that whatever is delaying requests is also delaying processing of the ops command. Another possibility is that my timing and/or luck is just bad, since I have to start the captures by hand (a rough parallel-capture sketch is appended below, after my sign-off). A third possibility is that the queue involved is a priority queue, and that instead of one slow "smoking gun" request blocking everybody, a series of short higher-priority requests is collectively starving normal or lower-priority requests. Is that how it works? Or is the queue in question a simple FIFO?

Is there anything else I can try to help narrow this down?

Thanks!
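In case it is useful, here is roughly what I mean by firing the captures in parallel: ssh to each OSD's host, run "ceph daemon osd.N ops" against the admin socket, and keep only the ops that are old but not sitting in queued_for_pg. This is a minimal, untested sketch; the OSD-to-host map is a placeholder, and the JSON field names (ops, age, description, type_data/flag_point) are from memory of the Luminous ops dump and may need adjusting.

#!/usr/bin/env python
# Rough sketch: snapshot in-flight ops on every OSD as close to
# simultaneously as possible, then print any op that is old but is
# NOT just sitting in queued_for_pg.
# Assumptions: passwordless ssh to the OSD hosts, and that the
# admin-socket "ops" output is JSON with fields named as below.

import json
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Placeholder OSD-id -> host map; fill in from "ceph osd tree".
OSD_HOSTS = {0: "ceph1", 1: "ceph1", 2: "ceph1",
             3: "ceph2", 4: "ceph2", 5: "ceph2"}

AGE_THRESHOLD = 10.0  # seconds; anything older than this is suspect

def snapshot(osd_id, host):
    """Run 'ceph daemon osd.N ops' on the OSD's host over ssh."""
    cmd = ["ssh", host, "ceph", "daemon", "osd.%d" % osd_id, "ops"]
    out = subprocess.check_output(cmd)
    return osd_id, json.loads(out.decode("utf-8"))

def main():
    # Fire all the snapshots at once so the captures line up in time.
    with ThreadPoolExecutor(max_workers=len(OSD_HOSTS)) as pool:
        results = list(pool.map(lambda kv: snapshot(*kv),
                                OSD_HOSTS.items()))

    for osd_id, dump in results:
        for op in dump.get("ops", []):
            age = float(op.get("age", 0))
            flag = op.get("type_data", {}).get("flag_point", "")
            # Ignore ops merely waiting in queue; we want the old ones
            # that are supposedly being worked on.
            if age >= AGE_THRESHOLD and flag != "queued_for_pg":
                print("osd.%d age=%.1fs state=%s %s"
                      % (osd_id, age, flag, op.get("description", "")))

if __name__ == "__main__":
    main()

If nothing old ever shows up outside queued_for_pg even when captured this way, that would at least strengthen the theory that the delay is in the queue itself.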
On Sat, Oct 14, 2017 at 6:51 PM, J David <j.david.lists@xxxxxxxxx> wrote:
> On Sat, Oct 14, 2017 at 9:33 AM, David Turner <drakonstein@xxxxxxxxx> wrote:
>> First, there is no need to deep scrub your PGs every 2 days.
>
> They aren't being deep scrubbed every two days, nor is there any
> attempt (or desire) to do so. That would require 8+ scrubs running
> at once. Currently, it takes between 2 and 3 *weeks* to deep scrub
> every PG one at a time with no breaks. Perhaps you misread "48 days"
> as "48 hours"?
>
> As long as having one deep scrub running renders the cluster unusable,
> the frequency of deep scrubs doesn't really matter; "ever" is too
> often. If that issue can be resolved, the cron script we wrote will
> scrub all the PGs over a period of 28 days.
>
>> I'm thinking your 1GB is either a typo for a 1TB disk or that your DB
>> partitions are 1GB each.
>
> That is a typo, yes. The SSDs are 100GB (really about 132GB, with
> overprovisioning), and each one has three 30GB partitions, one for
> each OSD on that host. These SSDs perform excellently in testing and
> in other applications. They are being utilized at <1% of their I/O
> capacity (by both IOPS and throughput) by this ceph cluster. So far
> there hasn't been anything we've seen suggesting there's a problem
> with these drives.
>
>> Third, when talking of a distributed storage system you can never assume it
>> isn't the network.
>
> No assumption is necessary; the network has been exhaustively tested,
> both with and without ceph running, both with and without LACP.
>
> The network topology is dirt simple. There's a dedicated 10Gbps
> switch with six two-port LACP bonds connected to five ceph nodes, one
> client, and nothing else. There are no interface errors, overruns,
> link failures or LACP errors on any of the cluster nodes or on the
> switch. Like the SSDs (and the CPUs, and the RAM), the network passes
> all tests thrown at it and is being utilized by ceph at a very small
> fraction of its demonstrated capacity.
>
> But it's not a sticking point. The LAN has now been reconfigured to
> remove LACP and use each of the ceph nodes' 10Gbps interfaces
> individually, one as the public network and one as the cluster
> network, with separate VLANs on the switch. That's all confirmed to
> have taken effect after a full shutdown and restart of all five nodes
> and the client.
>
> That change had no effect on this issue.
>
> With that change made, the network was re-tested by setting up 20
> simultaneous iperf sessions, 10 clients and 10 servers, with each
> machine participating in four 10-minute tests at once: inbound public
> network, outbound public network, inbound cluster network, outbound
> cluster network. With all 20 tests running simultaneously, the
> average throughput per test was 7.5Gbps. (With 10 unidirectional
> tests, the average throughput is over 9Gbps.)
>
> The client (participating only on the public network) was separately
> tested. With five sequential runs, each run testing inbound and
> outbound simultaneously between the client and one of the five ceph
> nodes, the results were over 7Gbps in each direction every time.
>
> No loss, errors or drops were observed on any interface, nor on the
> switch, during either test.
>
> So it does not appear that there are any network problems contributing
> to the issue.
>
> Thanks!
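P.S. For anyone who wants to reproduce the network test, the simultaneous iperf runs can be driven with something along these lines (a sketch, not the exact commands used; hostnames, addresses and the pairing list are placeholders, and it assumes classic iperf rather than iperf3):

#!/usr/bin/env python
# Sketch: for each (source, destination) pair, start an iperf server
# on the destination and an iperf client on the source, launching all
# sessions at once so every link is loaded simultaneously.
# Assumptions: passwordless ssh everywhere and classic iperf installed.

import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

DURATION = 600  # seconds per test

# (client host, server host, server address) placeholders; one entry
# per simultaneous session, e.g. public and cluster addresses for
# every ceph node.
PAIRS = [
    ("ceph1", "ceph2", "10.0.1.2"),   # public network
    ("ceph1", "ceph2", "10.0.2.2"),   # cluster network
    ("ceph2", "ceph3", "10.0.1.3"),
    # ... and so on for all the sessions
]

def run_pair(client, server, addr):
    # Start an iperf server bound to the address under test, give it a
    # moment to come up, then drive traffic at it for DURATION seconds.
    srv = subprocess.Popen(["ssh", server, "iperf", "-s", "-B", addr])
    time.sleep(2)
    try:
        out = subprocess.check_output(
            ["ssh", client, "iperf", "-c", addr, "-t", str(DURATION)])
    finally:
        # Ends the ssh session; good enough for a sketch, though the
        # remote iperf server may need cleaning up by hand afterwards.
        srv.terminate()
    return client, server, addr, out.decode("utf-8")

def main():
    with ThreadPoolExecutor(max_workers=len(PAIRS)) as pool:
        for client, server, addr, out in pool.map(lambda p: run_pair(*p),
                                                  PAIRS):
            print("%s -> %s (%s):\n%s" % (client, server, addr, out))

if __name__ == "__main__":
    main()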