Hi, Dan,

   Some questions below I can't answer immediately, but I'll spend tomorrow morning irritating people by triggering these events (I think I have a reproducer -- unpacking a 1.2 GiB tarball with 250000 small files in it) and giving you more details. For the ones I can answer right now:

On Wed, Aug 20, 2014 at 02:51:12PM +0000, Dan Van Der Ster wrote:
> Do you get slow requests during the slowness incidents?

   Slow requests, yes. ceph -w reports them coming in groups, e.g.:

2014-08-20 15:51:23.911711 mon.1 [INF] pgmap v2287926: 704 pgs: 704 active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 8449 kB/s rd, 3506 kB/s wr, 527 op/s
2014-08-20 15:51:22.381063 osd.5 [WRN] 6 slow requests, 6 included below; oldest blocked for > 10.133901 secs
2014-08-20 15:51:22.381066 osd.5 [WRN] slow request 10.133901 seconds old, received at 2014-08-20 15:51:12.247127: osd_op(mds.0.101:5528578 10005889b29.00000000 [create 0~0,setxattr parent (394)] 0.786a9365 ondisk+write e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:22.381068 osd.5 [WRN] slow request 10.116337 seconds old, received at 2014-08-20 15:51:12.264691: osd_op(mds.0.101:5529006 1000599e576.00000000 [create 0~0,setxattr parent (392)] 0.5ccbd6a9 ondisk+write e217298) v4 currently waiting for subops from 7
2014-08-20 15:51:22.381070 osd.5 [WRN] slow request 10.116277 seconds old, received at 2014-08-20 15:51:12.264751: osd_op(mds.0.101:5529009 1000588932d.00000000 [create 0~0,setxattr parent (394)] 0.de5eca4e ondisk+write e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:22.381071 osd.5 [WRN] slow request 10.115296 seconds old, received at 2014-08-20 15:51:12.265732: osd_op(mds.0.101:5529042 1000588933e.00000000 [create 0~0,setxattr parent (395)] 0.5e4d56be ondisk+write e217298) v4 currently waiting for subops from 7
2014-08-20 15:51:22.381073 osd.5 [WRN] slow request 10.115184 seconds old, received at 2014-08-20 15:51:12.265844: osd_op(mds.0.101:5529047 1000599e58a.00000000 [create 0~0,setxattr parent (395)] 0.6a487965 ondisk+write e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:24.381370 osd.5 [WRN] 2 slow requests, 2 included below; oldest blocked for > 10.000073 secs
2014-08-20 15:51:24.381373 osd.5 [WRN] slow request 10.000073 seconds old, received at 2014-08-20 15:51:14.381267: osd_op(mds.0.101:5529327 100058893ca.00000000 [create 0~0,setxattr parent (395)] 0.750c7574 ondisk+write e217298) v4 currently commit sent
2014-08-20 15:51:24.381375 osd.5 [WRN] slow request 10.000028 seconds old, received at 2014-08-20 15:51:14.381312: osd_op(mds.0.101:5529329 100058893cb.00000000 [create 0~0,setxattr parent (395)] 0.c75853fa ondisk+write e217298) v4 currently commit sent
2014-08-20 15:51:24.913554 mon.1 [INF] pgmap v2287927: 704 pgs: 704 active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 13218 B/s rd, 3532 kB/s wr, 377 op/s
2014-08-20 15:51:25.381582 osd.5 [WRN] 3 slow requests, 3 included below; oldest blocked for > 10.709989 secs
2014-08-20 15:51:25.381586 osd.5 [WRN] slow request 10.709989 seconds old, received at 2014-08-20 15:51:14.671549: osd_op(mds.0.101:5529457 10005889403.00000000 [create 0~0,setxattr parent (407)] 0.e15ab1fa ondisk+write e217298) v4 currently no flag points reached
2014-08-20 15:51:25.381587 osd.5 [WRN] slow request 10.709767 seconds old, received at 2014-08-20 15:51:14.671771: osd_op(mds.0.101:5529462 10005889406.00000000 [create 0~0,setxattr parent (406)] 0.70f8a5d3 ondisk+write e217298) v4 currently no flag points reached
2014-08-20 15:51:25.381589 osd.5 [WRN] slow request 10.182354 seconds old, received at 2014-08-20 15:51:15.199184: osd_op(mds.0.101:5529464 10005889407.00000000 [create 0~0,setxattr parent (391)] 0.30535d02 ondisk+write e217298) v4 currently no flag points reached
2014-08-20 15:51:25.920298 mon.1 [INF] pgmap v2287928: 704 pgs: 704 active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 12231 B/s rd, 5534 kB/s wr, 370 op/s
2014-08-20 15:51:26.925996 mon.1 [INF] pgmap v2287929: 704 pgs: 704 active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 26498 B/s rd, 8121 kB/s wr, 367 op/s
2014-08-20 15:51:27.933424 mon.1 [INF] pgmap v2287930: 704 pgs: 704 active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 706 kB/s rd, 7552 kB/s wr, 444 op/s
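
   For reference, this is roughly what I'm planning to capture on the OSD hosts the next time I trigger an incident -- a sketch only, assuming the admin sockets are in the default location and that osd.5 is again one of the OSDs reporting slow requests:

   # On the host carrying osd.5, while the slow requests are appearing:
   ceph daemon osd.5 dump_ops_in_flight > /tmp/osd5-ops-in-flight.json
   ceph daemon osd.5 dump_historic_ops  > /tmp/osd5-historic-ops.json

   # Disk utilisation on the OSD data disks and the journal SSD:
   iostat -x 5 3

   # Cluster-wide view of the blocked requests:
   ceph health detail

If there's anything else worth grabbing while it's actually happening, let me know.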
> What about monitor elections?

   No, that's been reporting "monmap e3" and "election epoch 130" for a week or two. I take that to mean we've had no elections. We're actually running without one monitor at the moment, because one machine is down, but we've had the same problems with that machine present.

> Are your MDSs using a lot of CPU?

   No, they're showing load averages well under 1 the whole time. Peak load average is about 0.6.

> did you try tuning anything in the MDS (I think the default config
> is still conservative, and there are options to cache more entries,
> etc?)

   Not much. We have:

[global]
        mds inline_data = true
        mds shutdown check = 2
        mds cache size = 750000

[mds]
        mds client prealloc inos = 10000

> What about iostat on the OSDs? Are your OSD disks busy reading or
> writing during these incidents?

   Not sure. I don't think so, but I'll try to trigger an incident and report back on this one.

> What are you using for OSD journals?

   On each machine, the three OSD journals live on the same ext4 filesystem on an SSD, which is also the root filesystem of the machine.

> Also check the CPU usage for the mons and osds...

   The mons are doing pretty much nothing in terms of CPU, as far as I can see. I will double-check during an incident.

> Does your hardware provide enough IOPS for what your users need?
> (e.g. what is the op/s from ceph -w)

   Not really an answer to your question, but: before the ceph cluster went in, we were running the system on two 5-year-old NFS servers for a while. We have about half the total number of spindles that we used to, but more modern drives.

   I'll look at how the op/s values change when we have the problem. At the moment (with what I assume to be normal desktop usage from the 3-4 users in the lab), they're flapping wildly around a median of 350-400, with peaks up to 800, and somewhere around 15-20 MB/s read and write.

> If disabling deep scrub helps, then it might be that something else
> is reading the disks heavily. One thing to check is updatedb -- we
> had to disable it from indexing /var/lib/ceph on our OSDs.

   I haven't seen that running at all during the day, but I'll look into it.
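
   If it does turn out to be updatedb/mlocate walking the OSD data or the CephFS mounts, I assume the fix is the same sort of thing you did at CERN: adding the relevant paths to the prune lists in /etc/updatedb.conf. A rough sketch only -- the paths are guesses for our setup:

   # /etc/updatedb.conf on the OSD hosts (and probably the clients too):
   # add the OSD data directory to the existing PRUNEPATHS list, e.g.
   PRUNEPATHS="/tmp /var/spool /media /var/lib/ceph"
   # and keep the indexer off CephFS mounts (kernel and FUSE clients):
   PRUNEFS="ceph fuse.ceph-fuse"

I'll also check whether the desktop clients are running their own updatedb jobs against the CephFS mounts.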
   Hugo.

> Best Regards,
> Dan
>
> -- Dan van der Ster || Data & Storage Services || CERN IT Department --
>
> On 20 Aug 2014, at 16:39, Hugo Mills <h.r.mills at reading.ac.uk> wrote:
>
> > We have a ceph system here, and we're seeing performance regularly
> > descend into unusability for periods of minutes at a time (or longer).
> > This appears to be triggered by writing large numbers of small files.
> >
> > Specifications:
> >
> > ceph 0.80.5
> > 6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
> > 2 machines running primary and standby MDS
> > 3 monitors on the same machines as the OSDs
> > Infiniband to about 8 CephFS clients (headless, in the machine room)
> > Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
> > machines, in the analysis lab)
> >
> > The cluster stores home directories of the users and a larger area
> > of scientific data (approx 15 TB) which is being processed and
> > analysed by the users of the cluster.
> >
> > We have a relatively small number of concurrent users (typically
> > 4-6 at most), who use GUI tools to examine their data, and then
> > complex sets of MATLAB scripts to process it, with processing often
> > being distributed across all the machines using Condor.
> >
> > It's not unusual to see the analysis scripts write out large
> > numbers (thousands, possibly tens or hundreds of thousands) of small
> > files, often from many client machines at once in parallel. When this
> > happens, the ceph cluster becomes almost completely unresponsive for
> > tens of seconds (or even for minutes) at a time, until the writes are
> > flushed through the system. Given the nature of modern GUI desktop
> > environments (often reading and writing small state files in the
> > user's home directory), this means that desktop interactiveness and
> > responsiveness for all the other users of the cluster suffer.
> >
> > 1-minute load on the servers typically peaks at about 8 during
> > these events (on 4-core machines). Load on the clients also peaks
> > high, because of the number of processes waiting for a response from
> > the FS. The MDS shows little sign of stress -- it seems to be entirely
> > down to the OSDs. ceph -w shows requests blocked for more than 10
> > seconds, and in bad cases, ceph -s shows up to many hundreds of
> > requests blocked for more than 32s.
> >
> > We've had to turn off scrubbing and deep scrubbing completely --
> > except between 01.00 and 04.00 every night -- because it triggers the
> > exact same symptoms, even with only 2-3 PGs being scrubbed. If it gets
> > up to 7 PGs being scrubbed, as it did on Monday, it's completely
> > unusable.
> >
> > Is this problem something that's often seen? If so, what are the
> > best options for mitigation or elimination of the problem? I've found
> > a few references to issue #6278 [1], but that seems to be referencing
> > scrub specifically, not ordinary (if possibly pathological) writes.
> >
> > What are the sorts of things I should be looking at to work out
> > where the bottleneck(s) are? I'm a bit lost about how to drill down
> > into the ceph system for identifying performance issues. Is there a
> > useful guide to tools somewhere?
> >
> > Is an upgrade to 0.84 likely to be helpful? How "development" are
> > the development releases, from a stability / dangerous bugs point of
> > view?
> >
> > Thanks,
> > Hugo.
> >
> > [1] http://tracker.ceph.com/issues/6278

-- 
Hugo Mills :: IT Services, University of Reading
Specialist Engineer, Research Servers :: x6943 :: R07 Harry Pitt Building