Hello,

On Mon, 5 Dec 2016 15:25:37 +0100 Christian Theune wrote:

> Hi,
>
> we’re currently expanding our cluster to grow the number of IOPS we can provide to clients. We’re still on Hammer but in the process of upgrading to Jewel.

You might want to wait until the next Jewel release, given the current number of issues.

> We started adding pure-SSD OSDs in the last days (based on MICRON S610DC-3840) and the slow requests we’ve seen in the past have started to show a different pattern.
>

I looked in the archives and can't find a full description of your cluster hardware in any posts from you or the other Christian (hint, hint).
Slow requests can nearly always be traced back to HW issues/limitations.

Have you tested these SSDs for their suitability as a Ceph journal (I assume inline journals)?
This has been discussed here countless times, google it.
If those SSDs aren't up to snuff, they may be worse than plain disks in some cases.

In addition, these SSDs have an endurance of 1 DWPD, which drops to less than 0.5 once you factor in journal and FS overheads and write amplification scenarios.
I'd be worried about wearing these out long before 5 years are over.

> I’m currently seeing those:
>
> 2016-12-05 15:13:37.527469 osd.60 172.22.4.46:6818/19894 8080 : cluster [WRN] 5 slow requests, 1 included below; oldest blocked for > 31.675358 secs
> 2016-12-05 15:13:37.527478 osd.60 172.22.4.46:6818/19894 8081 : cluster [WRN] slow request 31.674886 seconds old, received at 2016-12-05 15:13:05.852525: osd_op(client.518589944.0:2734750 rbd_data.1e2b40f879e2a9e3.00000000000000a2 [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 1892352~4096] 277.ceaf1c22 ack+ondisk+write+known_if_redirected e1107736) currently waiting for rw locks
>

Is osd.60 one of these SSDs?

Tools like atop and iostat can give you good insight into how your storage subsystem is doing.

The moment you see slow requests, your cluster has issues you need to address.
Especially if this happens when no scrubs are going on (restrict scrubs to off-peak hours).

The 30 seconds is an arbitrary value and the WRN would warrant an ERR in my book, as some applications take a very dim view of being blocked for more than a few seconds.

Christian

> As slow requests are something that happens a lot to us, I’m willing to invest some time to understand this more in-depth. I’d be happy to either write an open source tool to help interpret and diagnose those, or at least write a blog post. The documentation and google don't tell much about the way to interpret those messages.
>
> So. Two questions:
>
> - any hint (aside from meticulously reading the source) on interpreting those slow request messages in detail?
> - specifically the “waiting for rw locks” is something that’s new to us - can someone enlighten me what it means given the message above?
>
> Cheers,
> Christian
>

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
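
As a possible starting point for the kind of tool mentioned in the quoted question, here is a minimal Python sketch that splits a "slow request" warning into its visible fields. The pattern is derived only from the single sample line quoted in this thread, so it is an assumption that other messages (or other Ceph releases) follow the same layout. The function name `parse_slow_request` and the field names are hypothetical labels chosen for illustration, not Ceph terminology, and the sketch does not attempt to explain what a state such as "waiting for rw locks" means.

```python
import re

# Pattern inferred from the one sample line in this thread (assumption):
# other Ceph versions or op types may format the message differently.
SLOW_REQUEST_RE = re.compile(
    r"slow request (?P<age>\d+\.\d+) seconds old, "
    r"received at (?P<received>\S+ \S+): "
    r"osd_op\((?P<client>\S+) (?P<object>\S+) "
    r"\[(?P<ops>.*)\] (?P<pg>\S+) (?P<flags>\S+) (?P<epoch>e\d+)\) "
    r"currently (?P<state>.+)$"
)


def parse_slow_request(line):
    """Return a dict of fields for a slow request line, or None if no match.

    Field names are hypothetical labels, not official Ceph names.
    """
    match = SLOW_REQUEST_RE.search(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Age of the request in seconds, as reported by the OSD.
    fields["age"] = float(fields["age"])
    # The bracketed part is a comma-separated list of sub-operations.
    fields["ops"] = [op.strip() for op in fields["ops"].split(",")]
    return fields


if __name__ == "__main__":
    import sys

    # Read log lines from stdin and print the state each slow request is in.
    for raw in sys.stdin:
        parsed = parse_slow_request(raw)
        if parsed is not None:
            print(parsed["age"], parsed["object"], parsed["state"])
```

Fed the cluster log on stdin, this prints one line per slow request, which makes it easy to count how often each "currently ..." state occurs and on which objects.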