On 5/22/14 11:51, Győrvári Gábor wrote:
> Hello,
>
> Got this kind of log on two nodes of a 3-node cluster. Both nodes have 2
> OSDs, and only 2 OSDs on two separate nodes are affected, which is why I
> don't understand the situation. There wasn't any extra IO on the system
> at the given time.
>
> Using radosgw with the S3 API to store objects under Ceph; average ops
> around 20-150, and bandwidth usage of 100-2000 kB/s read and only
> 50-1000 kB/s written.
>
> osd_op(client.7821.0:67251068
> default.4181.1_products/800x600/537e28022fdcc.jpg [cmpxattr
> user.rgw.idtag (22) op 1 mode 1,setxattr user.rgw.idtag (33),call
> refcount.put] 11.fe53a6fb e590) v4 currently waiting for subops from [2]

Are any of your PGs in recovery or backfill?

I've seen this happen two different ways.

The first time was because I had the recovery and backfill parameters set
too high for my cluster. If your journals aren't SSDs, the default
parameters are too high: the recovery operations use most of the IOPS and
starve the clients.

The second time I saw this was when one disk was starting to fail. Sectors
started failing, and the drive spent a lot of time reading and remapping
bad sectors. Consumer-class SATA disks will retry bad sectors for 30+
seconds. That happens in the drive firmware, so it's not something you can
stop. Enterprise-class drives give up sooner, since they assume you have
another copy of the data (nobody uses enterprise-class drives stand-alone;
they're always in some sort of storage array). I've had reports of 6+ OSDs
blocking subops, and I traced it back to one disk that was blocking the
others. I replaced that disk, and the warnings went away.

If your cluster is healthy, check the SMART attributes for osd.2. If osd.2
looks good, it might be another OSD: check osd.2's logs, and check any
OSDs that are blocking osd.2. If your cluster is small, it might be faster
to just check all the disks instead of following the trail.

--
Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email clewis at centraldesktop.com
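
P.S. For concreteness, here is a rough sketch of the commands the advice
above maps to. The option names and values are from memory, so verify them
against the documentation for your Ceph release, and /dev/sdX is just a
placeholder for whatever disk backs osd.2 on that node:

    # Throttle recovery/backfill so it stops starving client IO
    # (put the same values under [osd] in ceph.conf to make them persistent)
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

    # See which requests are slow/blocked and which OSDs are involved
    ceph health detail

    # On the node hosting osd.2, check SMART on its data disk
    # (replace /dev/sdX with the device mounted at /var/lib/ceph/osd/ceph-2)
    smartctl -a /dev/sdX | grep -i -e reallocated -e pending -e uncorrectable

Growing reallocated, pending, or uncorrectable sector counts are the usual
sign of the slow-dying disk scenario described above.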