Hello,

No, I don't see any backfill entries in ceph.log during that period. The drives are WD2000FYYZ-01UL1B1, but I did not find anything suspicious in SMART, and yes, I will check the other drives too.

Could I somehow determine which PG the file is placed in?

Thanks

On 2014.05.23 20:51, Craig Lewis wrote:
> On 5/22/14 11:51, Győrvári Gábor wrote:
>> Hello,
>>
>> I got this kind of log on two nodes of a 3-node cluster. Each node has 2
>> OSDs, and only 2 OSDs on two separate nodes are affected, which is why I
>> don't understand the situation. There wasn't any extra IO on the system
>> at the given time.
>>
>> We use radosgw with the S3 API to store objects in Ceph; average ops are
>> around 20-150, with bandwidth usage of 100-2000 KB/s read and only
>> 50-1000 KB/s written.
>>
>> osd_op(client.7821.0:67251068
>> default.4181.1_products/800x600/537e28022fdcc.jpg [cmpxattr
>> user.rgw.idtag (22) op 1 mode 1,setxattr user.rgw.idtag (33),call
>> refcount.put] 11.fe53a6fb e590) v4 currently waiting for subops from [2]
>
> Are any of your PGs in recovery or backfill?
>
> I've seen this happen two different ways. The first time was because
> I had the recovery and backfill parameters set too high for my
> cluster. If your journals aren't SSDs, the default parameters are too
> high. The recovery operation will use most of the IOPS and starve
> the clients.
>
> The second time I saw this was when one disk was starting to fail.
> Sectors were starting to fail, and the drive spent a lot of time reading
> and remapping bad sectors. Consumer-class SATA disks will retry bad
> sectors for 30+ seconds. It happens in the drive firmware, so it's not
> something you can stop. Enterprise-class drives will give up quicker,
> since they know you have another copy of the data. (Nobody uses
> enterprise-class drives stand-alone; they're always in some sort of
> storage array.)
>
> I've had reports of 6+ OSDs blocking subops, and I traced it back to
> one disk that was blocking the others. I replaced that disk, and the
> warnings went away.
>
> If your cluster is healthy, check the SMART attributes for osd.2. If
> osd.2 looks good, it might be another OSD. Check the osd.2 logs, and check
> any OSD that is blocking osd.2. If your cluster is small, it might
> be faster to just check all the disks instead of following the trail.
>
> --
> Craig Lewis
> Senior Systems Engineer
> Office +1.714.602.1309
> Email clewis at centraldesktop.com

--
Győrvári Gábor - Scr34m
scr34m at frontember.hu
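P.S. On finding the PG for a given file: I guess something like the following would show it (a rough sketch only; I'm assuming the default radosgw data pool name .rgw.buckets here, and I took the object name from the slow-op log line above):

    ceph osd map .rgw.buckets default.4181.1_products/800x600/537e28022fdcc.jpg

If I understand it right, that prints the PG the object hashes to (the 11.fe53a6fb part of the log line) and the up/acting OSD set, so I could check whether osd.2 is in the acting set for that object.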