RE: Cache tiering slow request issue: currently waiting for rw locks

I don't think the object context is blocked at that time; it is unblocked after the data has been copied from the base tier, so this doesn't address the problem here. Anyway, I'll try it and see.

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
Sent: Saturday, August 30, 2014 10:29 AM
To: Wang, Zhiqiang
Cc: 'ceph-devel@xxxxxxxxxxxxxxx'
Subject: Re: Cache tiering slow request issue: currently waiting for rw locks

Hi,

Can you take a look at https://github.com/ceph/ceph/pull/2363 and see if that addresses the behavior you saw?

Thanks!
sage


On Fri, 29 Aug 2014, Sage Weil wrote:

> Hi,
> 
> I've opened http://tracker.ceph.com/issues/9285 to track this.
> 
> I think you're right--we need a check in agent_maybe_evict() that will 
> skip objects that are being promoted.  I suspect a flag on the 
> ObjectContext is enough?
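> 
> Something along these lines, perhaps (an untested sketch with 
> simplified, self-contained types; the flag name and the real 
> agent_maybe_evict() signature in Ceph differ):
> 
> #include <memory>
> 
> // Simplified stand-in for the real ObjectContext.
> struct ObjectContext {
>   bool blocked_by_promote = false;  // set while a promotion is in flight
> };
> using ObjectContextRef = std::shared_ptr<ObjectContext>;
> 
> // Sketch of the check in the tiering agent's eviction path.
> bool agent_maybe_evict(const ObjectContextRef& obc) {
>   if (obc->blocked_by_promote)
>     return false;  // skip objects that are being promoted
>   // ... existing eviction checks and the actual evict ...
>   return true;
> }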
> 
> sage
> 
> 
> On Fri, 29 Aug 2014, Wang, Zhiqiang wrote:
> 
> > Hi all,
> > 
> > I ran into this slow request issue some time ago. The problem is as follows: when running with cache tiering, 'slow request' warning messages like the ones below show up in the log file.
> > 
> > 2014-08-29 10:18:24.669763 7f9b20f1b700  0 log [WRN] : 1 slow 
> > requests, 1 included below; oldest blocked for > 30.996595 secs
> > 2014-08-29 10:18:24.669768 7f9b20f1b700  0 log [WRN] : slow request 
> > 30.996595 seconds old, received at 2014-08-29 10:17:53.673142: 
> > osd_op(client.114176.0:144919 rb.0.17f56.6b8b4567.000000000935 
> > [sparse-read 3440640~4096] 45.cf45084b ack+read e26168) v4 currently 
> > waiting for rw locks
> > 
> > Recently I added some logging, captured this problem, and finally figured out its root cause. See the attachment for the logs.
> > 
> > Here is the root cause:
> > There is a cache miss when doing a read. During promotion, after copying the data from the base tier OSD, the cache tier primary OSD replicates the data to the other cache tier OSDs. Sometimes this takes quite a long time, and during this period the promoted object may be evicted because the cache tier is full. When the primary OSD finally gets the replication response and restarts the original read request, it doesn't find the object in the cache tier and does the promotion again. This loops several times, and we see the 'slow request' warnings in the logs. In theory this could loop forever, and the request from the client would never finish.
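> > 
> > Schematically, the loop is something like this (a toy model only, not 
> > Ceph code; the eviction race is hard-coded to show the effect):
> > 
> > #include <cstdio>
> > 
> > int main() {
> >   bool in_cache = false;
> >   for (int attempt = 1; attempt <= 5; ++attempt) {  // bounded for demo
> >     if (!in_cache) {
> >       std::printf("attempt %d: miss -> promote (copy from base tier)\n",
> >                   attempt);
> >       in_cache = true;  // data copied into the cache tier
> >       // While replication to the other cache tier OSDs is in flight,
> >       // the agent evicts the object because the tier is full:
> >       bool evicted_during_replication = true;
> >       if (evicted_during_replication)
> >         in_cache = false;
> >     }
> >     if (in_cache) {
> >       std::printf("attempt %d: read served\n", attempt);
> >       return 0;
> >     }
> >   }
> >   std::printf("read still pending after 5 promotions -> slow request\n");
> >   return 0;
> > }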
> > 
> > There is a simple fix for this:
> > Add a field to the object state indicating the status of the promotion. It is set to true after the data is copied from the base tier and before the replication starts, and reset to false after the replication finishes and the original client request starts to execute. Eviction is not allowed while this field is true.
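> > 
> > In code, the lifecycle would look roughly like this (a minimal sketch; 
> > the type and function names are illustrative, not the actual Ceph ones):
> > 
> > struct ObjectState {
> >   // True between the copy from the base tier and the replication ack.
> >   bool promoting = false;
> > };
> > 
> > void on_copy_from_base_done(ObjectState& os) {
> >   os.promoting = true;   // data copied; replication now in flight
> > }
> > 
> > void on_replication_done(ObjectState& os) {
> >   os.promoting = false;  // replicated; the client request can proceed
> > }
> > 
> > bool eviction_allowed(const ObjectState& os) {
> >   return !os.promoting;  // evicting is not allowed mid-promotion
> > }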
> > 
> > What do you think?
> > 



