Re: Really slow cache-evict

On Fri, 23 Oct 2015, Robert LeBlanc wrote:
> 
> We are testing out cache tiering, but when evicting the cache on an
> idle cluster it is extremely slow (10 objects per minute). Looking at
> top, some of the OSD processes are busy, but the disks are idle across
> the cluster. I upped debug_osd to 20/20 and saw millions of messages
> per object like:
> 
> 2015-10-23 16:52:15.952561 7fa42a1fd700 20 osd.169 39454 should_share_map client.329656 10.208.16.31:0/1012854 39454
> 2015-10-23 16:52:15.952605 7fa42a1fd700 15 osd.169 39454 enqueue_op 0x287ba400 prio 63 cost 450560 latency 0.000999 osd_op(client.329656.0:165865996 rbd_data.14143b5faf8a00.000000000000041f [stat,set-alloc-hint object_size 8388608 write_size 8388608,write 860160~450560] 12.549948fa ack+ondisk+write+known_if_redirected e39454) v5
> 2015-10-23 16:52:15.952642 7fa43fbeb700 10 osd.169 39454 dequeue_op 0x287ba400 prio 63 cost 450560 latency 0.001036 osd_op(client.329656.0:165865996 rbd_data.14143b5faf8a00.000000000000041f [stat,set-alloc-hint object_size 8388608 write_size 8388608,write 860160~450560] 12.549948fa ack+ondisk+write+known_if_redirected e39454) v5 pg pg[12.fa( v 39422'12054 (39253'9006,39422'12054] local-les=39453 n=284 ec=31174 les/c 39453/39453 39452/39452/39442) [169,186,153,136] r=0 lpr=39452 crt=39422'12051 lcod 0'0 mlcod 0'0 active+clean]
> 2015-10-23 16:52:15.952676 7fa43fbeb700 20 osd.169 pg_epoch: 39454 pg[12.fa( v 39422'12054 (39253'9006,39422'12054] local-les=39453 n=284 ec=31174 les/c 39453/39453 39452/39452/39442) [169,186,153,136] r=0 lpr=39452 crt=39422'12051 lcod 0'0 mlcod 0'0 active+clean] op_has_sufficient_caps pool=12 (ssd-pool ) owner=0 need_read_cap=1 need_write_cap=1 need_class_read_cap=0 need_class_write_cap=0 -> NO
> 2015-10-23 16:52:15.952688 7fa42a1fd700 20 osd.169 39454 should_share_map client.329656 10.208.16.31:0/1012854 39454
> 2015-10-23 16:52:15.952701 7fa43fbeb700 10 osd.169 39454 dequeue_op 0x287ba400 finish
> 2015-10-23 16:52:15.952729 7fa42a1fd700 15 osd.169 39454 enqueue_op 0x287c7500 prio 63 cost 4096 latency 0.000092 osd_op(client.329656.0:165866002 rbd_data.14143b5faf8a00.000000000000041f [stat,set-alloc-hint object_size 8388608 write_size 8388608,write 6369280~4096] 12.549948fa ack+ondisk+write+known_if_redirected e39454) v5
> 2015-10-23 16:52:15.952762 7fa43d3e6700 10 osd.169 39454 dequeue_op 0x287c7500 prio 63 cost 4096 latency 0.000125 osd_op(client.329656.0:165866002 rbd_data.14143b5faf8a00.000000000000041f [stat,set-alloc-hint object_size 8388608 write_size 8388608,write 6369280~4096] 12.549948fa ack+ondisk+write+known_if_redirected e39454) v5 pg pg[12.fa( v 39422'12054 (39253'9006,39422'12054] local-les=39453 n=284 ec=31174 les/c 39453/39453 39452/39452/39442) [169,186,153,136] r=0 lpr=39452 crt=39422'12051 lcod 0'0 mlcod 0'0 active+clean]
> 2015-10-23 16:52:15.952787 7fa43d3e6700 20 osd.169 pg_epoch: 39454 pg[12.fa( v 39422'12054 (39253'9006,39422'12054] local-les=39453 n=284 ec=31174 les/c 39453/39453 39452/39452/39442) [169,186,153,136] r=0 lpr=39452 crt=39422'12051 lcod 0'0 mlcod 0'0 active+clean] op_has_sufficient_caps pool=12 (ssd-pool ) owner=0 need_read_cap=1 need_write_cap=1 need_class_read_cap=0 need_class_write_cap=0 -> NO
> 2015-10-23 16:52:15.952832 7fa43d3e6700 10 osd.169 39454 dequeue_op 0x287c7500 finish
> 
> It looks like the op may be bouncing between threads or something
> like that and never getting dispatched correctly. This is on
> ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a).
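
For reference, the debug_osd bump mentioned above can be applied at runtime
with injectargs; a rough sketch only, assuming osd.169 and the usual 0/5
default when turning logging back down:

    # raise OSD logging to capture the enqueue_op/dequeue_op detail
    ceph tell osd.169 injectargs '--debug_osd 20/20'
    # ...reproduce the slow eviction, then restore the default level
    ceph tell osd.169 injectargs '--debug_osd 0/5'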

I don't see the same op looping in the snippet above (each appears only
once for enqueue_op and once for dequeue_op), but it's hard to read after
word-wrapping to 80 cols.  Do you mind posting a larger snippet of the log
somewhere?

sage
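
A rough way to check whether a single op really is being re-queued is to
count its enqueue_op/dequeue_op lines by request id; a sketch, assuming the
default log path for osd.169 and reusing a request id from the snippet above:

    # more than one enqueue_op hit for the same request id would point at re-queueing
    grep 'client.329656.0:165865996' /var/log/ceph/ceph-osd.169.log | grep -c enqueue_op
    grep 'client.329656.0:165865996' /var/log/ceph/ceph-osd.169.log | grep -c dequeue_op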

> I followed the directions at [1] in the writeback section for
> disabling the cache tier. I did this on another cluster recently and
> didn't see this problem (only took an hour or two to evict 700GB of
> data). I saw a very slow evict a while back, but just chalked it up to
> btrfs being dumb.
> 
> [1] http://docs.ceph.com/docs/master/rados/operations/cache-tiering/#removing-a-writeback-cache
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
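
For context, the writeback-removal procedure referenced in [1] amounts to
roughly the following. This is a sketch only: the cache pool name (ssd-pool)
comes from the log above, and the backing pool is left as a placeholder
because it is not named in the thread.

    # send new/updated objects straight to the backing pool
    ceph osd tier cache-mode ssd-pool forward
    # flush and evict everything from the cache pool (the step that is crawling here)
    rados -p ssd-pool cache-flush-evict-all
    # once the cache pool is empty, remove the overlay and detach the tier
    ceph osd tier remove-overlay <backing-pool>
    ceph osd tier remove <backing-pool> ssd-pool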


