I didn't look into it closely, but that almost certainly means that your
queue is reordering primary->replica replicated write messages.
-Sam

On Wed, Nov 4, 2015 at 8:54 AM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> I've got some rough code that replaces the token bucket queue in
> PrioritizedQueue.h with a weighted round robin queue, located at [1].
> Even though there are still some optimizations that can be done,
> running the fio job [2] I've seen about a 20% performance increase on
> spindles and about a 6% increase on SSDs (my hosts are CPU bound on
> SSD).
>
> The idea of this queue is to be fair to all OPs relative to their
> priority while at the same time reducing the overhead of each OP
> (queue and dequeue) from O(n) to closer to O(1).
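> To make the idea concrete, here is a minimal sketch of the approach
> (illustration only: the class name, the members, and the rule that a
> bucket's share per visit equals its priority value are all made up for
> this example; the real patch at [1] differs in the details). Each
> priority gets its own FIFO, and dequeue cycles through the priorities,
> granting each bucket up to its weighted share of OPs per visit before
> moving on:
>
> #include <algorithm>
> #include <cassert>
> #include <cstddef>
> #include <deque>
> #include <map>
>
> // Sketch of a weighted round robin queue, not the actual patch.
> template <typename T>
> class SimpleWRRQueue {
>   // priority -> FIFO of OPs at that priority; empty buckets are kept
>   // around so the cursor below stays valid
>   std::map<unsigned, std::deque<T> > buckets;
>   // round robin cursor, plus how many OPs the current bucket has been
>   // granted on this visit (its share per visit is its priority value)
>   typename std::map<unsigned, std::deque<T> >::iterator cur;
>   unsigned granted;
>   std::size_t total;
>
> public:
>   SimpleWRRQueue() : cur(buckets.end()), granted(0), total(0) {}
>
>   void enqueue(unsigned priority, T op) {
>     buckets[priority].push_back(op);  // touches a single bucket
>     ++total;
>   }
>
>   bool empty() const { return total == 0; }
>
>   T dequeue() {
>     assert(!empty());
>     // Skip to the next non-empty bucket once the current one is
>     // drained or has used up its weighted share for this visit
>     // (treat priority 0 as a share of 1 so it still makes progress).
>     while (cur == buckets.end() || cur->second.empty() ||
>            granted >= std::max(cur->first, 1u)) {
>       if (cur == buckets.end() || ++cur == buckets.end())
>         cur = buckets.begin();
>       granted = 0;
>     }
>     T op = cur->second.front();
>     cur->second.pop_front();
>     ++granted;
>     --total;
>     return op;
>   }
> }
>
> With weights like that, a priority 63 bucket can take up to 63 OPs per
> pass while a priority 1 bucket still gets one, so low priority OPs make
> progress every round instead of waiting on tokens. Dequeue can still
> scan past empty buckets, but the number of distinct priorities is small
> and fixed, so both operations stay close to O(1) in practice.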
> One issue I'm having is that under certain workloads, usually during
> recovery, I hit these asserts and need help pinpointing how to resolve
> them:
>
> osd/PG.cc: In function 'void PG::add_log_entry(const pg_log_entry_t&,
> ceph::bufferlist&)' thread 7f55d61fd700 time 2015-11-03 14:44:28.638112
> osd/PG.cc: 2923: FAILED assert(e.version > info.last_update)
> osd/PG.cc: In function 'void PG::add_log_entry(const pg_log_entry_t&,
> ceph::bufferlist&)' thread 7f55d7a00700 time 2015-11-03 14:44:28.637053
> osd/PG.cc: 2923: FAILED assert(e.version > info.last_update)
> ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x76) [0xc1e3a6]
> 2: ceph-osd() [0x7d5a7c]
> 3: (PG::append_log(std::vector > const&, eversion_t, eversion_t,
> ObjectStore::Transaction&, bool)+0x111) [0x7f7181]
> 4: (ReplicatedPG::log_operation(std::vector > const&,
> boost::optional&, eversion_t const&, eversion_t const&, bool,
> ObjectStore::Transaction*)+0xad) [0x8bfc7d]
> 5: (void ReplicatedBackend::sub_op_modify_impl(std::tr1::shared_ptr)+0x7b9)
> [0xa5e119]
> 6: (ReplicatedBackend::sub_op_modify(std::tr1::shared_ptr)+0x4a) [0xa4950a]
> 7: (ReplicatedBackend::handle_message(std::tr1::shared_ptr)+0x363) [0xa49923]
> 8: (ReplicatedPG::do_request(std::tr1::shared_ptr&,
> ThreadPool::TPHandle&)+0x159) [0x847ae9]
> 9: (OSD::dequeue_op(boost::intrusive_ptr, std::tr1::shared_ptr,
> ThreadPool::TPHandle&)+0x3cf) [0x690cef]
> 10: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x469) [0x691359]
> 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89e)
> [0xc0d8ae]
> 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc0fa00]
> 13: (()+0x80a4) [0x7f55f9edd0a4]
> 14: (clone()+0x6d) [0x7f55f843904d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>`, is
> needed to interpret this.
>
> I think this means that the log entry being appended is not newer than
> the PG's last_update (i.e., it arrived out of order), but I'm not sure
> how to rectify it. Any pushes in the right direction would be helpful.
>
> It seems that this queue is helping with recovery ops even with
> osd_max_backfills=20 under maximum client OPs, but I don't have good
> long-term data due to this issue. I think it has also impacted my SSD
> testing, since I lose one OSD during the test, which temporarily
> reduces performance.
>
> When looking through my code, please remember:
> 1. This may be the first time I've written C++ code, or it has been
> long enough that it feels like it.
> 2. There are still some optimizations that I know can be done, but I'm
> happy to have people share any optimization opportunities they see.
> 3. I'm trying to understand the reason for the assert and get pointers
> on how to resolve it.
> 4. It seems that multiple threads service the queue, keeping the
> individual queues pretty small. How can I limit the queue to one
> thread so all OPs have to go through a single queue? I'd like to see
> the difference that makes.
> 5. I'd appreciate any pointers for improving this code.
>
> Thank you,
> Robert LeBlanc
>
> [1] https://github.com/ceph/ceph/compare/hammer...rldleblanc:wrr-queue
> [2] [rbd-test]
> #readwrite=write
> #blocksize=4M
> runtime=600
> name=rbd-test
> readwrite=randrw
> bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
> rwmixread=72
> norandommap
> #size=1T
> #blocksize=4k
> ioengine=rbd
> rbdname=test5
> pool=ssd-pool
> clientname=admin
> iodepth=8
> numjobs=4
> thread
> group_reporting
> time_based
> #direct=1
> ramp_time=60
>
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1