Re: can an OSD affect performance of pool X when it has blocking/slow requests on PGs from pool Y?

Definitely, in our case the OSDs were not the guilty ones, since all the OSDs that were blocking requests (always from the same pool) worked flawlessly (and still do) after we deleted the pool where we always saw the blocked PGs.

Since the pool was accessed by just one client and had almost no ops going to it, I really don't know how to reproduce the issue, but it surely scares me that it could ever happen again, especially considering that blocked IOPS on one OSD can cascade through the whole cluster and block all the other pools.

It was hard to explain to management that a single 1.5 GB pool locked up almost 250 VMs backed by other, TB-sized pools, and above all that we have no root cause (meaning, why only that pool generated blocked IOPS).
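If it ever happens again, something like the sketch below (rough and untested; osd.12 is just a placeholder, and the op description format may vary between releases) is what I'd use to check whether every blocked op on an OSD belongs to the same pool:

#!/usr/bin/env python
# Rough, untested sketch: count one OSD's in-flight ops per pool, to check
# whether the blocked requests all point at a single pool. Assumes it runs on
# that OSD's host (admin socket access); osd id and parsing are illustrative.
import json
import re
import subprocess

OSD_ID = "12"  # placeholder: an OSD reported as having blocked requests

def ceph_json(args):
    out = subprocess.check_output(["ceph"] + args, universal_newlines=True)
    return json.loads(out)

# pool id -> pool name
pools = {p["poolnum"]: p["poolname"]
         for p in ceph_json(["osd", "lspools", "--format", "json"])}

# Ops currently queued or executing on this OSD (admin socket, JSON by default).
ops = ceph_json(["daemon", "osd." + OSD_ID, "dump_ops_in_flight"]).get("ops", [])

per_pool = {}
for op in ops:
    # Descriptions look roughly like
    #   "osd_op(client.4123.0:55 rbd_data.xxx [write ...] 7.2a8c9f01 ...)"
    # where the "7" before the dot is the pool id (format varies by release).
    m = re.search(r"\s(\d+)\.[0-9a-f]+\s", op.get("description", ""))
    if m:
        name = pools.get(int(m.group(1)), "pool " + m.group(1))
        per_pool[name] = per_pool.get(name, 0) + 1

for name, count in sorted(per_pool.items(), key=lambda kv: -kv[1]):
    print("%6d in-flight ops  %s" % (count, name))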

Hope to hear some more technical insight, or from someone else who has gone through the same thing.
Best.

On Thu, Mar 23, 2017 at 5:47 AM, Peter Maloney <peter.maloney@xxxxxxxxxxxxxxxxxxxx> wrote:
I think Greg (who appears to be a Ceph committer) basically said he was interested in looking at it, if only you still had the pool that failed this way.

Why not try to reproduce it, and keep a log of your procedure so he can reproduce it too? What caused the slow requests... copy-on-write from snapshots? A bad disk? Exclusive-lock with two clients writing at the same time, maybe?
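When it does happen, dumping the slowest recent ops from the affected OSD's admin socket usually shows where the time went; here is a rough, untested sketch (osd.3 is just a placeholder, and the historic-ops JSON layout differs a bit between releases):

# Rough sketch: print the slowest recent ops on one OSD and the per-step
# events they went through (waiting for subops, journal, rw locks, ...).
# osd.3 is a placeholder; key names differ a little between Ceph releases.
import json
import subprocess

hist = json.loads(subprocess.check_output(
    ["ceph", "daemon", "osd.3", "dump_historic_ops"], universal_newlines=True))

for op in hist.get("Ops", hist.get("ops", [])):   # hammer-era output uses "Ops"
    print("%.3fs  %s" % (float(op["duration"]), op["description"]))
    type_data = op.get("type_data", [])
    # In hammer-era output the last element of type_data is the event list;
    # newer releases nest the events under a dict instead.
    if isinstance(type_data, dict):
        events = type_data.get("events", [])
    elif type_data:
        events = type_data[-1]
    else:
        events = []
    for ev in events:
        print("    %s  %s" % (ev.get("time"), ev.get("event")))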

I'd be interested in a solution too... like, why can't idle disks (a non-full disk queue) mean that the OSD op queue (or whichever queue it is) can still accept requests not related to the blocked PG/objects? I would love for Ceph to handle this better. I suspect some issues I have are related to this (slow requests on one VM can freeze others [likely the OSD's fault], sometimes even requiring a kill -9 [likely the client librbd's fault]).
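If I understand the throttling right, the relevant knobs are the per-OSD client message throttles; a quick, hedged way to see what an OSD is configured with (again with a placeholder osd id):

# Hedged sketch: read the incoming client-message throttle limits from one
# OSD's admin socket. osd.3 is a placeholder; defaults vary by release.
import json
import subprocess

cfg = json.loads(subprocess.check_output(
    ["ceph", "daemon", "osd.3", "config", "show"], universal_newlines=True))

for opt in ("osd_client_message_cap", "osd_client_message_size_cap"):
    print("%s = %s" % (opt, cfg.get(opt)))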

On 03/22/17 16:18, Alejandro Comisario wrote:
Any thoughts?

On Tue, Mar 14, 2017 at 10:22 PM, Alejandro Comisario <alejandro@xxxxxxxxxxx> wrote:
Greg, thanks for the reply.
It's true that I can't provide enough information to know what happened, since the pool is gone.

But based on your experience, could I take some of your time and ask for the top 5 things that could happen to / could be the reason behind what happened to that pool (or any pool), that would make Ceph (maybe specifically Hammer) behave like that?

Information that I think will be of value: the cluster was 5 nodes large, running "0.94.6-1trusty". I added two nodes running the latest "0.94.9-1trusty", and the rebalancing onto those new disks never finished, since I saw WEIRD errors on the new OSDs. I thought the packages needed to be the same, so I "apt-get upgraded" the 5 old nodes without restarting anything, and then rebalancing proceeded without errors (WEIRD).
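(For what it's worth, here is an untested sketch of how one could confirm which version each OSD daemon is actually running after an apt-get upgrade, since new binaries only take effect once the daemons restart:)

# Small untested sketch: ask every OSD daemon which version it is actually
# running (package upgrades only take effect once the daemon restarts).
import json
import subprocess

osd_ids = json.loads(subprocess.check_output(
    ["ceph", "osd", "ls", "--format", "json"], universal_newlines=True))

for i in osd_ids:
    ver = subprocess.check_output(
        ["ceph", "tell", "osd.%d" % i, "version"], universal_newlines=True).strip()
    print("osd.%d: %s" % (i, ver))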

After the disks on those two nodes reached 100% of their weight, the cluster worked perfectly for about two weeks, until this happened.
Since the resolution described in my first email, everything has been working perfectly.

Thanks for the responses.
 

On Fri, Mar 10, 2017 at 4:23 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:


On Tue, Mar 7, 2017 at 10:18 AM Alejandro Comisario <alejandro@xxxxxxxxxxx> wrote:
Gregory, thanks for the response. What you've said is by far the most enlightening thing I've learned about Ceph in a long time.

What raises even greater doubt is that this "non-functional" pool was only 1.5 GB large, vs. 50-150 GB for the other affected pools; the tiny pool was still being used, and just because that pool was blocking requests, the whole cluster was unresponsive.

So, what do you mean by a "non-functional" pool? How can a pool become non-functional? And what assures me that tomorrow another pool won't become non-functional, given that all I did to fix the whole problem was delete the 1.5 GB pool?

Well, you said there were a bunch of slow requests. That can happen any number of ways, if you're overloading the OSDs or something.
When there are slow requests, those ops take up OSD memory and throttle, and so they don't let in new messages until the old ones are serviced. This can cascade across a cluster -- because everything is interconnected, clients and OSDs end up with all their requests targeted at the slow OSDs which aren't letting in new IO quickly enough. It's one of the weaknesses of the standard deployment patterns, but it usually doesn't come up unless something else has gone pretty wrong first.
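You can roughly watch this happening on a live OSD: when the client message/byte throttle sits at its max and its wait counters keep growing, that OSD has stopped admitting new client messages, whatever pool they are for. A hedged sketch (placeholder osd id; counter names may differ between releases):

# Rough sketch: watch one OSD's client message/byte throttles. If "val" sits
# at "max" and the "wait" counters keep growing, the OSD is no longer
# admitting new client messages, regardless of which pool they target.
# osd.3 is a placeholder.
import json
import subprocess

perf = json.loads(subprocess.check_output(
    ["ceph", "daemon", "osd.3", "perf", "dump"], universal_newlines=True))

for name in ("throttle-osd_client_messages", "throttle-osd_client_bytes"):
    t = perf.get(name, {})
    # "wait" is an avgcount/sum pair in the perf dump.
    print("%s: val=%s max=%s wait=%s" %
          (name, t.get("val"), t.get("max"), t.get("wait")))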
As for what actually went wrong here, you haven't provided nearly enough information, and probably can't now that the pool has been deleted. *shrug*
-Greg




--
Alejandro Comisario
CTO | NUBELIU
E-mail: alejandro@xxxxxxxxxxx
Cell: +54 9 11 3770 1857
www.nubeliu.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
