slow OSD brings down the cluster

You can use the

 ceph osd perf

command to get recent queue latency stats for all OSDs.  With a bit 
of sorting this should quickly tell you if any OSDs are going 
significantly slower than the others.
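
The "bit of sorting" could look something like the Python sketch below. It assumes the plain-text output is a header row followed by whitespace-separated rows of OSD id, commit latency and apply latency in milliseconds. The exact column names differ between releases, the sample text is made up for illustration, and `parse_osd_perf` and `slowest_osds` are hypothetical helper names.

```python
from typing import List, Tuple

# Made-up sample in the assumed layout: osd id, commit latency, apply latency (ms).
SAMPLE = """\
osd fs_commit_latency(ms) fs_apply_latency(ms)
  0                    12                   15
  1                    10                   11
  2                   480                  512
  3                     9                   14
"""


def parse_osd_perf(text: str) -> List[Tuple[int, int, int]]:
    """Return (osd_id, commit_ms, apply_ms) for every purely numeric 3-column row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and anything that is not three integers.
        if len(fields) != 3 or not all(f.isdigit() for f in fields):
            continue
        osd_id, commit_ms, apply_ms = (int(f) for f in fields)
        rows.append((osd_id, commit_ms, apply_ms))
    return rows


def slowest_osds(rows: List[Tuple[int, int, int]], n: int = 5) -> List[Tuple[int, int, int]]:
    """Return the n rows with the highest apply latency, slowest first."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:n]


if __name__ == "__main__":
    for osd_id, commit_ms, apply_ms in slowest_osds(parse_osd_perf(SAMPLE), n=2):
        print(f"osd.{osd_id}: commit={commit_ms}ms apply={apply_ms}ms")
```

To run it against a live cluster, the sample string could be replaced with the captured output of the command, for example via `subprocess.run(["ceph", "osd", "perf"], capture_output=True, text=True, check=True).stdout`.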

We'd like to automate this in calamari or perhaps even in the monitor, but 
it is not immediately clear what thresholds would provide a useful 
signal without generating noise...
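
Purely as an illustration of the threshold problem, and not something calamari or the monitor does, one starting point is to compare each OSD against the cluster median and flag those exceeding it by some factor, with an absolute floor so that a mostly idle cluster with tiny latencies does not generate noise. The function name, the factor and the floor below are arbitrary assumptions that would need tuning on a real cluster.

```python
from statistics import median
from typing import Dict, List


def flag_slow_osds(
    latency_ms: Dict[int, float],
    factor: float = 5.0,    # assumed multiple of the median that counts as "slow"
    floor_ms: float = 50.0  # assumed minimum latency worth reporting at all
) -> List[int]:
    """Return ids of OSDs whose latency exceeds max(factor * median, floor_ms)."""
    if not latency_ms:
        return []
    threshold = max(factor * median(latency_ms.values()), floor_ms)
    return sorted(osd for osd, ms in latency_ms.items() if ms > threshold)


if __name__ == "__main__":
    # Hypothetical per-OSD latencies in milliseconds.
    print(flag_slow_osds({0: 15, 1: 11, 2: 512, 3: 14}))
```

Using the median rather than the mean keeps a single very slow OSD from dragging the baseline up and hiding itself.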

sage


On Wed, 6 Aug 2014, Luis Periquito wrote:

> Hi Wido,
> 
> as the backing disk is running a deep scrub it's constantly 100% busy, no
> errors though...
> 
> I'm running everything on XFS.
> 
> I had a similar feeling that it was the OSD slowing down those requests. What
> would be the affected pool? ".rgw"?
> 
> thanks,
> 
> 
> On 6 August 2014 10:08, Wido den Hollander <wido at 42on.com> wrote:
>       On 08/06/2014 10:43 AM, Luis Periquito wrote:
>             Hi,
> 
>             In the last few days I've had some issues with the
>             radosgw in which all
>             requests would just stop being served.
> 
>             After some investigation I suspect a single
>             slow OSD. Each time, I just
>             restarted that OSD and everything went back
>             to working. Every
>             single time there was a deep scrub running on that
>             OSD.
> 
>             This has happened in several different OSDs, running
>             in different
>             machines. I currently have 32 OSDs on this cluster,
>             with 4 OSD per host.
> 
>             First, should this happen? A single OSD with
>             issues/slowness
>             shouldn't bring the whole cluster to a crawl...
> 
> 
> So, it's not the whole cluster which is slow, but the RGW is
> requesting objects which are in a PG for which that OSD is currently
> primary.
> 
> For you it seems like the whole cluster is down, but it's just 'bad
> luck' in this case.
> 
> Have you checked if there is anything wrong with the backing disk?
> 100% busy? Read errors?
> 
> You can also simply mark the OSD as 'out' and leave it out of the cluster.
> Re-format the whole OSD and see if it comes back.
> 
> Are you using btrfs by any chance?
> 
> Wido
> 
>       How can I make it stop happening? What kind of debug
>       information can I
>       gather to stop this from happening?
> 
>       any further thoughts?
> 
>       I'm still running Emperor (0.72.2).
> 
>       --
> 
>       Luis Periquito
> 
>       Unix Engineer
> 
> 
> Ocado.com <http://www.ocado.com/>
> 
> 
> Head Office, Titan Court, 3 Bishop Square, Hatfield Business
> Park,
> Hatfield, Herts AL10 9NE
> 
> 
> Notice: This email is confidential and may contain copyright material of
> members of the Ocado Group. Opinions and views expressed in this message may
> not necessarily reflect the opinions and views of the members of the Ocado
> Group.
> 
> If you are not the intended recipient, please notify us immediately and
> delete all copies of this message. Please note that it is your
> responsibility to scan this message for viruses.
> 
> References to the "Ocado Group" are to Ocado Group plc (registered in England
> and Wales with number 7098618) and its subsidiary undertakings (as that
> expression is defined in the Companies Act 2006) from time to time. The
> registered office of Ocado Group plc is Titan Court, 3 Bishops Square,
> Hatfield Business Park, Hatfield, Herts. AL10 9NE.
> 
> 
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> --
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on

