slow OSD brings down the cluster

Hi Mark,

I've been playing with the reweight on 3 of the OSDs (BTW each OSD is
backed by an HDD, with one SSD holding all 4 journals on each host), and
these slower ones were given reweights of 0.5, 0.66 and 0.66.

From what I gathered, the reweight should also reduce the number of I/Os
directed at that OSD, since it reduced the number of PGs mapped to that
OSD and triggered the rebalancing. However, using iostat I've seen that the
load on these is usually on par with or above the rest of the cluster, thus
making them slower. I'm measuring the load as total IOPS. All these HDDs
are used exclusively for the OSD role.
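
For reference, "total IOPS" per disk can be derived from two samples of
/proc/diskstats, which is where iostat gets its counters on Linux. The
sketch below is only an illustration of that measurement: the function
names and the 5-second interval are made up for the example, and it assumes
the standard diskstats layout (4th field = reads completed, 8th field =
writes completed).

```python
import time


def parse_diskstats(text):
    """Map device name -> (reads completed, writes completed)."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or truncated lines
        # fields: major, minor, name, reads completed, ..., writes completed
        counters[fields[2]] = (int(fields[3]), int(fields[7]))
    return counters


def total_iops(before, after, interval_s):
    """Total (read + write) IOPS per device between two samples."""
    result = {}
    for name, (reads_1, writes_1) in after.items():
        if name not in before:
            continue  # device appeared between the two samples
        reads_0, writes_0 = before[name]
        result[name] = ((reads_1 - reads_0) + (writes_1 - writes_0)) / interval_s
    return result


def sample_iops(interval_s=5.0, path="/proc/diskstats"):
    """Take two samples interval_s apart and return per-device total IOPS."""
    with open(path) as stats:
        before = parse_diskstats(stats.read())
    time.sleep(interval_s)
    with open(path) as stats:
        after = parse_diskstats(stats.read())
    return total_iops(before, after, interval_s)
```

Calling sample_iops() on each host and comparing the reweighted disks
against the rest gives the same kind of comparison described above.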

Also, I'm thinking of moving some smaller pools onto SSDs, and I'm using
"df detail" to try to figure out which ones I should move, but I'm unsure
what the READ and WRITE columns mean.

thanks,


On 6 August 2014 17:10, Mark Nelson <mark.nelson at inktank.com> wrote:

> On 08/06/2014 03:43 AM, Luis Periquito wrote:
>
>> Hi,
>>
>> In the last few days I've had some issues with the radosgw in which all
>> requests would just stop being served.
>>
>> After some investigation I would trace it to a single slow OSD. I just
>> restarted that OSD and everything would go back to working. Every
>> single time there was a deep scrub running on that OSD.
>>
>> This has happened on several different OSDs, running on different
>> machines. I currently have 32 OSDs in this cluster, with 4 OSDs per host.
>>
>> First question: should this happen? A single OSD with issues/slowness
>> shouldn't bring the whole cluster to a crawl...
>>
>
> When a client is writing data out to the cluster, it will issue some
> number of operations that it can have in flight at once.  This has to be
> bounded at some level to avoid running out of memory.  Ceph will distribute
> those writes to some number of PGs in a pseudo-random way, and those
> PGs map to specific OSDs where the data will be placed. One of the big
> advantages of CRUSH is that it lets the client figure this mapping out
> itself based on the object name and the cluster topology, so you remove a
> centralized allocation table lookup from the data path, which can be a huge
> win vs other large-scale distributed systems.
>
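
The client-side mapping described above can be pictured with a toy
example. This is not the CRUSH algorithm (which walks a weighted
hierarchy with its own hash); it is a simplified stand-in using SHA-256
and rendezvous hashing, with made-up function names, that only shows how
a client can go from object name to PG to a set of OSDs with no lookup
table.

```python
import hashlib


def stable_hash(text):
    """Deterministic 64-bit hash of a string (stand-in for Ceph's own hash)."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def object_to_pg(object_name, pg_count):
    """Object name -> placement group, computed purely from the name."""
    return stable_hash(object_name) % pg_count


def pg_to_osds(pg, osd_ids, replicas):
    """Placement group -> replica OSDs via rendezvous hashing.

    Every client ranks the OSDs identically for a given PG, so all clients
    agree on the placement without asking a central service.
    """
    ranked = sorted(osd_ids, key=lambda osd: stable_hash(f"{pg}:{osd}"), reverse=True)
    return ranked[:replicas]


def locate(object_name, pg_count, osd_ids, replicas=3):
    """Full client-side lookup: object name -> list of OSD ids."""
    return pg_to_osds(object_to_pg(object_name, pg_count), osd_ids, replicas)
```

For instance, locate("bucket/object-1", 1024, list(range(32))) returns
the same three OSD ids on every client, and the choice takes no account
of how busy each OSD currently is.
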
> The downside is that it means that in a setup where you have 1 disk behind
> each OSD (typically the best setup for Ceph right now), every disk will
> receive a relatively even (or potentially weighted) percentage of the
> writes regardless of how fast/slow/busy it is.  If a single OSD is slower
> than the others, over time it is likely to accumulate enough outstanding
> IOs that eventually nearly every client IO will be waiting on that OSD.
> The rest of the OSDs in the cluster will only get new IOs once an IO
> completes on the slow one.
>
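
The effect described above can be reproduced with a small toy model. It
is a sketch under simplifying assumptions that are not from the thread:
one client with a fixed window of in-flight ops, uniformly random
placement, a fixed service time per op, and one OSD that is slow_factor
times slower than the rest. All names and default values are
illustrative.

```python
import random


def simulate(num_osds=32, window=64, slow_osd=0, slow_factor=10,
             ticks=5000, seed=1):
    """Return (share of in-flight ops stuck on the slow OSD, ops per tick)."""
    rng = random.Random(seed)
    service = [1] * num_osds       # ticks needed to serve one op
    service[slow_osd] = slow_factor
    queued = [0] * num_osds        # ops waiting or in service on each OSD
    remaining = [0] * num_osds     # ticks left on the op currently in service
    completed = 0

    def issue():
        # The client picks an OSD without regard to how busy it is.
        osd = rng.randrange(num_osds)
        if queued[osd] == 0:
            remaining[osd] = service[osd]
        queued[osd] += 1

    for _ in range(window):        # fill the in-flight window
        issue()

    for _ in range(ticks):
        freed = 0
        for osd in range(num_osds):
            if queued[osd] == 0:
                continue
            remaining[osd] -= 1
            if remaining[osd] == 0:
                queued[osd] -= 1
                freed += 1
                if queued[osd] > 0:
                    remaining[osd] = service[osd]
        completed += freed
        for _ in range(freed):     # a new op is issued only when one completes
            issue()

    return queued[slow_osd] / window, completed / ticks
```

Comparing simulate(slow_factor=10) with simulate(slow_factor=1) shows
most of the window piling up on the slow OSD and overall throughput
dropping, even though the other 31 OSDs are mostly idle.
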
> Some day, maybe after the keyfilestore is implemented, I think it would be
> a very interesting experiment to try a hybrid approach where you use CRUSH
> to distribute data to nodes, but behind the OSDs you use something like an
> allocation table and dynamically change the ratio of writes to different
> filesystems or key/value stores based on how slow/busy they are (especially
> during compaction, directory splitting, scrub, or if there's a really hot
> object on a specific disk).  You can still avoid the network allocation
> table lookup, but potentially within the node, if you can do it fast
> enough, you might be able to gain some level of adaptability and
> (hopefully) more consistent throughput.
>
> Mark
>
>
>> How can I make it stop happening? What kind of debug information can I
>> gather to stop this from happening?
>>
>> Any further thoughts?
>>
>> I'm still running Emperor (0.72.2).
>>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 

Luis Periquito

Unix Engineer

Ocado.com <http://www.ocado.com/>

Head Office, Titan Court, 3 Bishop Square, Hatfield Business Park,
Hatfield, Herts AL10 9NE

-- 


Notice:  This email is confidential and may contain copyright material of 
members of the Ocado Group. Opinions and views expressed in this message 
may not necessarily reflect the opinions and views of the members of the 
Ocado Group.

If you are not the intended recipient, please notify us immediately and 
delete all copies of this message. Please note that it is your 
responsibility to scan this message for viruses.  

References to the "Ocado Group" are to Ocado Group plc (registered in 
England and Wales with number 7098618) and its subsidiary undertakings (as 
that expression is defined in the Companies Act 2006) from time to time.  
The registered office of Ocado Group plc is Titan Court, 3 Bishops Square, 
Hatfield Business Park, Hatfield, Herts. AL10 9NE.