Re: OSDs are crashing during PG replication

On Mar 11, 2016 3:12 PM, "Alexander Gubanov" <shtnik@xxxxxxxxx> wrote:
>
> Sorry, I didn't have time to answer.
>
> >First you said 2 OSDs crashed every time. From the log you pasted,
> >it makes sense to do something about osd.3.
>
> The problem is a single PG, 3.2. This PG is on osd.3 and osd.16, and both of these OSDs crashed every time.
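>
> To confirm which OSDs a PG maps to (3.2 in this case), something like
> this works:
>
> # ceph pg map 3.2
>
> which prints the up and acting OSD sets for the PG.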
>
> >> rm -rf
> >> /var/lib/ceph/osd/ceph-4/current/3.2_head/rb.0.19f2e.238e1f29.000000000728__head_813E90A3__3
>
> >What confuses me now is this.
> >Did osd.4 also crash like osd.3?
>
> I thought that the problem was osd.3 or osd.16. I tried to disable these osds:
> # ceph osd crush reweight osd.3 0 
> # ceph osd crush reweight osd.16 0
> but when I did that, two other OSDs crashed, one of them being osd.4, and PG 3.2 was now on osd.4.
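>
> After a reweight, the remapping can be followed with, for example:
>
> # ceph osd tree
> # ceph pg map 3.2
> # ceph health detail
>
> the first showing the new weights, the latter two showing where PG 3.2
> lands and whether it is degraded or stuck.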
>
> After this I decided to remove the cache pool.
> Now I'm moving all the data to a new, larger SSD and so far everything is all right.
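>
> For reference, the usual teardown sequence for a writeback cache tier is
> roughly the following (the pool names here are placeholders):
>
> # ceph osd tier cache-mode ${cache_pool} forward
> # rados -p ${cache_pool} cache-flush-evict-all
> # ceph osd tier remove-overlay ${storage_pool}
> # ceph osd tier remove ${storage_pool} ${cache_pool}
>
> i.e. stop new writes landing in the cache, flush and evict what is
> already there, then detach the tier from the backing pool.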
>

Thanks for letting me know; that's good to hear.

I hope you are enjoying Ceph again!

> On Fri, Mar 4, 2016 at 10:44 AM, Shinobu Kinjo <shinobu.kj@xxxxxxxxx> wrote:
>>
>> Thank you for your explanation.
>>
>> > Every time, 2 of the 18 OSDs crash. I think it happens during PG replication, because only 2 OSDs crash and they are the same ones every time.
>>
>> First you said 2 OSDs crashed every time. From the log you pasted,
>> it makes sense to do something about osd.3.
>>
>> > rm -rf
>> > /var/lib/ceph/osd/ceph-4/current/3.2_head/rb.0.19f2e.238e1f29.000000000728__head_813E90A3__3
>>
>> What confuses me now is this.
>> Did osd.4 also crash like osd.3?
>>
>> >    -1> 2016-02-24 04:51:45.904673 7fd995026700  5 -- op tracker -- , seq: 19231, time: 2016-02-24 04:51:45.904673, event: started, request: osd_op(osd.13.12097:806247 rb.0.218d6.238e1f29.000000010db3 [copy-get max 8388608] 3.94c2bed2 ack+read+ignore_cache+ignore_overlay+map_snap_clone e13252) v4
>>
>> The crash seems to happen during this process; what I really want to
>> know is what this message implies.
>> Did you check osd.13?
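>>
>> If it happens again, dumping the in-flight and recent slow ops on the
>> admin socket of the OSDs involved might tell us more, e.g. on the node
>> hosting osd.13:
>>
>> # ceph daemon osd.13 dump_ops_in_flight
>> # ceph daemon osd.13 dump_historic_ops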
>>
>> Anyhow, your cluster is fine now... no?
>> That's good news.
>>
>> Cheers,
>> Shinobu
>>
>> On Fri, Mar 4, 2016 at 11:05 AM, Alexander Gubanov <shtnik@xxxxxxxxx> wrote:
>> > I decided to stop using an SSD cache pool and to create just 2 pools: the
>> > 1st only of SSDs, for fast storage, and the 2nd only of HDDs, for slow
>> > storage.
>> > As for this file, honestly, I don't know why it is created. As I said, I
>> > flush the journal for the fallen OSD, remove this file, and then start the
>> > OSD daemon:
>> >
>> > ceph-osd -i 3 --flush-journal
>> > rm -rf
>> > /var/lib/ceph/osd/ceph-4/current/3.2_head/rb.0.19f2e.238e1f29.000000000728__head_813E90A3__3
>> > service ceph start osd.3
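>> >
>> > (Before removing an object by hand, its placement can be verified with
>> > "ceph osd map <pool> rb.0.19f2e.238e1f29.000000000728", where <pool> is
>> > whatever pool id 3 is named; that prints the PG and the up/acting OSDs
>> > for the object.)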
>> >
>> > But if I turn the cache pool off, the file isn't created:
>> >
>> > ceph osd tier cache-mode ${cache_pool} forward
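>> >
>> > Forward mode only redirects new requests to the backing pool; if the
>> > goal is to empty the cache as well, something like
>> > "rados -p ${cache_pool} cache-flush-evict-all" should drain it.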
>> >
>>
>>
>>
>> --
>> Email:
>> shinobu@xxxxxxxxx
>> GitHub:
>> shinobu-x
>> Blog:
>> Life with Distributed Computational System based on OpenSource
>
>
>
>
> --
> Alexander Gubanov
>
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
