Sorry, I didn't have time to answer.
>First you said that 2 OSDs crashed every time. From the log you pasted,
>it makes sense to do something about osd.3.
The problem is one PG, 3.2. This PG is on osd.3 and osd.16, and these two OSDs are the ones that crash every time.
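(For reference, which OSDs a given PG maps to can be checked with "ceph pg map", e.g.:
# ceph pg map 3.2
which prints the up and acting OSD sets for that PG.)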
>> rm -rf
>> /var/lib/ceph/osd/ceph-4/current/3.2_head/rb.0.19f2e.238e1f29.000000000728__head_813E90A3__3
>What confuses me now is this.
>Did osd.4 also crash, like osd.3?
I thought that the problem was osd.13 or osd.16, so I tried to disable these OSDs:
# ceph osd crush reweight osd.3 0
# ceph osd crush reweight osd.16 0
but when I did, two other OSDs crashed; one of them was osd.4, and PG 3.2 was on osd.4.
After this I decided to remove the cache pool.
Now I'm moving all the data to a new, bigger SSD, and so far everything is fine.
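(For anyone who wants to do the same: the usual sequence for retiring a cache tier is roughly the one below. The pool names are placeholders for the cache and backing pools, so adjust them for your cluster.)

# ceph osd tier cache-mode ${cache_pool} forward
# rados -p ${cache_pool} cache-flush-evict-all
# ceph osd tier remove-overlay ${storage_pool}
# ceph osd tier remove ${storage_pool} ${cache_pool}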
On Fri, Mar 4, 2016 at 10:44 AM, Shinobu Kinjo <shinobu.kj@xxxxxxxxx> wrote:
Thank you for your explanation.
> Every time, 2 of 18 OSDs crash. I think it happens during PG replication, because only 2 OSDs crash and every time they are the same ones.
First you said that 2 OSDs crashed every time. From the log you pasted,
it makes sense to do something about osd.3.
> rm -rf
> /var/lib/ceph/osd/ceph-4/current/3.2_head/rb.0.19f2e.238e1f29.000000000728__head_813E90A3__3
What confuses me now is this.
Did osd.4 also crash, like osd.3?
> -1> 2016-02-24 04:51:45.904673 7fd995026700 5 -- op tracker -- , seq: 19231, time: 2016-02-24 04:51:45.904673, event: started, request: osd_op(osd.13.12097:806247 rb.0.218d6.238e1f29.000000010db3 [copy-get max 8388608] 3.94c2bed2 ack+read+ignore_cache+ignore_overlay+map_snap_clone e13252) v4
And the crash seems to happen during this process; what I really want to
know is what this message implies.
Did you check osd.13?
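(If it helps: the in-flight ops on an OSD can be dumped through its admin socket on the node hosting it, e.g.:
# ceph daemon osd.13 dump_ops_in_flight
and "ceph pg 3.2 query" shows the state of that PG from the primary's point of view. Both are just generic diagnostics, not specific to your crash.)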
Anyhow your cluster is now fine...no?
That's good news.
Cheers,
Shinobu
On Fri, Mar 4, 2016 at 11:05 AM, Alexander Gubanov <shtnik@xxxxxxxxx> wrote:
> I decided to stop using the SSD cache pool and to create just 2 pools: the 1st pool
> made only of SSDs for fast storage, the 2nd only of HDDs for slow storage.
> As for this file, honestly, I don't know why it is created. As I said, I
> flush the journal for the failed OSD, remove this file, and then start the OSD
> daemon:
>
> ceph-osd --flush-journal osd.3
> rm -rf
> /var/lib/ceph/osd/ceph-4/current/3.2_head/rb.0.19f2e.238e1f29.000000000728__head_813E90A3__3
> service ceph start osd.3
>
> But if I turn the cache pool off, the file isn't created:
>
> ceph osd tier cache-mode ${cache_pool} forward
>
--
Email:
shinobu@xxxxxxxxx
GitHub:
shinobu-x
Blog:
Life with Distributed Computational System based on OpenSource
Alexander Gubanov