Re: HW failure cause client IO drops

Hello,

Are you using a BBU-backed RAID controller? If so, it sounds more like
your write cache is acting up. Can you check what your RAID controller
is showing? I have sometimes seen RAID controllers running consistency
checks or patrol reads on single-drive RAID0 volumes. You can disable
those if they are running.
If it's an LSI-based controller, you can use "MegaCli64 -AdpPR -Dsbl
-aALL" to disable patrol reads or "MegaCli64 -LDCC -Stop -lall
-aall" to stop a consistency check. A BBU learn cycle may also be
active. It discharges the battery and charges it back up, and
writeback cache is disabled in the meantime. Unfortunately, while the
learn cycle is running you will not be able to enable writeback cache.
I recommend enabling read cache and controller read-ahead. Use
"MegaCli64 -LDSetProp -RA -Immediate -Lall -aAll" to enable read-ahead
and "MegaCli64 -LDSetProp -Cached -Immediate -Lall -aAll" to enable
cached I/O.
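
Before changing any setting, it may help to capture what the controller currently reports. The following is a minimal sketch, assuming an LSI/MegaRAID controller with MegaCli64 installed. The names MEGACLI, DRY_RUN, run and check_controller are made up for this example and are not part of MegaCli. The script only issues read-only status queries, and by default it just prints the commands instead of running them. Output format varies between MegaCli versions, so treat it as a starting point.

```shell
#!/bin/sh
# Read-only status checks for an LSI/MegaRAID controller (sketch).
# MEGACLI and DRY_RUN are names chosen for this example, not MegaCli options.
MEGACLI="${MEGACLI:-MegaCli64}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

check_controller() {
    # Battery state, including whether a learn cycle is in progress
    run "$MEGACLI" -AdpBbuCmd -GetBbuStatus -aALL
    # Current cache policy of every logical drive
    run "$MEGACLI" -LDInfo -Lall -aALL
    # Patrol read configuration and state
    run "$MEGACLI" -AdpPR -Info -aALL
    # Progress of any running consistency check
    run "$MEGACLI" -LDCC -ShowProg -Lall -aALL
}

check_controller
```

Run it once as-is to review the commands, then with DRY_RUN=0 as root on the affected node to collect the actual output.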

Now I wouldn't do this, but you can force writeback mode even with the
BBU off. YOU CAN AND YOU WILL LOSE ALL THE OSDS ON THE NODE IF
SOMETHING BAD HAPPENS. Use at your own risk and discretion: "MegaCli64
-LDSetProp -CachedBadBBU -Immediate -Lall -aAll".

If these options don't work, reply and we will try to help you.

On Tue, Apr 16, 2019 at 3:27 PM M Ranga Swami Reddy
<swamireddy@xxxxxxxxx> wrote:
>
> It's the Smart Storage battery, which was disabled due to high ambient temperature.
> All OSD processes/daemons kept running, but those OSDs were not responding to other OSDs due to high CPU utilization.
> We don't observe any clock skew issue.
>
> On Tue, Apr 16, 2019 at 12:49 PM Marco Gaiarin <gaio@xxxxxxxxx> wrote:
>>
>> Mandi! M Ranga Swami Reddy
>>   In chel di` si favelave...
>>
>> > Hello - Recently we had an issue with a storage node's battery failure, which
>> > caused ceph client IO to drop to '0' bytes. This means the ceph cluster couldn't
>> > perform IO operations until the node was taken out. This is not expected from
>> > Ceph: when some HW fails, the respective OSDs should be marked out/down and IO
>> > should continue as usual.
>> > Please let me know if anyone has seen similar behavior and whether this issue
>> > is resolved.
>>
>> Does 'battery' mean 'CMOS battery'?
>>
>>
>> OSDs and MONs need accurate clock sync between them. So, if a node
>> reboots with a clock skew of more than (if I remember correctly) 5
>> seconds, the OSDs do not start.
>>
>> Provide a stable NTP server for all your OSDs and MONs, and restart
>> the OSDs after the clocks are in sync.
>>
>> --
>> dott. Marco Gaiarin                                     GNUPG Key ID: 240A3D66
>>   Associazione ``La Nostra Famiglia''          http://www.lanostrafamiglia.it/
>>   Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
>>   marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797
>>
>>                 Dona il 5 PER MILLE a LA NOSTRA FAMIGLIA!
>>       http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
>>         (cf 00307430132, categoria ONLUS oppure RICERCA SANITARIA)
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



