slow osd problem

Hi all.

In general, a single slow OSD significantly affects every RBD client
in the cluster.

Suppose just one OSD in an RBD pool has significantly increased latency, for example 30 ms while the others are at 0-1 ms.
It has not crashed; for whatever reason it has just slowed down.
Then every RBD client (application) will periodically see fsync (or direct I/O) latency of 10 seconds or more.

As I understand it, the OSD latency is effectively multiplied by the length of some abstract queue between the application and the OSD,
so the write latency seen by the application becomes, for example, over 10 s.
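
To put rough numbers on that guess, here is a minimal sketch. It assumes the simplest possible model, which the post does not confirm: operations bound for the slow OSD are served strictly one at a time, so the last one waits for all those ahead of it. The function names are made up for illustration.

```python
import math


def queued_latency_s(per_op_latency_s: float, queue_depth: int) -> float:
    """Latency seen by the last op when queue_depth ops are served serially.

    Assumption: a strictly serial queue in front of the slow OSD.
    """
    return per_op_latency_s * queue_depth


def depth_for_latency(target_latency_s: float, per_op_latency_s: float) -> int:
    """Smallest serial queue depth that yields at least target_latency_s."""
    return math.ceil(target_latency_s / per_op_latency_s)


if __name__ == "__main__":
    # 30 ms per op on the slow OSD, 10 s observed by the application.
    print(depth_for_latency(10.0, 0.030))
    print(queued_latency_s(0.030, 334))
```

Under this model a 10 s stall at 30 ms per operation would need a backlog of a few hundred operations, which gives a sense of how deep the "abstract queue" would have to be.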

The average interval between such stalls is something like: (number of OSDs) * (client write period) / (number of writes to different RBD objects per period). So if the pool contains 100 OSDs and the app touches 2 RBD objects every second, then roughly once a minute a write will go to the slow OSD and the app may get stuck.
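
The estimate above can be checked with a short sketch. It assumes that each write lands on a uniformly random OSD, independently of the others, and that exactly one OSD is slow; real CRUSH placement and PG weights are ignored. The function names are illustrative only.

```python
def slow_osd_hit_period_s(num_osds: int, writes_per_period: float,
                          write_period_s: float = 1.0) -> float:
    """Average time between writes that land on the single slow OSD.

    Direct transcription of:
    (number of OSDs) * (client write period) / (writes per period).
    """
    return num_osds * write_period_s / writes_per_period


def prob_hit_within(num_osds: int, writes_per_period: float,
                    window_s: float, write_period_s: float = 1.0) -> float:
    """Probability that at least one write hits the slow OSD within window_s.

    Assumption: every write picks one of num_osds OSDs uniformly at random.
    """
    num_writes = writes_per_period * window_s / write_period_s
    return 1.0 - (1.0 - 1.0 / num_osds) ** num_writes


if __name__ == "__main__":
    # 100 OSDs, 2 distinct RBD objects written every second.
    print(slow_osd_hit_period_s(100, 2))   # 50.0 seconds on average
    print(prob_hit_within(100, 2, 60.0))   # chance of a hit within one minute
```

With 100 OSDs and 2 writes per second the average interval comes out at 50 seconds, which matches the "about every minute" figure, and under the same assumption the chance of at least one hit in any given minute is around 70%.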

As I understand it, this is a general issue caused by random data distribution.
But is there any way to handle this case, given that the client really is cluster-aware? For example, a timeout after which an RBD client switches to a secondary OSD in the PG, or even sending duplicate messages from the client to several OSDs in the PG at the same time. Some applications might prefer 2x network usage in exchange for lower latency.
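
To make the "timeout, then try another OSD" idea concrete, here is a generic sketch of a hedged request. It is not Ceph or librbd code and does not claim that RBD clients can do this: `hedged_write`, `fake_send`, the OSD names and the latencies are all hypothetical, and the sketch assumes the write is safe to send more than once.

```python
import asyncio

# Hypothetical per-OSD latencies in seconds, for the simulation only.
LATENCY_S = {"osd.slow": 0.5, "osd.fast": 0.01}


async def fake_send(osd: str) -> str:
    """Stand-in for a real write: sleep for the OSD's latency, return its name."""
    await asyncio.sleep(LATENCY_S[osd])
    return osd


async def hedged_write(replicas, send, hedge_after_s: float):
    """Send to the first replica; after each hedge_after_s of silence, also
    send to the next one. Return the first result and cancel the rest."""
    tasks = []
    try:
        for replica in replicas:
            tasks.append(asyncio.ensure_future(send(replica)))
            done, _ = await asyncio.wait(
                tasks, timeout=hedge_after_s,
                return_when=asyncio.FIRST_COMPLETED)
            if done:
                return done.pop().result()
        # Every replica has been tried; wait for whichever answers first.
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        return done.pop().result()
    finally:
        for task in tasks:
            task.cancel()  # no effect on tasks that already finished


if __name__ == "__main__":
    winner = asyncio.run(
        hedged_write(["osd.slow", "osd.fast"], fake_send, hedge_after_s=0.05))
    print(winner)
```

Setting `hedge_after_s` to zero approximates the "send duplicates to several OSDs at once" variant, trading extra network traffic for lower tail latency. Whether anything like this is compatible with how a PG orders and replicates writes is exactly the open question of the post.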

By the way, since the RBD client does not update the OSD map itself but only gets it from the mon, I'm afraid to think what will happen if the mon has a connection to some OSD and the RBD client, for some reason, does not.





--

Best regards,
Aleksei Gutikov
Software Engineer | synesis.ru | Minsk. BY