Hi All,

As I pointed out earlier, the rdma protocol requires the memory used during rdma reads and writes to be registered with the rdma device, and this registration is a costly operation. To keep it out of the I/O path, we came up with two solutions:

1) Use a separate, pre-registered iobuf_pool for rdma. This approach needs an extra level of copying in rdma for each read/write request, i.e. we have to copy the contents of the memory given by the application into rdma's own buffers inside the rdma code.

2) Register the default iobuf_pool in glusterfs_ctx with the rdma device during rdma initialization. Since the buffers used for read/write then come from the already-registered default pool, we need neither per-request registration nor copying. The problem comes when the io-cache translator is turned on: for each page fault, io-cache takes a ref on the iobuf of the response buffer in order to cache it, so all the pre-allocated buffers get pinned by io-cache very quickly. Eventually new requests get iobufs from newly created arenas/pools that are not registered with rdma, and we are back to registering every iobuf.

To address this issue, we can:

i) Turn off io-cache (we chose this for testing).
ii) Use a separate buffer for io-cache and offload data from the default pool into that buffer (needs a new thread to do the offload).
iii) Dynamically register each newly created arena with rdma; a rough sketch of such a registration call is included below. This requires bringing libglusterfs code and transport-layer code together, so it will need packaging changes and may introduce a hard dependency on the rdma libraries.
iv) Increase the default pool size (this will increase the memory footprint of the glusterfs process).

We implemented two of these, (1) and (2i), to get some performance numbers. The setup was a 4x2 distributed-replicated volume using ram disks as bricks to avoid a hard-disk bottleneck. The numbers are attached below.

Please share your thoughts on these approaches.

Regards,
Rafi KC
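[Illustrative sketch for (2)/(iii), not the actual glusterfs code: how an arena's backing memory could be registered once with the rdma device through libibverbs, so iobufs carved from it can be used for rdma read/write without per-I/O registration or copying. The arena_t struct and the function names are placeholders I made up for the example; only ibv_reg_mr()/ibv_dereg_mr() and the IBV_ACCESS_* flags are real ibverbs API.]

/* Sketch only -- not the glusterfs iobuf/rdma implementation.
 * Registers a whole arena (contiguous slab) with the rdma device once,
 * so every iobuf allocated from it is already usable for rdma I/O.
 * Build (roughly): gcc -c arena_reg.c -libverbs
 */
#include <stddef.h>
#include <stdio.h>
#include <infiniband/verbs.h>

/* Hypothetical stand-in for an iobuf arena: one contiguous slab of
 * memory that individual iobufs are carved out of. */
typedef struct arena {
        void          *mem_base;    /* start of the slab            */
        size_t         arena_size;  /* slab size in bytes           */
        struct ibv_mr *mr;          /* handle after registration    */
} arena_t;

/* Register the slab once; buffers from it can then be posted for rdma
 * read/write using mr->lkey / mr->rkey with no further registration. */
int
arena_register_with_rdma (arena_t *arena, struct ibv_pd *pd)
{
        arena->mr = ibv_reg_mr (pd, arena->mem_base, arena->arena_size,
                                IBV_ACCESS_LOCAL_WRITE |
                                IBV_ACCESS_REMOTE_READ |
                                IBV_ACCESS_REMOTE_WRITE);
        if (!arena->mr) {
                fprintf (stderr, "ibv_reg_mr failed for arena %p "
                         "(%zu bytes)\n", arena->mem_base,
                         arena->arena_size);
                return -1;
        }
        return 0;
}

/* Deregister when the arena is destroyed or the transport is torn down. */
void
arena_unregister (arena_t *arena)
{
        if (arena->mr) {
                ibv_dereg_mr (arena->mr);
                arena->mr = NULL;
        }
}

[In option (2) a call like this would be made once per arena of the default iobuf_pool at rdma init time; option (iii) would hook the same call into arena creation inside libglusterfs, which is exactly what pulls the transport's verbs context into libglusterfs and raises the packaging/dependency concern mentioned above.]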
Results (dd over the mount, 1000 MB file):

Run   | Separate buffer for rdma (1)   | No change                      | Register default iobuf_pool (2i)
      | write | read  | io-cache off   | write | read  | io-cache off   | write | read  | io-cache off
1     | 373   | 527   | 656            | 343   | 483   | 532            | 446   | 512   | 696
2     | 380   | 528   | 668            | 347   | 485   | 540            | 426   | 525   | 715
3     | 376   | 527   | 594            | 346   | 482   | 540            | 422   | 526   | 720
4     | 381   | 533   | 597            | 348   | 484   | 540            | 413   | 526   | 710
5     | 372   | 527   | 479            | 347   | 482   | 538            | 422   | 519   | 719   (Note: varying result)
Avg   | 376.4 | 528.4 | 598.8          | 346.2 | 483.2 | 538            | 425.8 | 521.6 | 712

Commands used:

read:
  echo 3 > /proc/sys/vm/drop_caches; dd if=/home/ram0/mount0/foo.txt of=/dev/null bs=1024K count=1000

write:
  echo 3 > /proc/sys/vm/drop_caches; dd of=/home/ram0/mount0/foo.txt if=/dev/zero bs=1024K count=1000 conv=sync

Volume info:

Volume Name: xcube
Type: Distributed-Replicate
Volume ID: 84cbc80f-bf93-4b10-9865-79a129efe2f5
Status: Started
Snap Volume: no
Number of Bricks: 4 x 2 = 8
Transport-type: rdma
Bricks:
Brick1: 192.168.44.105:/home/ram0/b0
Brick2: 192.168.44.106:/home/ram0/b0
Brick3: 192.168.44.107:/brick/0/b0
Brick4: 192.168.44.108:/brick/0/b0
Brick5: 192.168.44.105:/home/ram1/b1
Brick6: 192.168.44.106:/home/ram1/b1
Brick7: 192.168.44.107:/brick/1/b1
Brick8: 192.168.44.108:/brick/1/b1
Options Reconfigured:
performance.io-cache: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable