Rafi, great results, thanks.  Your "io-cache off" columns are read tests with the io-cache translator disabled, correct?

Two things jump out at me from your numbers:

- the io-cache translator destroys RDMA read performance.
- approach 2i) "register iobuf pool" is the best approach.
  -- on reads with io-cache off, 32% better than baseline and 21% better than 1) "separate buffer"
  -- on writes, 22% better than baseline and 14% better than 1)

Can someone explain to me why the typical Gluster site wants to use the io-cache translator, given that FUSE now caches file data?  Should we just have it turned off by default at this point?  That would buy us time to change the io-cache implementation to be compatible with RDMA (see option "2ii" below).

remaining comments inline

-ben

----- Original Message -----
> From: "Mohammed Rafi K C" <rkavunga@xxxxxxxxxx>
> To: gluster-devel@xxxxxxxxxxx
> Cc: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, "Anand Avati" <avati@xxxxxxxxxxx>, "Ben Turner"
> <bturner@xxxxxxxxxx>, "Ben England" <bengland@xxxxxxxxxx>, "Suman Debnath" <sdebnath@xxxxxxxxxx>
> Sent: Friday, January 23, 2015 7:43:45 AM
> Subject: RDMA: Patch to make use of pre registered memory
>
> Hi All,
>
> As I pointed out earlier, for the rdma protocol we need to register the memory
> used during rdma reads and writes with the rdma device, and that registration
> is a costly operation.  To avoid registering memory in the i/o path, we came
> up with two solutions:
>
> 1) Use a separate pre-registered iobuf_pool for rdma.  This approach needs an
> extra level of copying in rdma for each read/write request, i.e. we need to
> copy the content of the memory given by the application into rdma's buffers
> in the rdma code.

copying data defeats the whole point of RDMA, which is to *avoid* copying data.

> 2) Register the default iobuf_pool in glusterfs_ctx with the rdma device
> during rdma initialization.  Since we are registering buffers from the
> default pool for read/write, we require neither per-request registration
> nor copying.

This makes far more sense to me.

> But the problem comes when the io-cache translator is turned on; then for
> each page fault, io-cache takes a ref on the iobuf of the response buffer
> to cache it, and because of this all the pre-allocated buffers get locked
> up in io-cache very soon.  Eventually all new requests get iobufs from new
> iobuf_pools which are not registered with rdma, and we have to do a
> registration for every iobuf.  To address this issue, we can:
>
> i) Turn off io-cache.
>    (we chose this for testing)
> ii) Use a separate buffer for io-cache, and offload from the default pool
>     to the io-cache buffer.
>     (new thread to offload)

I think this makes sense, because if you get an io-cache translator cache hit, then you don't need to go out to the network, so io-cache memory doesn't have to be registered with RDMA.  (There is a rough sketch of this idea below, after the quoted mail.)

> iii) Dynamically register each newly created arena with rdma; for this we
>      need to bring the libglusterfs code and the transport layer code
>      together.
>      (will need changes in packaging and may bring hard dependencies on
>      rdma libs)
> iv) Increase the default pool size.
>     (will increase the memory footprint of the glusterfs process)

registration with RDMA only makes sense to me when data is going to be sent/received over the RDMA network.  Is it hard to tell in advance which buffers will need to be transmitted?
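For concreteness, the per-arena registration being discussed boils down to one
ibv_reg_mr() call over each arena's memory, something like the sketch below
(libibverbs only; the struct fields and the helper name are made up for
illustration, this is not the actual gluster rdma transport code):

/* Sketch: register a whole pre-allocated buffer arena with the RDMA
 * device once, so every iobuf carved out of it can be used for RDMA
 * reads/writes without per-request registration (and without copying). */
#include <infiniband/verbs.h>
#include <stdio.h>

struct arena_sketch {
        void          *mem_base;    /* start of the arena's allocation       */
        size_t         arena_size;  /* total size of the allocation          */
        struct ibv_mr *mr;          /* handle kept for later ibv_dereg_mr()  */
};

static int
register_arena_with_rdma(struct ibv_pd *pd, struct arena_sketch *arena)
{
        arena->mr = ibv_reg_mr(pd, arena->mem_base, arena->arena_size,
                               IBV_ACCESS_LOCAL_WRITE |
                               IBV_ACCESS_REMOTE_READ |
                               IBV_ACCESS_REMOTE_WRITE);
        if (!arena->mr) {
                perror("ibv_reg_mr");
                return -1;
        }
        return 0;
}

The cost is all paid once at pool/arena creation time, which is the whole
point of approach 2) and of option iii) above.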
> We implemented two approaches, (1) and (2i), to get some performance
> numbers.  The setup was a 4x2 distributed-replicated volume using ram
> disks as bricks to avoid a hard disk bottleneck, and the numbers are
> attached with the mail.
>
> Please provide your thoughts on these approaches.
>
> Regards,
> Rafi KC
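Coming back to option 2ii): instead of io-cache holding a ref on an
RDMA-registered iobuf (which is what pins the registered pool), it could copy
the response page into its own unregistered cache buffer, possibly from an
offload thread, and release the registered iobuf right away.  A rough sketch
of the idea (all names here are invented for illustration, this is not the
actual io-cache code):

/* Sketch of option 2ii): io-cache keeps a private, unregistered copy of
 * the page instead of pinning the RDMA-registered iobuf. */
#include <stdlib.h>
#include <string.h>

struct ioc_page_copy {
        void   *data;   /* io-cache private copy; never goes on the wire */
        size_t  size;
};

/* Called on the read-response path, possibly from a separate offload
 * thread so the memcpy() stays off the fop's latency path.  The caller
 * then unrefs the registered iobuf instead of caching it. */
static struct ioc_page_copy *
ioc_copy_for_cache(const void *registered_iobuf_ptr, size_t size)
{
        struct ioc_page_copy *page = malloc(sizeof(*page));
        if (!page)
                return NULL;
        page->data = malloc(size);
        if (!page->data) {
                free(page);
                return NULL;
        }
        memcpy(page->data, registered_iobuf_ptr, size);
        page->size = size;
        return page;
}

The copy only happens on cache fill, and cache hits never touch the network,
so none of this memory would ever need to be registered with the RDMA device.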
Results (dd throughput, presumably MB/s; "io-cache off" columns are read tests,
presumably with the io-cache translator disabled, per the question at the top):

          Separate buffer for rdma (1)   No change (baseline)            Register default iobuf pool (2i)
Run       write   read   io-cache off    write   read   io-cache off     write   read   io-cache off
1         373     527    656             343     483    532              446     512    696
2         380     528    668             347     485    540              426     525    715
3         376     527    594             346     482    540              422     526    720
4         381     533    597             348     484    540              413     526    710
5         372     527    479             347     482    538              422     519    719
Note: (varying result)
Average   376.4   528.4  598.8           346.2   483.2   538             425.8   521.6  712

Commands:

read:  echo 3 > /proc/sys/vm/drop_caches; dd if=/home/ram0/mount0/foo.txt of=/dev/null bs=1024K count=1000
write: echo 3 > /proc/sys/vm/drop_caches; dd of=/home/ram0/mount0/foo.txt if=/dev/zero bs=1024K count=1000 conv=sync

vol info:

Volume Name: xcube
Type: Distributed-Replicate
Volume ID: 84cbc80f-bf93-4b10-9865-79a129efe2f5
Status: Started
Snap Volume: no
Number of Bricks: 4 x 2 = 8
Transport-type: rdma
Bricks:
Brick1: 192.168.44.105:/home/ram0/b0
Brick2: 192.168.44.106:/home/ram0/b0
Brick3: 192.168.44.107:/brick/0/b0
Brick4: 192.168.44.108:/brick/0/b0
Brick5: 192.168.44.105:/home/ram1/b1
Brick6: 192.168.44.106:/home/ram1/b1
Brick7: 192.168.44.107:/brick/1/b1
Brick8: 192.168.44.108:/brick/1/b1
Options Reconfigured:
performance.io-cache: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable