Running into weird issues here as well in a test environment. I don't have a solution either, but perhaps we can find some things in common.

Setup in a nutshell:
- Ceph cluster: Ubuntu 14.04, Kernel 3.16.7, Ceph 0.87-1 (OSDs with separate public/cluster networks at 10 Gbps)
- iSCSI proxy node (targetcli/LIO): Ubuntu 14.04, Kernel 3.16.7, Ceph 0.87-1 (10 Gbps)
- Client node: Ubuntu 12.04, Kernel 3.11 (10 Gbps)

Relevant cluster config: writeback cache tiering with NVMe PCIe cards (2 replicas) in front of an erasure-coded pool (k=3, m=2) backed by spindles.

I'm following the instructions here:
http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices

No issues with creating and mapping a 100GB RBD image and then creating the target. I'm interested in finding out the overhead/performance impact of re-exporting through iSCSI, so the idea is to run benchmarks. Here's a fio test I'm trying to run on the client node on the mounted iSCSI device:

fio --name=writefile --size=100G --filesize=100G --filename=/dev/sdu --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio

The benchmark eventually hangs towards the end of the test for several long seconds before completing. On the proxy node, the kernel complains about an iSCSI portal login timeout: http://pastebin.com/Q49UnTPr and I also see irqbalance errors in syslog: http://pastebin.com/AiRTWDwR

Doing the same test on the machines directly (raw, rbd, on the OSD filesystem) doesn't yield any issues.

I've tried a couple of things to see if I could get things to work...
- Set irqbalance --hintpolicy=ignore (http://sourceforge.net/p/e1000/bugs/394/ & https://bugs.launchpad.net/ubuntu/+source/irqbalance/+bug/1321425)
- Changed size on the cache pool to 1 (for the sake of testing; it improved performance but the test still hangs)
- Set CRUSH tunables to legacy (and back to optimal)
- Tried various package and kernel versions, and putting the proxy node on Ubuntu precise
- Formatted and mounted the iSCSI block device and ran the test on the formatted filesystem

I don't think it's related, but I don't remember running into issues before I swapped out the SSDs for the NVMe cards in the cache pool. I don't have time *right now*, but I definitely want to test whether I can reproduce the issue on the SSDs.

Let me know if this gives you any ideas, I'm all ears.
--
David Moreau Simard

> On Oct 28, 2014, at 4:07 PM, Christopher Spearman <neromaverick@xxxxxxxxx> wrote:
>
> Sage:
>
> That'd be my assumption, performance looked pretty fantastic over loop until it started being used it heavily
>
> Mike:
>
> The configs you asked for are at the end of this message I've subtracted & changed some info, iqn/wwn/portal, for security purposes. The raw & loop target configs are all in one since I'm running both types of configs currently. I also included the running config (ls /) of targetcli for anyone interested in what it looks like from the console.
>
> The tool I used was dd, I ran through various options using dd but didn't really see much difference.
> The one on top is my go to command for my first test
>
> time dd if=/dev/zero of=test bs=32M count=32 oflag=direct,sync
> time dd if=/dev/zero of=test bs=32M count=128 oflag=direct,sync
> time dd if=/dev/zero of=test bs=8M count=512 oflag=direct,sync
> time dd if=/dev/zero of=test bs=4M count=1024 oflag=direct,sync
>
>
> ---ls / from current targetcli (no mounted ext4 -> image file config)---
>
> /iscsi> ls /
> o- / ......................................................................................................................... [...]
> o- backstores .............................................................................................................. [...]
> | o- block .................................................................................................. [Storage Objects: 2]
> | | o- ceph_lun0 ...................................................................... [/dev/loop0 (2.0TiB) write-thru activated]
> | | o- ceph_noloop00 .............................................. [/dev/rbd/vmiscsi/noloop00 (1.0TiB) write-thru activated]
> | o- fileio ................................................................................................. [Storage Objects: 0]
> | o- pscsi .................................................................................................. [Storage Objects: 0]
> | o- ramdisk ................................................................................................ [Storage Objects: 0]
> o- iscsi ............................................................................................................ [Targets: 2]
> | o- iqn.gateway2_01 ..................................................... [TPGs: 1]
> | | o- tpg1 ............................................................................................... [no-gen-acls, no-auth]
> | | o- acls .......................................................................................................... [ACLs: 2]
> | | | o- iqn.esxhost01 ............................................................ [Mapped LUNs: 1]
> | | | | o- mapped_lun0 ......................................................................... [lun0 block/ceph_noloop00 (rw)]
> | | | o- iqn.esxhost02 ....................................................... [Mapped LUNs: 1]
> | | | o- mapped_lun0 ......................................................................... [lun0 block/ceph_noloop00 (rw)]
> | | o- luns .......................................................................................................... [LUNs: 1]
> | | | o- lun0 ........................................................... [block/ceph_noloop00 (/dev/rbd/vmiscsi/noloop00)]
> | | o- portals .................................................................................................... [Portals: 1]
> | | o- xxx.xxx.xxx.xxx:3260 ............................................................................................... [OK]
> | o- iqn.gateway2_02 ..................................................... [TPGs: 1]
> | o- tpg1 ............................................................................................... [no-gen-acls, no-auth]
> | o- acls .......................................................................................................... [ACLs: 2]
> | | o- iqn.esxhost01 ............................................................ [Mapped LUNs: 1]
> | | | o- mapped_lun0 ............................................................................. [lun0 block/ceph_lun0 (rw)]
> | | o- iqn.esxhost02 ............................................................ [Mapped LUNs: 1]
> | | | o- mapped_lun0 ............................................................................. [lun0 block/ceph_lun0 (rw)]
> | o- luns .......................................................................................................... [LUNs: 1]
> | | o- lun0 ................................................................................... [block/ceph_lun0 (/dev/loop0)]
> | o- portals .................................................................................................... [Portals: 1]
> | o- xxx.xxx.xxx.xxx:3260 ............................................................................................... [OK]
> o- loopback ......................................................................................................... [Targets: 0]
>
> ---saveconfig.json for mounted ext4 config---
>
> {
>   "fabric_modules": [],
>   "storage_objects": [
>     {
>       "attributes": {
>         "block_size": 512,
>         "emulate_dpo": 0,
>         "emulate_fua_read": 0,
>         "emulate_fua_write": 1,
>         "emulate_model_alias": 1,
>         "emulate_rest_reord": 0,
>         "emulate_tas": 1,
>         "emulate_tpu": 0,
>         "emulate_tpws": 0,
>         "emulate_ua_intlck_ctrl": 0,
>         "emulate_write_cache": 1,
>         "enforce_pr_isids": 1,
>         "fabric_max_sectors": 8192,
>         "is_nonrot": 0,
>         "max_unmap_block_desc_count": 1,
>         "max_unmap_lba_count": 8192,
>         "max_write_same_len": 4096,
>         "optimal_sectors": 8192,
>         "queue_depth": 128,
>         "unmap_granularity": 1,
>         "unmap_granularity_alignment": 0
>       },
>       "dev": "/mnt/ceph_perf_test/mounted_rbd_img_test.img",
>       "name": "mounted_rbd_img_test",
>       "plugin": "fileio",
>       "size": 8589934592,
>       "write_back": true,
>       "wwn": "xxxx-xxxx-xxxx-xxxx"
>     }
>   ],
>   "targets": [
>     {
>       "fabric": "iscsi",
>       "tpgs": [
>         {
>           "attributes": {
>             "authentication": 0,
>             "cache_dynamic_acls": 0,
>             "default_cmdsn_depth": 16,
>             "demo_mode_write_protect": 1,
>             "generate_node_acls": 0,
>             "login_timeout": 15,
>             "netif_timeout": 2,
>             "prod_mode_write_protect": 0
>           },
>           "enable": true,
>           "luns": [
>             {
>               "index": 0,
>               "storage_object": "/backstores/fileio/mounted_rbd_img_test"
>             }
>           ],
>           "node_acls": [
>             {
>               "attributes": {
>                 "dataout_timeout": 3,
>                 "dataout_timeout_retries": 5,
>                 "default_erl": 0,
>                 "nopin_response_timeout": 30,
>                 "nopin_timeout": 15,
>                 "random_datain_pdu_offsets": 0,
>                 "random_datain_seq_offsets": 0,
>                 "random_r2t_offsets": 0
>               },
>               "mapped_luns": [
>                 {
>                   "index": 0,
>                   "tpg_lun": 0,
>                   "write_protect": false
>                 }
>               ],
>               "node_wwn": "iqn.centoshost01"
>             }
>           ],
>           "parameters": {
>             "AuthMethod": "CHAP,None",
>             "DataDigest": "CRC32C,None",
>             "DataPDUInOrder": "Yes",
>             "DataSequenceInOrder": "Yes",
>             "DefaultTime2Retain": "20",
>             "DefaultTime2Wait": "2",
>             "ErrorRecoveryLevel": "0",
>             "FirstBurstLength": "65536",
>             "HeaderDigest": "CRC32C,None",
>             "IFMarkInt": "2048~65535",
>             "IFMarker": "No",
>             "ImmediateData": "Yes",
>             "InitialR2T": "Yes",
>             "MaxBurstLength": "262144",
>             "MaxConnections": "1",
>             "MaxOutstandingR2T": "1",
>             "MaxRecvDataSegmentLength": "8192",
>             "MaxXmitDataSegmentLength": "262144",
>             "OFMarkInt": "2048~65535",
>             "OFMarker": "No",
>             "TargetAlias": "LIO Target"
>           },
>           "portals": [
>             {
>               "ip_address": "xxx.xxx.xxx.xxx",
>               "iser": false,
>               "port": 3260
>             }
>           ],
>           "tag": 1
>         }
>       ],
>       "wwn": "iqn.gateway1_01"
>     },
>   ]
> }
>
> ---saveconfig.json for raw & loop targets---
>
> {
>   "fabric_modules": [],
>   "storage_objects": [
>     {
>       "attributes": {
>         "block_size": 512,
>         "emulate_dpo": 0,
>         "emulate_fua_read": 0,
>         "emulate_fua_write": 1,
>         "emulate_model_alias": 1,
>         "emulate_rest_reord": 0,
>         "emulate_tas": 1,
>         "emulate_tpu": 0,
>         "emulate_tpws": 0,
>         "emulate_ua_intlck_ctrl": 0,
>         "emulate_write_cache": 0,
>         "enforce_pr_isids": 1,
>         "fabric_max_sectors": 8192,
>         "is_nonrot": 0,
>         "max_unmap_block_desc_count": 0,
>         "max_unmap_lba_count": 0,
>         "max_write_same_len": 65535,
>         "optimal_sectors": 8192,
>         "queue_depth": 128,
>         "unmap_granularity": 0,
>         "unmap_granularity_alignment": 0
>       },
>       "dev": "/dev/rbd/vmiscsi/noloop00",
>       "name": "ceph_noloop00",
>       "plugin": "block",
>       "readonly": false,
>       "write_back": false,
>       "wwn": "xxxx-xxxx-xxxx"
>     },
>     {
>       "attributes": {
>         "block_size": 512,
>         "emulate_dpo": 0,
>         "emulate_fua_read": 0,
>         "emulate_fua_write": 1,
>         "emulate_model_alias": 1,
>         "emulate_rest_reord": 0,
>         "emulate_tas": 1,
>         "emulate_tpu": 0,
>         "emulate_tpws": 0,
>         "emulate_ua_intlck_ctrl": 0,
>         "emulate_write_cache": 0,
>         "enforce_pr_isids": 1,
>         "fabric_max_sectors": 8192,
>         "is_nonrot": 0,
>         "max_unmap_block_desc_count": 0,
>         "max_unmap_lba_count": 0,
>         "max_write_same_len": 65535,
>         "optimal_sectors": 8192,
>         "queue_depth": 128,
>         "unmap_granularity": 0,
>         "unmap_granularity_alignment": 0
>       },
>       "dev": "/dev/loop0",
>       "name": "ceph_lun0",
>       "plugin": "block",
>       "readonly": false,
>       "write_back": false,
>       "wwn": "yyyy-yyyy-yyyy"
>     },
>   ],
>   "targets": [
>     {
>       "fabric": "iscsi",
>       "tpgs": [
>         {
>           "attributes": {
>             "authentication": 0,
>             "cache_dynamic_acls": 0,
>             "default_cmdsn_depth": 16,
>             "demo_mode_write_protect": 1,
>             "generate_node_acls": 0,
>             "login_timeout": 15,
>             "netif_timeout": 2,
>             "prod_mode_write_protect": 0
>           },
>           "enable": true,
>           "luns": [
>             {
>               "index": 0,
>               "storage_object": "/backstores/block/ceph_noloop00"
>             }
>           ],
>           "node_acls": [
>             {
>               "attributes": {
>                 "dataout_timeout": 3,
>                 "dataout_timeout_retries": 5,
>                 "default_erl": 0,
>                 "nopin_response_timeout": 30,
>                 "nopin_timeout": 15,
>                 "random_datain_pdu_offsets": 0,
>                 "random_datain_seq_offsets": 0,
>                 "random_r2t_offsets": 0
>               },
>               "mapped_luns": [
>                 {
>                   "index": 0,
>                   "tpg_lun": 0,
>                   "write_protect": false
>                 }
>               ],
>               "node_wwn": "iqn.esxhost01"
>             },
>             {
>               "attributes": {
>                 "dataout_timeout": 3,
>                 "dataout_timeout_retries": 5,
>                 "default_erl": 0,
>                 "nopin_response_timeout": 30,
>                 "nopin_timeout": 15,
>                 "random_datain_pdu_offsets": 0,
>                 "random_datain_seq_offsets": 0,
>                 "random_r2t_offsets": 0
>               },
>               "mapped_luns": [
>                 {
>                   "index": 0,
>                   "tpg_lun": 0,
>                   "write_protect": false
>                 }
>               ],
>               "node_wwn": "iqn.esxhost02"
>             }
>           ],
>           "parameters": {
>             "AuthMethod": "CHAP,None",
>             "DataDigest": "CRC32C,None",
>             "DataPDUInOrder": "Yes",
>             "DataSequenceInOrder": "Yes",
>             "DefaultTime2Retain": "20",
>             "DefaultTime2Wait": "2",
>             "ErrorRecoveryLevel": "0",
>             "FirstBurstLength": "65536",
>             "HeaderDigest": "CRC32C,None",
>             "IFMarkInt": "2048~65535",
>             "IFMarker": "No",
>             "ImmediateData": "Yes",
>             "InitialR2T": "Yes",
>             "MaxBurstLength": "262144",
>             "MaxConnections": "1",
>             "MaxOutstandingR2T": "1",
>             "MaxRecvDataSegmentLength": "8192",
>             "MaxXmitDataSegmentLength": "262144",
>             "OFMarkInt": "2048~65535",
>             "OFMarker": "No",
>             "TargetAlias": "LIO Target"
>           },
>           "portals": [
>             {
>               "ip_address": "xxx.xxx.xxx.xxx",
>               "iser": false,
>               "port": 3260
>             }
>           ],
>           "tag": 1
>         }
>       ],
>       "wwn": "iqn.gateway2_01"
>     },
>     {
>       "fabric": "iscsi",
>       "tpgs": [
>         {
>           "attributes": {
>             "authentication": 0,
>             "cache_dynamic_acls": 0,
>             "default_cmdsn_depth": 16,
>             "demo_mode_write_protect": 1,
>             "generate_node_acls": 0,
>             "login_timeout": 15,
>             "netif_timeout": 2,
>             "prod_mode_write_protect": 0
>           },
>           "enable": true,
>           "luns": [
>             {
>               "index": 0,
>               "storage_object": "/backstores/block/ceph_lun0"
>             }
>           ],
>           "node_acls": [
>             {
>               "attributes": {
>                 "dataout_timeout": 3,
>                 "dataout_timeout_retries": 5,
>                 "default_erl": 0,
>                 "nopin_response_timeout": 30,
>                 "nopin_timeout": 15,
>                 "random_datain_pdu_offsets": 0,
>                 "random_datain_seq_offsets": 0,
>                 "random_r2t_offsets": 0
>               },
>               "mapped_luns": [
>                 {
>                   "index": 0,
>                   "tpg_lun": 0,
>                   "write_protect": false
>                 }
>               ],
>               "node_wwn": "iqn.esxhost01"
>             },
>             {
>               "attributes": {
>                 "dataout_timeout": 3,
>                 "dataout_timeout_retries": 5,
>                 "default_erl": 0,
>                 "nopin_response_timeout": 30,
>                 "nopin_timeout": 15,
>                 "random_datain_pdu_offsets": 0,
>                 "random_datain_seq_offsets": 0,
>                 "random_r2t_offsets": 0
>               },
>               "mapped_luns": [
>                 {
>                   "index": 0,
>                   "tpg_lun": 0,
>                   "write_protect": false
>                 }
>               ],
>               "node_wwn": "iqn.esxhost02"
>             },
>           ],
>           "parameters": {
>             "AuthMethod": "CHAP,None",
>             "DataDigest": "CRC32C,None",
>             "DataPDUInOrder": "Yes",
>             "DataSequenceInOrder": "Yes",
>             "DefaultTime2Retain": "20",
>             "DefaultTime2Wait": "2",
>             "ErrorRecoveryLevel": "0",
>             "FirstBurstLength": "65536",
>             "HeaderDigest": "CRC32C,None",
>             "IFMarkInt": "2048~65535",
>             "IFMarker": "No",
>             "ImmediateData": "Yes",
>             "InitialR2T": "Yes",
>             "MaxBurstLength": "262144",
>             "MaxConnections": "1",
>             "MaxOutstandingR2T": "1",
>             "MaxRecvDataSegmentLength": "8192",
>             "MaxXmitDataSegmentLength": "262144",
>             "OFMarkInt": "2048~65535",
>             "OFMarker": "No",
>             "TargetAlias": "LIO Target"
>           },
>           "portals": [
>             {
>               "ip_address": "xxx.xxx.xxx.xxx",
>               "iser": false,
>               "port": 3260
>             }
>           ],
>           "tag": 1
>         }
>       ],
>       "wwn": "iqn.gateway2_02"
>     }
>   ]
> }
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com