I have a candidate for the performance issues we've been running into with Cyrus under load. I think it could be in the ZFS filesystem layer. Our man Omen has come back from DTrace class and started running it, and while it's early days yet, there seem to be some fdsync calls taking a very long time to complete under duress, 30 seconds and more.

I look at things from an aerospace engineering perspective, in terms of "envelopes". Every filesystem has its zone where it performs well; once you get outside the envelope, everything goes to hell. With UFS, for example, you often had to tweak things or switch to directio or even fastfs under some load patterns. ZFS is still fairly young, so there are still edges to find.

Searching the ZFS bugids turns up a number of hits for fdsync. This connects back, I believe, to Pascal Gienger's thread, which I think resulted in him turning off the ZIL to bump up performance. This is speculation at this point; anyhow, here's the BugID information for those who are interested:

===================================================
Bug ID: 6535160
Synopsis: Lock contention on zl_lock from zil_commit
Category: kernel
Subcategory: zfs
State: 10-Fix Delivered

Description:

I've been running the filebench varmail load on a T2000 system, upping the load a little to account for the larger size of the system, and encountered some significant lock contention coming from zil_commit().

First, the configuration I'm running on:

  32-strand T2000 (Ontario) @ 1GHz
  32 GB RAM
  4 FC HBAs
  12 T3B FC arrays (three per HBA), 4 LUNs presented each (total of 48 LUNs)

The system is running Krister's ARC scalability bits. One pool has been created, using a simple stripe (no mirror, no RAIDZ/RAIDZ2). Compression is off (default), recordsize is default (128KB). The pool has one filesystem, as it's not entirely straightforward to work with multiple filesystems and filebench. I've tweaked the varmail config with the following settings:

  filebench> set $nfiles=100000
  filebench> set $nthreads=64

The lock contention picture looks like the following. First, typical output of lockstat -cwPW:

Adaptive mutex spin: 1156154 events in 30.069 seconds (38449 events/sec)

  Count indv cuml rcnt     spin Hottest Lock   Hottest Caller
-------------------------------------------------------------------------------
1141286 100% 100% 0.00      21 0x60040ce6180   zil_commit+0x40
...

Adaptive mutex block: 2828 events in 30.069 seconds (94 events/sec)

  Count indv cuml rcnt     nsec Hottest Lock   Hottest Caller
-------------------------------------------------------------------------------
   2760  99%  99% 0.00   447336 0x60040ce6180  zil_commit+0x40

This means that we're spinning on the lock an average of 1189 times per CPU per second, and blocking on it an average of 2.9 times per CPU per second for roughly 447us each time. We get to zil_commit() via:

dtrace: description 'fbt::zil_commit:entry ' matched 2 probes
CPU     ID                    FUNCTION:NAME
  3  50356                        :tick-30s

              zfs`zfs_fsync+0xa4
              genunix`fop_fsync+0x14
              genunix`fdsync+0x3c
              unix`syscall_trap+0xac
            20244

So at first blush this is all fsync()-related. I'm assuming, of course, that the varmail profile is representative of the industry in general in this regard, so we'll march on. Who else is taking that lock?
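(A per-caller breakdown of a single hot lock, like the one below, can be collected with something along these lines; the exact invocation isn't recorded in the CR, so this is my reconstruction, and the -D 15 cap on displayed entries is a guess:

  # lockstat -C -l 0x60040ce6180 -D 15 sleep 30

Here -C watches contention events, -l restricts the watch to the lock at that address, and the trailing command sets the sampling window.)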
Adaptive mutex spin: 620972 events in 30.055 seconds (20661 events/sec)

 Count indv cuml rcnt     spin Lock           Caller
-------------------------------------------------------------------------------
574654  93%  93% 0.00      25 0x60040ce6180   zil_commit+0x40
 15094   2%  95% 0.00      18 0x60040ce6180   zil_commit_writer+0x200
 12918   2%  97% 0.00      15 0x60040ce6180   zil_itx_assign+0x4
  9241   1%  99% 0.00      13 0x60040ce6180   zil_lwb_write_start+0x22c
  6103   1% 100% 0.00      11 0x60040ce6180   zil_commit+0x14
  2015   0% 100% 0.00      11 0x60040ce6180   zil_lwb_write_done+0x24
   941   0% 100% 0.00       7 0x60040ce6180   zil_commit_writer+0x32c
     4   0% 100% 0.00      19 0x60040ce6180   zil_clean+0x4
     2   0% 100% 0.00       7 0x60040ce6180   zil_sync+0x28
-------------------------------------------------------------------------------

So it's still mostly zil_commit() coming from fsync(). Let's peek at the hold times:

 Count indv cuml rcnt     nsec Lock           Caller
-------------------------------------------------------------------------------
 23624  82%  82% 0.00    31656 0x60040ce6180  cv_wait+0x34
  1249   4%  86% 0.00    17210 0x60040ce6180  zil_commit_writer+0x1a8
  1248   4%  90% 0.00    13491 0x60040ce6180  zil_itx_assign+0x44
   760   3%  93% 0.00    10400 0x60040ce6180  zio_done+0x124
   760   3%  96% 0.00    10431 0x60040ce6180  zil_lwb_write_start+0x240
   640   2%  98% 0.00   398762 0x60040ce6180  zfs_fsync+0xa4
   635   2% 100% 0.00    12352 0x60040ce6180  zil_commit_writer+0x290
     5   0% 100% 0.00    15032 0x60040ce6180  zil_commit_writer+0x238
     2   0% 100% 0.00 18063206 0x60040ce6180  dmu_objset_sync+0x158
     2   0% 100% 0.00    13400 0x60040ce6180  dsl_pool_zil_clean+0x28
-------------------------------------------------------------------------------

Nothing outrageous, aside from the dmu_objset_sync() one (which is really zil_sync()), which is the source of a different CR. Now I'm going to bet that the cv_wait() call is really zil_commit() in disguise:

-------------------------------------------------------------------------------
 Count indv cuml rcnt     nsec Lock           Caller
 37186  79%  79% 0.00    39440 0x60040ce6180  cv_wait+0x34

      nsec ------ Time Distribution ------ count     Stack
     32768 |@@@@@@@                         9062     zil_commit+0x40
     65536 |@@@@@@@@@@@@@@@@@@@@@@         28093     zfs_fsync+0xa4
    131072 |                                   8     fop_fsync+0x14
    262144 |                                   3     fdsync+0x3c
    524288 |                                   7     syscall_trap+0xac
   1048576 |                                  10
   2097152 |                                   1
   4194304 |                                   2

And it is. We do have some outliers, but nothing approaching the dmu_objset_sync() case. zfs_fsync() also showed up in the hold-time report, although my trolling through the code doesn't see it taking the suspect lock; rather, it calls zil_commit(), which does. For completeness, the hold times for the zfs_fsync() case look like this:

 Count indv cuml rcnt     nsec Lock           Caller
  1152   2%  90% 0.00   424344 0x60040ce6180  zfs_fsync+0xa4

      nsec ------ Time Distribution ------ count     Stack
     16384 |                                  15     fop_fsync+0x14
     32768 |                                  12     fdsync+0x3c
     65536 |@                                 44     syscall_trap+0xac
    131072 |@@                                94
    262144 |@@@@@                            207
    524288 |@@@@@@@@@@                       385
   1048576 |@@@@@@@@@@                       386
   2097152 |                                   3
   4194304 |                                   6

The profile (in the form of <probes:function>) looks like this:

  7491134:cpu_halt
  2896815:mutex_vector_enter
   141563:exec_fault
    92557:disp_getwork
    68408:lzjb_compress
    43288:lwpchan_cache_mapping
    36614:_resume_from_idle
    35462:mutex_enter
    26316:fletcher_2_native
    24701:lock_set_spl_spin

Which means that when we're doing work (cpu_halt() being, of course, idle) we're spending most of our time in mutex_vector_enter(). The first "real" work we get to is lzjb_compress(), which I'm going to assume is from metadata compression. The amount of idle suggests this load could be increased considerably before we would become CPU-saturated.
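(A kernel profile in that probes:function form is typically gathered with a DTrace aggregation along these lines; this is my reconstruction rather than an invocation recorded in the CR, and the 997Hz sampling rate is an assumption:

  # dtrace -n 'profile-997 /arg0/ { @[func(arg0)] = count(); }'

The profile-997 probe fires 997 times per second on every CPU; arg0 is the interrupted kernel PC, non-zero only when the CPU is in the kernel, and func() resolves each PC to its containing function, so the aggregation counts samples per kernel function.)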
A brief check with increased load shows that the contention does in fact continue to worsen in proportion, and overall throughput begins to suffer. At nearly-full CPU saturation we now see a profile like this:

  8821287:mutex_vector_enter
  1454131:cpu_halt
   130854:exec_fault
    93425:sleepq_insert
    88498:lock_set_spl_spin
    50079:lzjb_compress
    46588:disp_getwork
    32659:_resume_from_idle
    27340:lwpchan_cache_mapping
    26751:fletcher_2_native

Date Modified: 2007-03-16 01:05:19 GMT+00:00

I took the opportunity to hack up the tsd-based suggestion in the Evaluation and saw some very positive results. On entry into zfs_fsync() we set a tsd flag which is used in zfs_log_write() to mark any writes by that thread as synchronous. With the baseline bits we get:

  IO Summary: 679672 ops 5605.4 ops/s, (862/863 r/w) 27.5mb/s, 6466us cpu/op, 36.7ms latency

and with the change we get:

  IO Summary: 885634 ops 7300.9 ops/s, (1123/1124 r/w) 37.5mb/s, 1327us cpu/op, 11.6ms latency

In particular, the fsyncfile ops see a reduction in latency from ~70ms/op to ~19ms/op. The lock contention picture now looks like this:

Adaptive mutex spin: 917767 events in 30.061 seconds (30531 events/sec)

 Count indv cuml rcnt     spin Hottest Lock   Hottest Caller
-------------------------------------------------------------------------------
176094  26%  26% 0.00       8 pr_pidlock[8]   pr_p_lock+0xc
 76925  16%  42% 0.00      11 0x600442aa4c0   zil_commit+0x40

Which is, obviously, considerably better for the zil_commit() issue. The pr_p_lock() contention is, unfortunately, endemic to filebench's use of microstate accounting for each op with high numbers of threads in a single process, and is thus unrelated to the issue at hand.

Date Modified: 2007-06-19 17:06:55 GMT+00:00

Work Around:
Suggested Fix:
Evaluation:
Fixed by patch:
Integrated in Build: snv_71
Duplicate of:
Related Change Request(s): 6413510 6579989

Date Modified: 2007-11-29 01:30:17 GMT+00:00

Public Summary:

----
Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html