On Wed, 24 May 2006, Mike O'Connor wrote:

> :Sun says it is jabber, which is why I put it in quotes. Since they have not
> :replicated it in the lab, they are jumping to conclusions. Yes, I agree,
> :it is very specific and the backline engineer usage appears 'stretching things'
> Most Sun adapters have an actual jabber counter that netstat -k will
> spew out for you. You can eliminate ambiguity easily enough. Here's
> an example I Google'd for:
>

indeed, and kstat shows a jabber count of 0. more ammo in my favor, which
I've presented back to the irritating backline.

> netstat -k eri0
> eri0:
> ipackets 525571 ierrors 365 opackets 8446 oerrors 0 collisions 85
> ifspeed 10000000 rbytes 73324309 obytes 1118022 multircv 99205
> multixmt 6 brdcstrcv 415863
> brdcstxmt 10 norcvbuf 0 noxmtbuf 0 inits 4 rx_inits 8 tx_inits 1
> nocarrier 1 nocanput 0 allocbfail 0 drop 321 pasue_rcv_cnt 0
> pasue_on_cnt 0 pasue_off_cnt 0 pasue_time_cnt 0 txmac_urun 0
> txmac_maxpkt_err 0 excessive_coll 0 late_coll 0 first_coll 35
> defer_timer_exp 0 peak_attempt_cnt 0 jabber 0 no_tmds 0
>
> (see, "jabber")
>
> tx_hang 0 rx_corr 0 no_free_rx_desc 0 rx_overflow 0 rx_hang 0
> rx_align_err 64 rx_crc_err 19 rx_length_err 0 rx_code_viol_err 0
> bad_pkts 321 runt 40 toolong_pkts 279 rxtag_error 0 parity_error 0
> pci_error_interrupt 0 unknown_fatal 0 pci_data_parity_err 0
> pci_signal_target_abort 0 pci_rcvd_target_abort 0 pci_rcvd_master_abort 0
> pci_signal_system_err 0 pci_det_parity_err 0 ipackets64 525571
> opackets64 8446 rbytes64 73324309 obytes64 1118022 pmcap 4
>
> :In this case it's tcp/ip.
> :
> :step 1) telnet to router
> :step 2) ping some remote device on a fast link (like 2GB IP/Sonet)
> :step 3) watch as returning tcp/ip telnet stream DOS's the sun.
> :
> :it is not the cisco ping that is DOS'ing the sun, it is the return stream
> :of !!..!.!!!....!!!..!!!... (ad infinitum)
>
> Ahhh, so it's just the return traffic from the Cisco printing out all
> those !!..!.!!!
> stuff (corresponding to whatever it is that the Cisco is
> pinging) that causes all this? Nifty! I didn't think that the Cisco
> could print that fast! I'm fairly certain it should rate-limit/sample
> that output (unless some automated thingy actually cares about that
> output coming from the Cisco).
>

you'd be surprised how fast a gsr can spit out streams of !.!..!..!
(30,000 pps before the sun craps out. ;)

> :the nagle comes into play in the tcp-stream not coalescing all the
> :single char tcp/ip packets each with a single ! or . in it.
>
> Makes perfect sense now that I get what the traffic is. As an aside,
> the Nagle algorithm was designed with telnet explicitly in mind, per
> RFC 896. But, a lot of folks these days use telnet for stuff apart
> from interactive use, and I could see someone wanting to disable it
> for performance' sake. For bare-bones stack implementations, Nagle
> may not be there at all.
>

yep. It's just not turned on on routers by default, so this one caught us
a little bit by surprise when engineers were running a burn-in test in the
lab on an OC-192 card. (_usually_ you don't cream a router with lots of
little packets via telnet)

> :right. totally agreed. it should not cause the machine to totally lock up.
> :(I specified wrong earlier, btw. Break still works, just nothing else does)
>
> That makes it sound even more like an interrupt issue rather than some
> overall system lock.
>

it sounds that way to me as well.

> :> In this particular case, if you're talking about ICMP, and there
> :> really isn't a "jabber"/physical layer issue afoot, the idea is for
> ...
> :getting that someone to not slap a 'jabber' label on things and
> :dismiss it out of hand is where I am currently frustrated beyond
> :belief.
>
> Beyond netstat -k, you can probably use lockstat or other kernel
> profiling tools as I mentioned in my earlier post to give them a
> good idea of where the bug really is. Interrupt issues aren't
> always going to be cut and dried.
> There could be some particular
> flavor of IOS, network adapter, media type, CPU, OS, etc. that
> is more prone or less prone to the problem.
>
> :well, yes, this was all quite accidental in the first place.
> :The solution is really quite easy, don't disable nagle on the
> :cisco in the first place. However, I'm much more concerned about
> :the implications of a normal user being able to DOS the machine and
> :Sun not caring enough to do due diligence to address the issue.
>
> Judging from the number of times we've exchanged emails (I should
> have asked for a network diagram sooner to help visualize this :) ),
> sometimes it's not so easy. And "what is or isn't a DoS" can be a
> grey line where reasonable people may differ. I could readily see
> someone saying "if you point a stupid amount of traffic at something
> it dies, have you considered just not doing that?".
>

yup. I've got plenty of ammo to throw at the irritating and dubiously
self-righteous backline, but sometimes the only way to raise matters above
somebody who doesn't want to admit there is a problem is to provide a
little community pressure to fix it (even if it isn't critical or may be
hard to reproduce without appreciably fast equipment on hand).

A DOS that makes a machine unusable is a DOS. Mis-categorizing it (on
their part) as jabber is wrong as well as condescending (left that part
out) and just plain irritating from a company that usually takes operating
system availability much more seriously.

Doug
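
For anyone who wants to script the counter check discussed above, here is a minimal Python sketch. It assumes the netstat -k (or kstat) output has already been saved as plain "name value" text like the sample quoted earlier. The function names and the list of counters to watch are hypothetical, picked only for illustration, and the sample is abbreviated from the quoted output.

```python
import re

# Abbreviated excerpt of the "netstat -k eri0" output quoted in this thread.
SAMPLE = """\
eri0:
ipackets 525571 ierrors 365 opackets 8446 oerrors 0 collisions 85
defer_timer_exp 0 peak_attempt_cnt 0 jabber 0 no_tmds 0
rx_align_err 64 rx_crc_err 19 rx_length_err 0 rx_code_viol_err 0
bad_pkts 321 runt 40 toolong_pkts 279 rxtag_error 0 parity_error 0
"""

# Hypothetical watch list: physical-layer style counters worth a look.
WATCHED = ("jabber", "rx_crc_err", "rx_align_err", "runt", "toolong_pkts")


def parse_counters(text):
    """Return {name: value} for every 'name integer' pair found in text."""
    pairs = re.findall(r"([A-Za-z_][A-Za-z0-9_]*)\s+(\d+)\b", text)
    return {name: int(value) for name, value in pairs}


def nonzero_watched(counters, names=WATCHED):
    """Return only the watched counters that are present and above zero."""
    return {n: counters[n] for n in names if counters.get(n, 0) > 0}


counters = parse_counters(SAMPLE)
print("jabber =", counters["jabber"])
print("nonzero watched counters:", nonzero_watched(counters))
```

A jabber value of 0 alongside nonzero CRC, alignment, runt, or oversize counters is the kind of evidence that can be handed back to support, though which counters actually matter depends on the driver and the problem at hand.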