--- Richard Lynch <ceo@xxxxxxxxx> wrote:
> What regular expression does one use when there really isn't a
> whole lot you can say about the text?...
>
> I mean, say for a guestbook or bulletin board or for a person's
> Bio or...
>
> You can limit it to a certain number of characters in length.
>
> You can mess with strip_tags and also do an ereg to rip out any
> kind of JavaScript on tags you want to *allow*.
>
> But then what?
>
> I mean, it seems like there's still an awful lot of wiggle room
> for mischief there, in an arbitrary string typed by the user.

This type of data is certainly the most difficult to filter,
especially if you try to adhere to very strict security principles.

You start with the same question as with any other data - what
exactly do I want to allow? This is much easier and less prone to
error than asking what you want to reject. If someone is entering a
bio, a whitelist is difficult to create, but not impossible.

The best approach to take when valid data is an unknown is to create
a system that learns. This can be as simple as enabling a whitelist
check and logging all failures, while relying on some other method
for interim protection (i.e., a whitelist failure is logged but not
yet treated as a security breach). Manual inspection of the failures
can be used to enhance the whitelist, and once you feel it is
capable, you can switch to it as the primary method of protection.

I must admit that I often take the lazy way out (with the caveat
that some situations demand a higher level of security and a
stricter adherence to best practices). The lazy way to filter output
is htmlentities(), a function that converts every character that has
an equivalent HTML entity to that entity. Thus, any character that
may have special meaning to a browser is converted to something that
is only useful in displaying that character. If you want to allow
some markup, convert those specific entities back (use a literal
match when possible, with pattern matching as a last resort).
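To make the "escape everything, then convert the allowed markup back" idea concrete, here is a minimal sketch, assuming PHP 7 or later and UTF-8 input. The function name escape_with_allowed_tags() and the default tag list are hypothetical, chosen only for illustration. It uses strtr() with an array so the restore step is a literal match rather than a pattern match.

```php
<?php
// Hypothetical helper (not from the original post): escape all output,
// then restore a small literal whitelist of attribute-free tags.
function escape_with_allowed_tags(string $text, array $allowed = ['b', 'i', 'em', 'strong']): string
{
    // Convert every character that has an HTML entity equivalent,
    // including both kinds of quotes.
    $escaped = htmlentities($text, ENT_QUOTES, 'UTF-8');

    // Build literal search => replace pairs for the allowed tags only.
    $map = [];
    foreach ($allowed as $tag) {
        $map["&lt;{$tag}&gt;"]  = "<{$tag}>";
        $map["&lt;/{$tag}&gt;"] = "</{$tag}>";
    }

    // strtr() with an array performs literal substring replacement,
    // so a tag carrying attributes never matches and stays escaped.
    return strtr($escaped, $map);
}

echo escape_with_allowed_tags('<b>Hi</b> <script>alert(1)</script>'), "\n";
// prints: <b>Hi</b> &lt;script&gt;alert(1)&lt;/script&gt;
```

Because only the exact strings "&lt;b&gt;", "&lt;/b&gt;" and so on are converted back, something like <b onclick="..."> remains escaped. The trade-off is that a stray closing tag can be restored without its opener, so this is still the "least you can do" approach rather than a full filter.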

When using data in an SQL query, there are some good escaping
functions that can be used. I feel pretty comfortable using
mysql_escape_string() on any data to make SQL injection impractical.
Of course, this shouldn't be a substitute for proper data filtering,
so I'm still talking about the lazy (or "least you can do") approach.

So, while I agree that free-form text is very difficult to filter,
there are some pretty simple steps you can take to mitigate the
risks, or you can adhere to strict practices if you work at it.

Hope that helps.

Chris

=====
Chris Shiflett - http://shiflett.org/

PHP Security - O'Reilly
HTTP Developer's Handbook - Sams
Coming Soon
http://httphandbook.org/

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php