Let's see: Dave Raggett wrote HTML Tidy (tidy-gui is a graphical front end for it). It's available for several platforms and can clean up the HTML that Word generates so that it conforms to W3C standards. Beyond that, I've heard of people using AbiWord and antiword to get text out of Word documents.
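
If you want to script this, something like the following might work as a starting point. It's only a rough sketch: it assumes the `tidy` and `antiword` command-line tools are installed and on your PATH, the helper names (`tidy_command`, `antiword_command`, `run_if_installed`) are made up for illustration, and the particular set of Tidy options is my guess at a reasonable combination for Word-exported HTML, not a recommended recipe.

```python
import shutil
import subprocess


def tidy_command(src_html, dst_html):
    """Build an HTML Tidy command line for Word-exported HTML (illustrative helper)."""
    return [
        "tidy",
        "-quiet",              # suppress nonessential output
        "-clean",              # replace presentational markup with style rules
        "--word-2000", "yes",  # strip the extra markup Word adds when saving as HTML
        "-asxhtml",            # write the result as XHTML
        "-output", dst_html,   # write to this file instead of stdout
        src_html,
    ]


def antiword_command(doc_path):
    """Build an antiword command line; antiword prints plain text to stdout."""
    return ["antiword", doc_path]


def run_if_installed(cmd):
    """Run cmd if its executable is on PATH; return None when it is missing."""
    if shutil.which(cmd[0]) is None:
        return None
    # No check=True: tidy exits non-zero when it merely reports warnings.
    return subprocess.run(cmd, capture_output=True, text=True, errors="replace")


if __name__ == "__main__":
    # "report.doc" is a placeholder file name.
    result = run_if_installed(antiword_command("report.doc"))
    if result is None:
        print("antiword is not installed")
    else:
        print(result.stdout)
```

Building the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces, which Word documents often do.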