On Fri, 2006-06-30 at 19:17 -0700, bruce wrote:
I've been trying (unsuccessfully) to parse/process HTML files. I'm almost certain that the issue has to do with the fact that the HTML is not valid HTML. Running the HTML through various apps (tidy, HTML validator, etc.) produces warnings.
You haven't mentioned what your sources are. Can you fix the broken HTML before trying to parse it? That could mean fixing up your own web pages, or repairing others' broken pages with some sort of proxy between their server and your system.
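
To make the "fix it first" idea concrete, here is a rough sketch of a repair pass. The thread doesn't say which language you're working in, so Python and its standard-library html.parser are assumptions on my part, and the names Repairer and repair_html are made up for illustration. It only balances tags, quotes attributes, and escapes text; it drops comments and doctypes and does not treat script/style content specially, so a dedicated tool such as HTML Tidy will be far more thorough.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Repairer(HTMLParser):
    """Hypothetical helper: re-emit leniently parsed HTML with balanced tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []      # pieces of the repaired document
        self.stack = []    # elements that are currently open

    def handle_starttag(self, tag, attrs):
        parts = []
        for name, value in attrs:
            # Always quote attribute values; valueless attributes become "".
            quoted = escape(value or "", quote=True)
            parts.append(' {}="{}"'.format(name, quoted))
        self.out.append("<{}{}>".format(tag, "".join(parts)))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close everything opened since the matching start tag.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</{}>".format(open_tag))
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again on output.
        self.out.append(escape(data, quote=False))


def repair_html(broken):
    """Return a balanced version of `broken` (assumed to be an HTML string)."""
    parser = Repairer()
    parser.feed(broken)
    parser.close()
    # Close whatever is still open at end of input.
    while parser.stack:
        parser.out.append("</{}>".format(parser.stack.pop()))
    return "".join(parser.out)


print(repair_html("<p>Hello <b>world</p>"))
```

The same function could sit inside the proxy mentioned above, so that whatever parser you use downstream only ever sees the repaired markup.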