hi...
i've been trying (unsuccessfully) to parse/process html files. i'm almost certain that the issue is that the html is not valid html... running the html through various apps (tidy, html validator, etc.) produces warnings.
i'm not sure why the libxml functions (perl) and tidy complain about the structure of the page, and i also can't see a way to get libxml to ignore the warnings.
as such, i've been wondering if anybody else has had this issue, and how you managed to resolve this in an automated manner...
using firefox and the XPath plugin with the DOM Inspector, I'm able to traverse the DOM for the web page. I can also create an XPath query that I can use in the plugin's XPath window to extract/display the correct elements/sections of the page...
i'm curious as to whether it might be possible to use the firefox engine, coupled with the DOM/XPath plugin functionality, to parse the file from a perl/command line app...
has anyone ever done anything like this, or heard of anyone who has... is it even possible to programmatically call the firefox app...
thoughts/comments/etc...
-bruce
and yeah... i've also posted to the firefox email list... but thought it might be useful to post here as well!!
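[One concrete answer to the "can't get libXML to ignore the warnings" part: XML::LibXML exposes libxml2's forgiving HTML parser alongside the strict XML one, plus a recover mode that can silence the complaints. A minimal sketch, assuming a reasonably recent XML::LibXML (check your installed version documents `parse_html_string` and `recover_silently`):]

```perl
use strict;
use warnings;
use XML::LibXML;

# deliberately broken markup: unclosed <li>s and a bare ampersand
my $soup = '<ul><li>alpha<li>beta & gamma</ul>';

my $parser = XML::LibXML->new();
$parser->recover_silently(1);    # recover from errors without printing warnings

# parse_html_string uses libxml2's tag-soup-tolerant HTML parser,
# instead of the strict XML parser that parse_string uses
my $doc = $parser->parse_html_string($soup);

print $_->textContent, "\n" for $doc->findnodes('//li');
```

This lets the same XPath queries that work in the Firefox plugin run against the uncleaned page, with no tidy pass at all.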
On 01/07/06, bruce bedouglas@earthlink.net wrote:
hi...
i've been trying (unsuccessfully) to parse/process html files. i'm almost certain that the issue is that the html is not valid html... running the html through various apps (tidy, html validator, etc.) produces warnings.
I have been having a similar problem with html, though this time the guilty party is mshtml. That piece of dog vomit *cannot be made* to produce xhtml!!! I get the pseudo-html from mshtml, run it through an SgmlReader + converter class and get (x)html out. I can then parse + process the file with standard xml/xsl tools. It took me an age to find good things for .net though - you shouldn't have nearly as many problems on linux/fedora.

I suggest you pass it through tidy and get xhtml out. It may give you some junk but you don't really have many other options...

Cheers
Antoine
Antoine wrote,
I suggest you pass it through tidy and get xhtml out. It may give you some junk but you don't really have many other options...
Tidy is a good choice. I'd also recommend taking a look at TagSoup,
http://mercury.ccil.org/~cowan/XML/tagsoup/
Cheers,
Miles
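[Either cleaner can sit in front of the strict perl parser. A sketch of the two invocations, driven from perl; the tidy flags are from its documentation, and the tagsoup jar path is a placeholder for wherever you installed it. The commands are built but not executed here, so the sketch runs even without the tools on the box:]

```perl
use strict;
use warnings;

# tidy flags per its documentation: -asxhtml converts the output to
# XHTML, -numeric writes numeric entities (safer for XML parsers than
# named ones), -quiet suppresses the banner.
my $file = 'page.html';
my @tidy = ('tidy', '-quiet', '-asxhtml', '-numeric', $file);

# TagSoup always emits well-formed XML, however bad the input; the jar
# path is a placeholder for your install location.
my @tagsoup = ('java', '-jar', 'tagsoup.jar', $file);

# with the tools installed, capture the cleaned markup with e.g.:
#   my $xhtml = qx(@tidy 2>/dev/null);
print "tidy command: @tidy\n";
```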
hi antoine...
i had already tried tidy, with no success... it still seems to generate warnings...
initFile -> tidy -> cleanFile -> perl app (using xpath/libxml)
the xpath/libxml functions in the perl app complain about the file. my thought is that tidy isn't cleaning enough, or that the perl xpath/libxml functions are too strict!
but the weird thing is that i can use firefox with the DOM/XPath plugin, and create an XPath query that generates the correct resulting list of items/elements. however, when i then use the same XPath query and the same web page in my test app, i get the warnings/errors from the perl xpath/libxml functions...
i'm wondering if there's a way that i can call the firefox engine (with the plugins), have it do all the processing/parsing, and let it return the list of items/elements to me...
-bruce
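[One thing that bites people in exactly this pipeline: tidy's documented exit status is 0 for clean input, 1 when it only issued warnings, and 2 when it hit real errors — so warnings alone don't mean the cleaned file is unusable by a strict parser. A sketch of the initFile -> tidy -> cleanFile -> perl stage as one subroutine; the file names and the XPath query are placeholders, and it assumes tidy and XML::LibXML are installed:]

```perl
use strict;
use warnings;

# initFile -> tidy -> cleanFile -> perl app, in one subroutine.
# NB: tidy exits 1 when it only issued warnings -- that output is
# still usable; only a status of 2 or more means unrepairable errors.
sub clean_and_query {
    my ($init, $clean, $xpath) = @_;    # all placeholders

    system("tidy -quiet -asxhtml -numeric -o $clean $init");
    my $status = $? >> 8;
    die "tidy could not repair $init (exit $status)\n" if $status >= 2;

    require XML::LibXML;                # strict XML parse is now ok
    my $doc = XML::LibXML->new->parse_file($clean);
    return $doc->findnodes($xpath);
}

# e.g.:
# print $_->toString, "\n"
#     for clean_and_query('init.html', 'clean.xhtml', '//a/@href');
```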
-----Original Message----- From: Antoine [mailto:melser.anton@gmail.com] Sent: Saturday, July 01, 2006 12:30 AM To: bedouglas@earthlink.net; For users of Fedora Core releases Subject: Re: developing using the firefox engine
On Fri, 2006-06-30 at 19:17 -0700, bruce wrote:
i've been trying (unsuccessfully) to parse/process html files. i'm almost certain that the issue is that the html is not valid html... running the html through various apps (tidy, html validator, etc.) produces warnings.
You haven't mentioned what your sources are. Can you fix the broken HTML before trying to parse it? Whether that means fixing up your own web pages, or fixing others' broken pages with some sort of proxy between their server and your system.