Hi all;
I would like to pull data from some specific web pages that have for-sale ads and insert the data into a database. Anyone know of any tools that can help me with this?
On Wed, 2005-09-21 at 14:48 -0600, kevin.kempter@dataintellect.com wrote:
> Hi all;
> I would like to pull data from some specific web pages that have for-sale ads and insert the data into a database. Anyone know of any tools that can help me with this?
If you are familiar with Perl you could use the LWP package to simulate that you are a browser.
Albert
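
A minimal sketch of the LWP suggestion above, assuming the libwww-perl distribution (LWP::UserAgent) is installed. The User-Agent string, the 30-second timeout, and the helper name fetch_page are illustrative assumptions, not anything from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Fetch a page while sending a browser-like User-Agent header.
# Returns the decoded body, or dies with the HTTP status line on failure.
sub fetch_page {
    my ($url) = @_;
    my $ua = LWP::UserAgent->new(
        agent   => 'Mozilla/5.0 (X11; Linux i686)',  # assumed UA string
        timeout => 30,                               # assumed timeout, in seconds
    );
    my $response = $ua->get($url);
    die "GET $url failed: " . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Runs only when a URL is given on the command line.
if (@ARGV) {
    binmode(STDOUT, ':encoding(UTF-8)');
    print fetch_page($ARGV[0]);
}
```

Saved under a name of your choosing, it would be run as "perl fetch.pl URL". Parsing the ads out of the returned HTML and inserting them into the database are separate steps not shown here.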
-----Original Message-----
From: fedora-list-bounces@redhat.com [mailto:fedora-list-bounces@redhat.com] On Behalf Of Albert A. Modderkolk
Sent: Wednesday, September 21, 2005 3:23 PM
To: For users of Fedora Core releases
Subject: Re: OT - Tool to pull data from the web?

> On Wed, 2005-09-21 at 14:48 -0600, kevin.kempter@dataintellect.com wrote:
> > Hi all;
> > I would like to pull data from some specific web pages that have for-sale ads and insert the data into a database. Anyone know of any tools that can help me with this?
>
> If you are familiar with Perl you could use the LWP package to simulate that you are a browser.
>
> Albert

Yeah. It would depend on the data. You might be able to use wget, but even with that, you may want to wrap it in a (Perl) script to filter out the junk you don't want.
- - Charles Heselton charles.heselton <at> gmail <dot> com CELL: 619-261-6866
PGP KEY: 5A27 58D2 C791 8769 D4A4 F316 7BF8 D1F6 4829 EDCF
In memoriam: http://www.militarycity.com/valor/1029976.html
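
One way the wget-plus-wrapper idea could look, as a rough sketch only: it assumes wget is on the PATH, and that "junk" simply means every line that does not match a pattern you supply. The helper name keep_matching and the two command-line arguments are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keep only the lines of a page that match a pattern; the rest is treated as junk.
sub keep_matching {
    my ($html, $pattern) = @_;
    return grep { /$pattern/ } split /\n/, $html;
}

# Usage (hypothetical): perl filter.pl URL PATTERN
if (@ARGV == 2) {
    my ($url, $pattern) = @ARGV;
    # List-form pipe open avoids passing the URL through a shell.
    open(my $wget, '-|', 'wget', '-q', '-O', '-', $url)
        or die "cannot run wget: $!\n";
    my $html = do { local $/; <$wget> };   # slurp the whole page
    close($wget);
    $html = '' unless defined $html;
    print "$_\n" for keep_matching($html, qr/$pattern/i);
}
```

Line-based filtering is crude for HTML, since an ad can span several lines. For anything beyond a quick look, a real HTML parser is probably the safer choice.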
On Thu, 2005-09-22 at 14:57 -0700, Charles Heselton wrote:
> Yeah. It would depend on the data. You might be able to use wget, but even with that, you may want to wrap it in a (Perl) script to filter out the junk you don't want.
By the way, how can I write a regexp that will replace ANY junk pattern I find in a page with ANY OTHER pattern? It doesn't matter whether it is href content, plain text, a set of HTML formatting tags, etc. I write: $content=~s#$a#$b#g; but it behaves weirdly.
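
A likely cause, assuming the junk string is meant to be matched literally: inside s###, the contents of $a are interpreted as a regexp, so characters such as ".", "?", "(" and "[" (common in hrefs and tags) act as metacharacters. Wrapping the variable in \Q...\E quotes them. Separately, $a and $b are the variables sort uses, so other names are safer. The sketch below uses made-up names and sample data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace every literal occurrence of $junk in $content with $replacement.
sub replace_literal {
    my ($content, $junk, $replacement) = @_;
    # \Q...\E escapes regexp metacharacters, so $junk is matched as plain text.
    $content =~ s#\Q$junk\E#$replacement#g;
    return $content;
}

# Sample data (hypothetical): the junk string contains "." and "?".
my $page = '<a href="ad.cgi?id=1&x=(2)">old</a> <b>ad.cgi?id=1</b>';
print replace_literal($page, 'ad.cgi?id=1', 'item/1'), "\n";
```

Without \Q...\E, the pattern ad.cgi?id=1 means "ad", any character, "cg", an optional "i", then "id=1", which does not match the literal text at all. If the junk really is meant to be a regexp, leave \Q...\E out and escape only the literal parts by hand.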