14 February 2006

Finding URLs in IRC

Yet more IRC bot coding tonight. This time my goal was to have the bot look for any URLs pasted to the chan. For each one found, it should grab and return the title for the page.

The first step was to make a nice regex that will find most URLs. Here’s the one I came up with:

url_pattern = re.compile(r'http://\w+\.\S+\w')
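To get a feel for what the pattern picks up, here is a small sketch that runs it over a made-up chat message. The message text is an assumption for illustration, and the pattern is written as a raw string with the dot escaped so that it only matches a literal "." in the host name.

```python
import re

# The post's pattern, as a raw string with the dot escaped so it matches
# a literal "." rather than any character.
url_pattern = re.compile(r'http://\w+\.\S+\w')

# Hypothetical chat message, made up for this example.
msg = "see http://example.com/page.html and http://python.org."

urls = url_pattern.findall(msg)
print(urls)  # ['http://example.com/page.html', 'http://python.org']
```

The trailing \w is what keeps the sentence-ending period out of the second match. Note that the pattern only looks for http://, so an https:// link would be skipped.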

Next, in Bot.privmsg(), I loop through any URLs found, fetch each page with urllib, pull the title out with BeautifulSoup, and send it back to the channel:

for url in url_pattern.findall(msg):
    try:
        sock = urllib.urlopen(url)
        pagetext = sock.read()
        sock.close()
        soup = BeautifulSoup(pagetext)
        soup.done()
        pagetitle = soup.title.string.strip()
        self.msg(channel,"Title: %s (at %s)" % (pagetitle, url))
        print "URL: %s (%s)" % (url, pagetitle)
    except IOError:
        print "Error: Unable to urlopen %s" % url
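The snippet above is Python 2 (urllib.urlopen and the print statement). For anyone trying this on Python 3, the same idea might look roughly like the sketch below. It is not the bot's actual code: it swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages, and the names TitleParser, extract_title, fetch_title and titles_for are all made up for this example.

```python
import re
import urllib.request
from html.parser import HTMLParser

URL_PATTERN = re.compile(r'http://\w+\.\S+\w')


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.finished = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.finished:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.finished = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(pagetext):
    """Return the stripped page title, or None if the page has no title text."""
    parser = TitleParser()
    parser.feed(pagetext)
    parser.close()
    title = "".join(parser.parts).strip()
    return title or None


def fetch_title(url, timeout=10):
    """Download a page and return its title; assumes UTF-8, replacing bad bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as sock:
        pagetext = sock.read().decode("utf-8", errors="replace")
    return extract_title(pagetext)


def titles_for(msg):
    """Yield (url, title) for each URL in a message; title is None on failure."""
    for url in URL_PATTERN.findall(msg):
        try:
            yield url, fetch_title(url)
        except OSError:  # urllib's URLError is a subclass of OSError
            yield url, None
```

Inside a bot, something like `for url, title in titles_for(msg):` would replace the loop above, with the bot deciding what to say when the title is None.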

Again, like my last attempt, a lot could be done to improve this. But it works.

Also, if you are doing any sort of screen scraping or HTML parsing and haven’t yet tried BeautifulSoup, you really should. It makes sense out of the absolute worst HTML and allows you to easily target data buried in piles of markup. Highly recommended.
