Finding URLs in IRC

Yet more IRC bot coding tonight. This time my goal was to have the bot watch for any URLs pasted into the channel. For each one it finds, it should fetch the page and report its title back to the channel.

The first step was to make a nice regex that will find most URLs. Here’s the one I came up with:
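The pattern itself is not reproduced in this post. As a rough stand-in, here is a minimal sketch of a regex in the same spirit, written for Python 3. Only the name url_pattern comes from the loop further down; the expression, the TRAILING constant, and the find_urls() helper are assumptions made for illustration, not the original code.

```python
import re

# Hypothetical pattern (an assumption, not the one from the original post):
# match http(s) URLs and bare "www." hosts, stopping at whitespace,
# angle brackets, or a double quote.
url_pattern = re.compile(r'(?:https?://|www\.)[^\s<>"]+', re.IGNORECASE)

# Punctuation that usually belongs to the sentence rather than the URL.
TRAILING = ".,;:!?)]}'\""


def find_urls(msg):
    """Return the URLs found in msg, with trailing punctuation removed."""
    return [match.rstrip(TRAILING) for match in url_pattern.findall(msg)]


if __name__ == "__main__":
    print(find_urls("see http://example.com/page, or (www.example.org)."))
```

Stripping trailing punctuation is a trade-off: it keeps a sentence-ending period or closing parenthesis out of the match, but it will also clip the rare URL that really does end in ")". For a chat bot that only needs to find most URLs, that seemed like an acceptable compromise for this sketch.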

Next, in Bot.privmsg(), I loop through any URLs found and print the title using urllib and BeautifulSoup:

    for url in url_pattern.findall(msg):
        try:
            sock = urllib.urlopen(url)
            pagetext = sock.read()
            sock.close()
            soup = BeautifulSoup(pagetext)
            soup.done()
            pagetitle = soup.title.string.strip()
            self.msg(channel, "Title: %s (at %s)" % (pagetitle, url))
            print "URL: %s (%s)" % (url, pagetitle)
        except IOError:
            print "Error: Unable to urlopen %s" % url

Again, like my last attempt, a lot could be done to improve this. But it works.
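As one possible direction for those improvements, here is a sketch that copes with pages that have no title, collapses stray whitespace, and puts a timeout and a size cap on the download. It is written for Python 3 and swaps BeautifulSoup for the standard library's html.parser so that it runs without extra dependencies. The names TitleParser, extract_title(), and fetch_title(), along with the timeout and max_bytes defaults, are all assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_closed = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        # Ignore later <title> tags (for example inside inline SVG).
        if tag == "title" and not self.title_closed:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.title_closed = True

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


def extract_title(pagetext):
    """Return the page title with whitespace collapsed, or None if absent."""
    parser = TitleParser()
    parser.feed(pagetext)
    parser.close()
    title = " ".join("".join(parser.title_parts).split())
    return title or None


def fetch_title(url, timeout=10, max_bytes=65536):
    """Download at most max_bytes from url and return its title, or None."""
    try:
        with urlopen(url, timeout=timeout) as sock:
            raw = sock.read(max_bytes)
    except (OSError, ValueError):
        # URLError subclasses OSError; ValueError covers URLs without a scheme.
        return None
    return extract_title(raw.decode("utf-8", errors="replace"))
```

Inside the loop, the bot would call fetch_title(url) and only send a message when the result is not None. Decoding everything as UTF-8 is a simplification; a fuller version would look at the charset the server reports.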

Also, if you are doing any sort of screen scraping or HTML parsing and haven’t yet tried BeautifulSoup, you really should. It makes sense out of the absolute worst HTML and allows you to easily target data buried in piles of markup. Highly recommended.

