Pragana's Tcl Guide


Old notes


A custom robot for your fetch needs

Sometimes I need robots to fetch many web pages at once; I don't like clicking links by hand to get the same result. But the links are not always simply numbered pages. You may have to work out some logic for navigating, and build that logic into your robot to get the results.

Our study goal is to fetch images from the well-known User Friendly cartoon, by Illiad. It has a cool collection of geek stories about customer support. Looking at their archives, you will find that the links bringing us to each archived strip look something like http://ars.userfriendly.org/cartoons/?id=20020618&mode=classic, where 20020618 is the date. If you fetch that URL, you get a page with several pictures, but only one is what we need: the cartoon strip itself. Try to find something in the page that uniquely marks its link. In this case it is simple: <img border="0" src="url of the image". You will have to look at the HTML source of the page (use vim or another editor to search) to make sure the link template is unique... Don't be too lazy!

Our small robot will then:

- generate a date string for each day, starting Nov 20, 1997;
- fetch the archive page for that date;
- extract the strip image's URL from the page with a regular expression;
- fetch the image itself and save it to disk.

Let's solve each problem. First, how do we generate sequential date strings? In Tcl there is a powerful clock command with an option to scan natural-language strings and convert them into a numeric unix date. Suppose we want the date Nov 20, 1997: the command clock scan "Nov 20, 1997" returns the number 879994800, which is that date in unix format. More interestingly, we may give clock scan dates in a more complex format, like "Nov 20, 1997 + 5 days". We will use that property to iterate over dates without messy date arithmetic. It is then easy to convert the result into the date string our URL needs, which shall be 19971120. The command? The same clock, but with another option, format: clock format $date -format %Y%m%d. The format string options are %Y (the year as four digits), %m (the month), %d (the day). Isn't that easy?
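The date trick above can be sketched in a few lines; the dates here are just for illustration:

```tcl
# Convert a natural-language date to a unix timestamp
set date [clock scan "Nov 20, 1997"]

# Relative offsets can be embedded right in the string
set later [clock scan "Nov 20, 1997 + 5 days"]

# Format both back into the YYYYMMDD form the archive URLs use
puts [clock format $date  -format "%Y%m%d"]   ;# 19971120
puts [clock format $later -format "%Y%m%d"]   ;# 19971125
```

In a loop, you only change the number of days in the string; clock scan does the month and year rollovers for you.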

OK, you already know how to fetch pages with Tcl's http library. But we need to fetch twice: first to get the referring HTML page, then to get the final image we want. How do we pull the image's URL out of the page? Our old friend the regexp command. Here is the regular expression that does the job: <img border="0" src="([^"]*) It works because we first match some unique context (as described above) that doesn't show up elsewhere in the page, and then a sub-expression captures the image's URL. This is the easy part, as no quotes can appear inside the URL.
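Here is the extraction step on its own, run against a made-up snippet of page source (the image URL below is invented for the example):

```tcl
# A fake fragment of the archive page's HTML
set data {<td><img border="0" src="http://www.example.org/strips/uf000123.gif" alt="strip"></td>}

# The sub-expression ([^"]*) captures everything up to the closing quote;
# regexp returns 1 on a match, so we can test before using $imgurl
if {[regexp {<img border="0" src="([^"]*)} $data match imgurl]} {
    puts $imgurl   ;# http://www.example.org/strips/uf000123.gif
}
```

Checking regexp's return value matters in the real robot too: if the page layout ever changes, $imgurl would otherwise be left unset and the script would error out.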

Enough talking! Let's see the full code of our tiny robot. It fetches 1700 User Friendly daily strips, starting Nov 20, 1997. The code is so simple I wonder how many lines it would take if written in Java...

package require http

for {set n 0} {$n < 1700} {incr n} {
    set datestr "Nov 20, 1997 + $n days"
    set id [clock format [clock scan $datestr] -format "%Y%m%d"]
    set h [http::geturl "http://ars.userfriendly.org/cartoons/?id=$id&mode=classic"]
    puts "geturl $id"
    update
    if {[http::ncode $h] != 404} {
        set data [http::data $h]
        if {[regexp {<img border="0" src="([^"]*)} $data m imgurl]} {
            http::cleanup $h
            # fetch the image itself; -binary keeps http from mangling the gif
            set h [http::geturl $imgurl -binary 1]
            set f [open $id.gif w]
            fconfigure $f -translation binary
            puts -nonewline $f [http::data $h]
            close $f
        }
    }
    http::cleanup $h
}

That's all, fellows. Happy hacking!
