Here is a copy of the rather lengthy comment that I posted on Ben Hammersley’s weblog concerning a discussion about a new news aggregator product called NewsMonster:
I haven’t tried NewsMonster yet, but based on the discussion, it appears that the functionality that it most closely resembles is the “Offline Web Pages” feature of Internet Explorer for Windows. It also would appear that most people contributing to this discussion have not used this feature before, and therefore don’t appreciate just how valuable it is. If you haven’t used it, here’s a quick overview:
Offline Web Pages drives Internet Explorer just as if a live user were driving it. It stores complete web pages and all linked images and other content elements in IE’s regular cache. It’s completely user-configurable: it can store complete sites or just single pages depending on the URL; it can recursively dive down up to 3 (I think) levels deep; it can follow links to “outside” sites or stay within the domain specified by the initial URL; it can run on a schedule, on system events like startup or shutdown, or on demand; and it can traverse and cache a single site or a whole list of sites.
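To make the mechanics a bit more concrete, here’s a rough sketch in Python of what a depth-limited, same-domain offline fetcher does. This is purely illustrative and has nothing to do with how IE actually implements the feature; the function names, the cache layout, and the depth limit of 3 are just assumptions for the example.

```python
# Rough sketch of a depth-limited offline page fetcher, loosely modeled on the
# idea behind IE's "Offline Web Pages" feature. Illustrative only -- this is
# NOT how IE actually implements it.
import os
import urllib.request
from urllib.parse import urljoin, urlparse
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href/src attributes so linked pages and images get cached too."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def cache_site(start_url, cache_dir, max_depth=3, same_domain_only=True):
    """Fetch start_url and everything it links to, up to max_depth levels deep."""
    os.makedirs(cache_dir, exist_ok=True)
    seen = set()
    queue = [(start_url, 0)]
    start_host = urlparse(start_url).netloc

    while queue:
        url, depth = queue.pop(0)
        if url in seen or depth > max_depth:
            continue
        seen.add(url)

        if same_domain_only and urlparse(url).netloc != start_host:
            continue  # stay within the site the user asked for

        try:
            with urllib.request.urlopen(url) as resp:
                body = resp.read()
                content_type = resp.headers.get("Content-Type", "")
        except OSError:
            continue  # skip pages that fail; a real agent would retry or log

        # Store the page locally (the filename scheme here is deliberately simplistic).
        name = urlparse(url).path.strip("/").replace("/", "_") or "index.html"
        with open(os.path.join(cache_dir, name), "wb") as f:
            f.write(body)

        # Only HTML pages are parsed for further links to follow.
        if "html" in content_type and depth < max_depth:
            parser = LinkExtractor()
            parser.feed(body.decode("utf-8", errors="replace"))
            for link in parser.links:
                queue.append((urljoin(url, link), depth + 1))
```

Hook something like this up to a scheduler and point it at a list of start URLs, and you have the essence of the feature.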
From the user’s perspective, you just run IE, put it into offline mode, then browse the site(s) as you would normally. There’s no difference between that and browsing the site online, except that the offline experience is blazingly fast, much faster than browsing online even over DSL or other broadband.
The way I used to use this feature was as follows: I have a half-hour train ride to and from work every day. I had my laptop set to download a list of sites every weekday morning at 5 a.m. and again in the afternoon at 4 p.m. The sites included CNET, NYT-Tech, Wired, GMSV and a few others. I could then read the news on the train using my laptop with IE in offline mode. This was a tremendous time-saver for me. I’ve since switched to using a Pocket PC for the train ride, but I still use Offline Web Pages for a few sites that I look at in the evenings at home.
Remember that the vast majority of web users are still stuck with 56K dialup, and will be for years to come. Using Offline Web Pages vastly improves the experience of browsing the web in that environment, as well as extending the availability of the web into situations where it isn’t currently accessible. Are Offline Web Pages inefficient from a server perspective? Certainly. Nevertheless, the feature is invaluable under certain circumstances.
KEY POINT: If Offline Web Pages obeyed the Robot Exclusion Protocol, this valuable feature would be rendered completely useless.
So what’s the answer? First, recognize that IE’s Offline Web Pages and (apparently) NewsMonster are neither robots in the “classic” search-engine sense nor flesh-and-blood users, but a hybrid of the two. The solution should be twofold:
First, the offline user agents need to be very smart and efficient. They shouldn’t download content that they already have in their cache. (Sites like CNET, which have multiple CMS-generated URLs pointing to the same article, complicate this.) And they should learn from the user’s history and only download pages that the user is likely to actually read, which is easier said than done!
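One concrete way an offline agent can avoid re-downloading what it already has is to use HTTP/1.1 conditional requests (If-Modified-Since and If-None-Match). The sketch below illustrates the idea in Python; it isn’t a description of what NewsMonster or IE actually do.

```python
# Sketch: skip re-downloading a page that hasn't changed since we last cached
# it, using standard HTTP/1.1 conditional requests. Illustrative only.
import urllib.request
from urllib.error import HTTPError


def fetch_if_changed(url, cached_etag=None, cached_last_modified=None):
    """Return (body, etag, last_modified), or (None, ...) if the cached copy is still fresh."""
    request = urllib.request.Request(url)
    if cached_etag:
        request.add_header("If-None-Match", cached_etag)
    if cached_last_modified:
        request.add_header("If-Modified-Since", cached_last_modified)

    try:
        with urllib.request.urlopen(request) as resp:
            return (resp.read(),
                    resp.headers.get("ETag"),
                    resp.headers.get("Last-Modified"))
    except HTTPError as err:
        if err.code == 304:  # Not Modified: the cached copy is still good
            return None, cached_etag, cached_last_modified
        raise
```

The CNET-style duplicate-URL problem is harder; one crude approach is to hash each response body and skip storing anything whose hash you’ve already seen.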
Second, the Robot Exclusion Protocol is ancient by Internet standards, and could probably use an update to better handle this situation. Perhaps it could redirect bots to an alternate URL that would allow them to operate more efficiently. Or maybe there’s already some other technology that would be more appropriate.
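Just to illustrate the kind of hint I have in mind, here’s a purely hypothetical robots.txt extension. The “Offline-Source” directive and the “offline-agents” group below don’t exist in any spec; they’re invented for this example only.

```
# Purely hypothetical extension to the Robot Exclusion Protocol.
# The "Offline-Source" directive and "offline-agents" group are NOT real;
# they only illustrate the kind of hint the protocol could carry.
User-agent: *
Disallow: /search
Disallow: /cgi-bin/

# Hypothetical: point offline/personal agents at a compact, pre-packaged
# copy of the content instead of having them crawl page by page.
User-agent: offline-agents
Offline-Source: http://example.com/export/site-digest.rss
```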