Cache-22

Copying and storing Web pages is vital to the Internet’s survival — but is it legal?

By ERIC SCHLACHTER

One night, while browsing the World Wide Web through his America Online account, a colleague of mine came across a site that linked to our firm’s Web site (www.cooley.com). In a moment of curiosity, he clicked on the link to retrieve the firm’s home page. To his horror, the page he received was a long-outdated version that did not reflect our firm’s recent investments to improve the site. The attorney logged out wondering why the site that linked to our firm’s Web site had chosen to link to an outdated page.

My colleague, and our firm, were the victims of “caching” by America Online.

Okay, but what is “caching”?

Caching is a nebulous term, tautologically defined as the process of storing something in a place of storage. On the Internet, caching occurs at multiple levels.

First, many browsers cache "locally," storing recently visited Web pages in the computer's RAM (random access memory). For example, a person running Netscape Navigator who selects the "back" button will usually retrieve a page from RAM instead of receiving a "fresh" copy downloaded from the actual Web site.
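To make the mechanism concrete, here is a minimal sketch in Python (purely illustrative; no real browser is implemented this way, and the fetch function is a hypothetical stand-in) of a local cache that lets the "back" button skip the network entirely:

    # A toy browser with a RAM cache: "visit" fetches a fresh copy, while
    # "back" re-serves the page already held in memory, with no new download.

    class Browser:
        def __init__(self):
            self.cache = {}    # URL -> page contents held in RAM
            self.history = []  # URLs in the order they were visited

        def visit(self, url):
            page = fetch_from_network(url)  # fresh copy from the Web site
            self.cache[url] = page          # ...but a copy stays in RAM
            self.history.append(url)
            return page

        def back(self):
            # The "back" button: the prior page comes from RAM, so the
            # Web site never sees this request.
            self.history.pop()
            return self.cache[self.history[-1]]

    def fetch_from_network(url):
        # Stand-in for a real HTTP request over the wire.
        return "<html>contents of %s</html>" % url

    b = Browser()
    b.visit("http://www.cooley.com/")
    b.visit("http://www.example.com/")
    print(b.back())  # served from RAM, with no traffic to www.cooley.com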

Caching also occurs at the server level, where it is termed "proxy" caching. The most obvious users of proxy caching are online services such as AOL, CompuServe and Prodigy, which store the most frequently requested pages on their own computers. Then, when a user requests a page that has been cached, the online service delivers a copy from its own computers' memory, not from the Web site in question. This is exactly what happened to our intrepid attorney.
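A proxy cache can be sketched the same way (again in illustrative Python, not any online service's actual code): a single store shared by many subscribers, where only the first request for a page ever reaches the originating Web site:

    # A toy proxy cache of the kind an online service might run: the first
    # subscriber to request a URL pulls it from the origin Web site; every
    # later subscriber gets the service's stored copy, however stale.

    class ProxyCache:
        def __init__(self):
            self.store = {}  # URL -> copy held on the service's computers

        def request(self, url):
            if url in self.store:
                # Cache hit: the originating Web site never sees this
                # request, and if the site has changed since the copy was
                # made, the subscriber receives the outdated version.
                return self.store[url]
            page = fetch_from_origin(url)  # cache miss: go to the real site
            self.store[url] = page
            return page

    def fetch_from_origin(url):
        # Stand-in for an actual retrieval from the subject Web site.
        return "<html>contents of %s</html>" % url

    aol = ProxyCache()
    aol.request("http://www.cooley.com/")  # fetched from the firm's site
    aol.request("http://www.cooley.com/")  # served from AOL's own computers

Note that on a cache hit the subject Web site never learns the request occurred, a point that matters again below when we turn to page impressions.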

TO CACHE. . .

On the Internet, caching speeds user access to Web pages and reduces demands on a limited infrastructure. The following diagram indicates the typical data flow of a Web page requested by a user:

[Subject Web site]
        ↓
[Subject Web site's connection to the Internet]
        ↓
[Internet]
        ↓
[Requester's connection to the Internet]
        ↓
[Requester's computer]

Several of these levels are subject to congestion and therefore can benefit from caching.

First, the Web site’s server may be overloaded and therefore unable to process requests. Second, the Web site may have an inadequate connection to the Internet or may be using an intermediate access provider that is subject to its own congestion. Third, the Internet is subject to congestion, as the data is broken into packets and sent via potentially congested pipelines to different computers that may be backlogged with other data to process. Fourth, the requester’s access provider may be congested. Finally, the requester — the person accessing the Web — may have an insufficient connection to the Internet or be running an underpowered computer.

If some of these steps are bypassed, the benefits are obvious: With fewer stops at potentially congested sites, the data is delivered faster. Conversely, if every request for every Web page were filled by going through the full process described above rather than from a cache, the increased data flow could easily overwhelm the already-teetering infrastructure of the Internet, making it a victim of its own success. Not only would such demand interfere with the smooth operation of the Web, but it would affect the flow of all data on the Internet — e-mail, net telephony, FTP, and so on.

. . .OR NOT TO CACHE

While caching is currently essential to the successful functioning of the Web, caching (and in particular proxy caching) has created a number of problems for both users and publishers of content.

The first major problem, as the opening example showed, is that caching interferes with the ability of Web sites to control what data is delivered to people requesting a page. In our law firm’s case, AOL users could believe, incorrectly, that our firm is not investing resources to maintain our Web presence — and in the profession of law, where image is critical, this perception could harm the firm’s practice.

However, Web page owners’ lack of control over the caching process can have even more insidious results. For example, imagine that a Web publisher discovers that some information she has posted is harmful in some way — perhaps the information is inaccurate or infringes someone’s copyright. Even if the publisher discovers this problem and corrects it on her site, the harmful information will be disseminated to end users until all caches containing the old version of the page are refreshed. Furthermore, if users do not know they are receiving pages from a cache, they may incorrectly assume they are getting up-to-date information. If someone is seeking real-time stock market quotes or does not realize that an analysis of the law has been mooted by subsequent developments, the consequences could be painful or expensive.

Also, consider Web sites that sell advertising on a time-sensitive basis — such as a banner ad slot between 6:30 and 7 p.m. Unless all proxy caches are refreshed precisely at the beginning and end of the period for which advertising is sold, such Web sites cannot successfully implement this plan — either the ad will get less time (possibly no time, if the cache does not refresh at all during the ad’s time slot) or more time (for example, if the cache refreshes at 6:30 p.m. and then does not refresh for another 48 hours) than was paid for.
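To put rough numbers on this (the figures below are the hypothetical ones from the example, worked through in Python):

    # Hypothetical numbers from the example above: an ad slot sold for the
    # half hour between 6:30 and 7:00 p.m., versus what a cache delivers.

    paid_minutes = 30

    # If the cache refreshes at 6:30 p.m. and then not for another 48
    # hours, the cached ad stays on display for the full 48 hours.
    served_minutes = 48 * 60
    print(served_minutes / paid_minutes)  # 96.0: 96 times the time paid for

    # If the cache never refreshes during the slot, the new ad is never
    # delivered from the cache at all.
    print(0 / paid_minutes)  # 0.0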

The second major problem is that caching interferes with Web sites' analysis of their users. This problem is most acute for Web sites that charge advertisers based on the amount of data delivered to users.

For example, most major advertising-driven Web sites, such as HotWired, Pathfinder and Netscape, charge based on the number of times a banner advertisement is displayed to users (often called "page impressions"). Since a cached page is downloaded from the cache and not from the owner's Web site, the Web site owner does not know whether or how often a given page was viewed from the cache and cannot charge advertisers for those page impressions. Predictably, this makes advertising-driven Web site owners unhappy, since caching means lost revenue. In fact, page impression data is so valuable to advertising-driven Web sites that at least one online service markets to Web site owners data about the number of page impressions delivered from its cache.

Page impressions are not the only way Web site owners extract value from understanding user activity. Indeed, a whole science has developed around analyzing "server logs," which record the activities of Web site users. Again, when this data flows to the online service rather than to the Web site, the Web site cannot realize the full value of its relationship with its users.

Finally, the caching entity itself faces some peril from proxy caching. Under some paradigms of online law (which are very much in flux right now), a caching entity could be liable for the claims of defamation, invasion of privacy and other torts that publishers and republishers face. The case law also suggests that a caching entity could be liable for copyright infringement, both of the Web sites being cached and of third parties whose copyrights the cached pages infringe. Last but not least, the proxy cache could contain pornographic or obscene materials, exposing the caching entity to liability for such material, or at least to harassment by zealous prosecutors.

THE LAW OF THE CACHE

Caching implicates a number of the exclusive rights of copyright holders under 17 U.S.C. §106, including (i) reproduction (by making an extra copy into RAM or possibly a hard drive) and, in the case of proxy caching, (ii) distribution, (iii) public display, and possibly (iv) public performance and (v) digital performance. Despite the sometimes illogical consequences of treating a copy in RAM as copyright infringement, the courts have consistently reached this result.

However, caching may be “fair use” under 17 U.S.C. §107, providing a defense to an infringement action. The multi-factor fair-use test considers the purpose of the use; whether the infringed work was published or unpublished and was fact or fiction; the amount and substantiality of the portion taken; and the effect of the infringement on the market for the work. Given its multi-factor analysis, litigation over whether a use was fair tends to be cumbersome. The following is a preliminary breakdown of how the factors might play out in a suit:

  • Purpose of use. Proxy caches normally operate to benefit customers and reduce investment in infrastructure. Thus, such caching has a commercial purpose. Although the facts will differ in each case, this factor is more likely to weigh against fair use.
  • Fact/fiction; published/unpublished. All material available on the Internet is by definition “published” for purposes of copyright. Whether the cached work is fact or fiction will depend on the specific circumstances.
  • Amount and substantiality of portion taken. Caches almost invariably make a copy of entire Web pages, which in turn may have a number of elements — graphics, for example — that are subject to their own copyright. In these situations, the amount taken will be 100 percent of copyrighted works, which usually (but not always) precludes a finding of fair use.
  • Effect on the market. Under copyright jurisprudence, this is the most important factor. While it is difficult to define the “market” for Web pages that are made available for free, caching causes Web sites to undercount page impressions — information of value — so caching could be deemed to interfere with the market for page impression data. On the other hand, for pages that do not sell advertising to third parties (www.cooley.com, for example), it is very difficult to define what “market” is interfered with by caching. Because it is difficult to know how a court will analyze this factor, it is equally difficult to know if caching will be deemed fair use.

The fair-use factors, perhaps predictably, lead to no definitive answers. Thus, relying on fair use to justify caching, particularly proxy caching, is a precarious position under existing copyright law. In fact, in one recent transaction I negotiated with a major online service, the service added a “license to cache” my client’s Web site to an agreement that otherwise had nothing to do with caching — presumably to remove any doubt.

Although proxy caching appears to be more problematic than local caching, it is not clear that local caching will be free from suit. For example, such a suit might arise in the case of a large company where the cumulative effects of local caching by many Web browsers (perhaps combined with statutory damages and attorneys' fees) are significant.

One last point under copyright law: Some have argued that caching is permitted under an "implied license" that Web site owners grant simply by making their content available over the Internet. This argument makes sense in that if Web site owners want users to browse — that is, load pages into RAM, and thereby make a copy — they must grant an implied license to make that copy. However, as applied to caching, the argument is predicated on the existence of technology that permits Web sites to control caching; on this theory, any Web site that fails to use such technology grants an implied license to cache. Yet nowhere else in existing copyright law does a copyright holder's failure to use technology to reduce infringement create an implied license to infringe. Indeed, placing the burden on copyright holders would be inconsistent with the general legislative trend toward increased protection for copyright holders.

CACHE ME IF YOU CAN

Some existing technologies allow Web sites to control the caching process. First, Web sites can create "dynamic pages" that are displayed to users only after the user initiates a server-resident program (a "CGI script"). While this solves the problem, CGI scripts are currently somewhat expensive to program. Second, Web sites can code their pages with "expiry" information, which tells the proxy cache when to refresh the cached page (see the sketch below). However, there are no current standards for recognizing expiry information, so Web sites that properly code their pages may find that some proxy caches ignore their instructions. Furthermore, because of the danger of persistent inaccurate information, Web sites have incentives to make the expiration time short or to code the page so that it will not be cached at all, which then reduces the benefits of caching for everyone.
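As an illustration of the expiry approach, here is a minimal CGI-style sketch. It is written in modern Python for readability (a script of the era would more likely have been Perl or C), and the one-hour lifetime is an assumed policy; the Expires header itself is the standard HTTP mechanism for stamping a page with a staleness date, though, as noted, not every proxy honors it:

    #!/usr/bin/env python
    # A CGI script that emits an HTTP Expires header telling any cache
    # along the way when its stored copy of this page goes stale.

    import time
    from email.utils import formatdate  # produces the date format HTTP uses

    TTL = 3600  # an assumed policy: let caches keep this page for one hour

    print("Content-Type: text/html")
    print("Expires: " + formatdate(time.time() + TTL, usegmt=True))
    print()  # a blank line ends the HTTP headers
    print("<html><body>This page may be cached until the Expires time.</body></html>")

A site worried about persistent stale copies would set a very short lifetime, or an Expires date already in the past, which is exactly the incentive problem described above: the safer the site plays it, the less anyone benefits from caching.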

As for copyright law, any non-technological solution permitting caching will need to be legislative. Prior situations in which the fair-use doctrine was stretched have been resolved by amending the copyright law. Examples include copying into RAM done by computers during normal operation (17 U.S.C. §117) and the "ephemeral recordings" made by broadcasters (17 U.S.C. §112), both now protected uses under copyright law. Without a similar legislative response, confusion over caching will reign.

Whether or not caching should be permitted under copyright law ultimately depends on a policy determination about the operation of the Internet. Treating caching as an infringement will increase data flows over the Internet and likely overwhelm its existing infrastructure, which in turn will require enormous investments in infrastructure expansion or a sure-to-be-unpopular congestion/metering pricing scheme. On the other hand, if caching is legal, then standards must be developed to allow Web sites to be able to control the way their information is published online.

Because caching is fundamental to the Internet, a non-technological solution to the problem will require Solomon-like wisdom. In the absence of such wisdom, we can expect to hear plenty more about caching in the near future.


Eric Schlachter is an attorney practicing cyberspace law with Cooley Godward Castro Huddleson & Tatum. He is also an adjunct professor of cyberspace law at Santa Clara University School of Law.