It does so by providing a more efficient mechanism for distributing information on the Web. Consider an example from our physical world: the distribution of books. Specifically, think about how a book gets from publisher to consumer. Publishers print the books and sell them, in large quantities, to wholesale distributors. The distributors, in turn, sell the books in smaller quantities to bookstores. Consumers visit the stores and purchase individual books. On the Internet, web caches are analogous to the bookstores and wholesale distributors.
The analogy is not perfect, of course. Books cost money; web pages (usually) don't. Books are physical objects, whereas web pages are just electronic and magnetic signals. It's difficult to copy a book, but trivial to copy electronic data.
The point is that both caches and bookstores enable efficient distribution of their respective contents. An Internet without caches is like a world without bookstores. Imagine 100,000 residents of India each buying one copy of Comics Book from the publisher in India. Now imagine 50,000 Internet users in Australia each downloading the Yahoo! home page every time they access it. It's much more efficient to transfer the page once, cache it, and then serve future requests directly from the cache.
In order for caching to be effective, the following conditions must be met:
- Client requests must exhibit locality of reference.
- The cost of caching must be less than the cost of direct retrieval.
It's not always obvious that the second requirement is true. We need to compare the costs of caching to the costs of not caching. Numerous factors enter into the analysis, some of which are easier to measure than others. To calculate the cost of caching, we can add up the costs for hardware, software, and staff time to administer the system. We also need to consider the time users save waiting for pages to load (latency) and the cost of Internet bandwidth.
Three primary benefits of caching web content:
- To make web pages load faster (reduce latency)
- To reduce wide area bandwidth usage
- To reduce the load placed on origin servers
Types of Web Caches
Web content can be cached at a number of different locations along the path between a client and an origin server. First, many browsers and other user agents have built-in caches. For simplicity, I'll call these browser caches. Next, a caching proxy (a.k.a. "proxy cache") aggregates all of the requests from a group of clients. Lastly, a surrogate can be located in front of an origin server to cache popular responses.
Browsers and other user agents benefit from having a built-in cache. When you press the Back button on your browser, it reads the previous page from its cache. Nongraphical agents, such as web crawlers, cache objects as temporary files on disk rather than keeping them in memory.
Netscape Navigator lets you control exactly how much memory and disk space to use for caching, and it also allows you to flush the cache. Microsoft Internet Explorer lets you control the size of your local disk cache, but in a less flexible way. Both have controls for how often cached responses should be validated. People generally use 10–100MB of disk space for their browser cache.
A browser cache is limited to just one user, or at least one user agent. Thus, it gets hits only when the user revisits a page. As we'll see later, browser caches can store "private" responses, but shared caches cannot.
Caching proxies, unlike browser caches, service many different users at once. Since many different users visit the same popular web sites, caching proxies usually have higher hit ratios than browser caches. As the number of users increases, so does the hit ratio.
Caching proxies are essential services for many organizations, including ISPs, corporations, and schools. They usually run on dedicated hardware, which may be an appliance or a general-purpose server, such as a Unix or Windows NT system. Many organizations use inexpensive PC hardware that costs less than $1,000 / 45000 Rs. At the other end of the spectrum, some organizations pay hundreds of thousands of dollars/ Rupees, or more, for high-performance solutions from one of the many caching vendors.
Caching proxies are normally located near network gateways (i.e., routers) on the organization's side of its Internet connection. In other words, a cache should be located to maximize the number of clients that can use it, but it should not be on the far side of a slow, congested network link.
Caching Proxy Features
The key feature of a caching proxy is its ability to store responses for later use. This is what saves you time and bandwidth. Caching proxies actually tend to have a wide range of additional features that many organizations find valuable. Most of these are things you can do only with a proxy but which have relatively little to do with caching. For example, if you want to authenticate your users, but don't care about caching, you might use a caching proxy product anyway.
- A proxy can require users to authenticate themselves before it serves any requests. This is particularly useful for firewall proxies. When each user has a unique username and password, only authorized individuals can surf the Web from inside your network. Furthermore, it provides a higher quality audit trail in the event of problems.
- Request filtering
- Caching proxies are often used to filter requests from users. Corporations usually have policies that prohibit employees from viewing pornography at work. To help enforce the policy, the corporate proxy can be configured to deny requests to known pornographic sites. Request filtering is somewhat controversial. Some people equate it with censorship and correctly point out that filtering schemes are not perfect.
- Response filtering
- Prefetching is the process of retrieving some data before it is actually requested. Disk and memory systems typically use prefetching, also known as "read ahead." For the Web, prefetching usually involves requesting images and hyperlinked pages referenced in an HTML file. Prefetching represents a tradeoff between latency and bandwidth. A caching proxy selects objects to prefetch, assuming that a client will request them. Correct predictions result in a latency reduction; incorrect predictions, however, result in wasted bandwidth. So the interesting question is, how accurate are prefetching predictions? Unfortunately, good measurements are hard to come by. Companies with caching products that use prefetching are secretive about their algorithms.
- Translation and transcoding
- Translation and transcoding both refer to processes that change the content of something without significantly changing its meaning or appearance. For example, you can imagine an application that translates text pages from English to German as they are downloaded. Transcoding usually refers to low-level changes in digital data rather than high-level human languages. Changing an image file format from GIF to JPEG is a good example. Since the JPEG format results in a smaller file than GIF, they can be transferred faster. Applying general-purpose compression is another way to reduce transfer times. A pair of cooperating proxies can compress all transfers between them and uncompress the data before it reaches the clients.
- Traffic shaping
- A significant number of organizations use application layer proxies to control bandwidth utilization. In some sense, this functionality really belongs at the network layer, where it's possible to control the flow of individual packets. However, the application layer provides extra information that network administrators find useful. For example, the level of service for a particular request can be based on the user's identification, the agent making the request, or the type of data being requested (e.g., HTML, Postscript, MP3).
There are a number of situations where it's beneficial for caching proxies to talk to each other. There are different names for some different configurations. A cluster is a tightly coupled collection of caches, usually designed to appear as a single service. That is, even if there are seven systems in a cluster, to the outside world it looks like just one system. The members of a cluster are normally located together, both physically and topologically.
A loosely coupled collection of caches is called a hierarchy or mesh. If the arrangement is tree-like, with a clear distinction between upper- and lower-layer nodes, it is called a hierarchy. If the topology is flat or ill-defined, it is called a mesh. A hierarchy of caches make sense because the Internet itself is hierarchical. However, when a mesh or hierarchy spans multiple organizations, a number of issues arise.
List of caching products