Why Cache the Web? ~ Vinay's Blog

Sep 29, 2009

Why Cache the Web?

Cache, Clusters, HTTP server, Meshes, performance tuning, web cache, web Caching, web pages load faster, web profiling

The short answer is that caching saves money. It saves time as well, which is sometimes the same thing if you believe that "time is money." But how does caching save you money?

It does so by providing a more efficient mechanism for distributing information on the Web. Consider an example from our physical world: the distribution of books. Specifically, think about how a book gets from publisher to consumer. Publishers print the books and sell them, in large quantities, to wholesale distributors. The distributors, in turn, sell the books in smaller quantities to bookstores. Consumers visit the stores and purchase individual books. On the Internet, web caches are analogous to the bookstores and wholesale distributors.

The analogy is not perfect, of course. Books cost money; web pages (usually) don't. Books are physical objects, whereas web pages are just electronic and magnetic signals. It's difficult to copy a book, but trivial to copy electronic data.

The point is that both caches and bookstores enable efficient distribution of their respective contents. An Internet without caches is like a world without bookstores. Imagine 100,000 residents of India each buying one copy of a comic book directly from its publisher. Now imagine 50,000 Internet users in Australia each downloading the Yahoo! home page from the origin server every time they access it. It's much more efficient to transfer the page once, cache it, and then serve future requests directly from the cache.
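The "transfer once, then serve from the cache" idea can be sketched in a few lines of Python. This is a toy illustration, not how any real proxy is built: `SimpleCache` and `fake_origin` are hypothetical names, the origin server is simulated by a local function, and nothing ever expires.

```python
from typing import Callable, Dict


class SimpleCache:
    """Toy in-memory cache keyed by URL (illustrative only)."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin
        self._store: Dict[str, bytes] = {}
        self.origin_fetches = 0  # how many times we had to go to the origin

    def get(self, url: str) -> bytes:
        # Miss: retrieve the object from the origin once and keep a copy.
        if url not in self._store:
            self.origin_fetches += 1
            self._store[url] = self._fetch(url)
        # Hit (or freshly stored miss): serve the local copy.
        return self._store[url]


def fake_origin(url: str) -> bytes:
    """Stand-in for a real HTTP request to the origin server."""
    return f"<html>content of {url}</html>".encode()


cache = SimpleCache(fake_origin)
for _ in range(50_000):  # 50,000 requests for the same page
    cache.get("http://www.yahoo.com/")

print(cache.origin_fetches)
```

However many clients ask for the page, the origin is contacted only on the first request; every later request is answered locally.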

In order for caching to be effective, the following conditions must be met:

Client requests must exhibit locality of reference.
The cost of caching must be less than the cost of direct retrieval.

We can intuitively conclude that the first requirement is true. Certain web sites are very popular. Classic examples are the starting pages for Netscape and Microsoft browsers. Others include searching and indexing sites such as Yahoo! and AltaVista. Event-based sites, such as those for the Olympics, NASA's Mars Pathfinder mission, and World Cup Soccer, become extremely popular for days or weeks at a time. Finally, every individual has a few favorite pages that he or she visits on a regular basis.

It's not always obvious that the second requirement is true. We need to compare the costs of caching to the costs of not caching. Numerous factors enter into the analysis, some of which are easier to measure than others. To calculate the cost of caching, we can add up the costs for hardware, software, and staff time to administer the system. We also need to consider the time users save waiting for pages to load (latency) and the cost of Internet bandwidth.
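One rough way to frame that comparison is as simple arithmetic: add up what the cache costs per month, estimate what its hits save in bandwidth and user time, and compare. The sketch below does exactly that. Every number in it is made up for illustration, and the function and parameter names are my own assumptions, not a standard model.

```python
def monthly_savings(
    requests_per_month: int,
    hit_ratio: float,
    avg_object_kb: float,
    bandwidth_cost_per_gb: float,
    seconds_saved_per_hit: float,
    user_time_cost_per_hour: float,
) -> float:
    """Estimate the monthly value of cache hits (bandwidth plus user time)."""
    hits = requests_per_month * hit_ratio
    # Bytes that never cross the Internet link because they were served locally.
    bandwidth_saved_gb = hits * avg_object_kb / (1024 * 1024)
    bandwidth_savings = bandwidth_saved_gb * bandwidth_cost_per_gb
    # Time users no longer spend waiting, converted to money.
    time_savings = hits * seconds_saved_per_hit / 3600 * user_time_cost_per_hour
    return bandwidth_savings + time_savings


def caching_pays_off(monthly_cache_cost: float, savings: float) -> bool:
    """The second condition: caching must cost less than direct retrieval."""
    return savings > monthly_cache_cost


# Hypothetical figures, chosen only to exercise the functions.
hardware, software, staff = 100.0, 0.0, 400.0  # monthly cost of the cache
savings = monthly_savings(
    requests_per_month=1_000_000,
    hit_ratio=0.4,
    avg_object_kb=10.0,
    bandwidth_cost_per_gb=0.50,
    seconds_saved_per_hit=0.5,
    user_time_cost_per_hour=20.0,
)
print(caching_pays_off(hardware + software + staff, savings))
```

With these invented inputs almost all of the benefit comes from saved user time rather than bandwidth, which is one reason the result depends so heavily on how you value latency.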

Caching web content has three primary benefits:

To make web pages load faster (reduce latency)
To reduce wide area bandwidth usage
To reduce the load placed on origin servers

Types of Web Caches

Web content can be cached at a number of different locations along the path between a client and an origin server. First, many browsers and other user agents have built-in caches. For simplicity, I'll call these browser caches. Next, a caching proxy (a.k.a. "proxy cache") aggregates all of the requests from a group of clients. Lastly, a surrogate can be located in front of an origin server to cache popular responses.

Browser Caches

Browsers and other user agents benefit from having a built-in cache. When you press the Back button on your browser, it reads the previous page from its cache. Nongraphical agents, such as web crawlers, cache objects as temporary files on disk rather than keeping them in memory.

Netscape Navigator lets you control exactly how much memory and disk space to use for caching, and it also allows you to flush the cache. Microsoft Internet Explorer lets you control the size of your local disk cache, but in a less flexible way. Both have controls for how often cached responses should be validated. People generally use 10–100 MB of disk space for their browser cache.

A browser cache is limited to just one user, or at least one user agent. Thus, it gets hits only when the user revisits a page. On the other hand, browser caches can store "private" responses, which shared caches cannot.
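In HTTP, a response is marked private with the `Cache-Control: private` header directive. Here is a deliberately simplified sketch of how a cache might decide whether it may store a response. The function name `may_store` is hypothetical, and the parsing ignores many details of the real rules (quoted directive arguments, `Authorization`, status codes, and so on).

```python
def may_store(cache_control: str, shared_cache: bool) -> bool:
    """Return True if a cache of the given kind may store the response.

    Simplified: looks only at the no-store and private directives.
    """
    directives = {
        part.strip().split("=", 1)[0].lower()
        for part in cache_control.split(",")
        if part.strip()
    }
    if "no-store" in directives:
        return False  # nobody may store it
    if shared_cache and "private" in directives:
        return False  # meant for a single user only
    return True


# A private response: fine for a browser cache, off limits for a shared proxy.
print(may_store("private, max-age=60", shared_cache=False))
print(may_store("private, max-age=60", shared_cache=True))
```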

Caching Proxies

Caching proxies, unlike browser caches, service many different users at once. Since many different users visit the same popular web sites, caching proxies usually have higher hit ratios than browser caches. As the number of users increases, so does the hit ratio.
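A small, idealized simulation shows why sharing helps. Assume every user requests the same short list of popular pages exactly once, the cache is unlimited, and nothing expires. Real traffic is far messier, so treat the numbers as an illustration of the trend only. The function name is hypothetical.

```python
from typing import Iterable, Set


def shared_hit_ratio(num_users: int, pages: Iterable[str]) -> float:
    """Hit ratio of one shared cache when each user requests each page once."""
    page_list = list(pages)
    cached: Set[str] = set()
    hits = 0
    total = 0
    for _user in range(num_users):
        for page in page_list:
            total += 1
            if page in cached:
                hits += 1          # someone else already fetched it
            else:
                cached.add(page)   # first request overall: a miss
    return hits / total if total else 0.0


popular = ["/", "/news", "/sports"]
for users in (1, 10, 100):
    print(users, shared_hit_ratio(users, popular))
```

In this model a lone user never gets a hit from the shared cache, because each page is requested only once, while with N users the hit ratio is (N - 1) / N and approaches 1 as N grows.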

Caching proxies are essential services for many organizations, including ISPs, corporations, and schools. They usually run on dedicated hardware, which may be an appliance or a general-purpose server, such as a Unix or Windows NT system. Many organizations use inexpensive PC hardware that costs less than $1,000 (about Rs. 45,000). At the other end of the spectrum, some organizations pay hundreds of thousands of dollars (millions of rupees), or more, for high-performance solutions from one of the many caching vendors.

Caching proxies are normally located near network gateways (i.e., routers) on the organization's side of its Internet connection. In other words, a cache should be located to maximize the number of clients that can use it, but it should not be on the far side of a slow, congested network link.

Caching Proxy Features

The key feature of a caching proxy is its ability to store responses for later use. This is what saves you time and bandwidth. Caching proxies actually tend to have a wide range of additional features that many organizations find valuable. Most of these are things you can do only with a proxy but which have relatively little to do with caching. For example, if you want to authenticate your users, but don't care about caching, you might use a caching proxy product anyway.

Authentication: A proxy can require users to authenticate themselves before it serves any requests. This is particularly useful for firewall proxies. When each user has a unique username and password, only authorized individuals can surf the Web from inside your network. Furthermore, it provides a higher quality audit trail in the event of problems.
Request filtering: Caching proxies are often used to filter requests from users. Corporations usually have policies that prohibit employees from viewing pornography at work. To help enforce the policy, the corporate proxy can be configured to deny requests to known pornographic sites. Request filtering is somewhat controversial. Some people equate it with censorship and correctly point out that filtering schemes are not perfect.
Response filtering: In addition to filtering requests, proxies can also filter responses. This usually involves checking the contents of an object as it is being downloaded. A filter that checks for software viruses is a good example. Some organizations use proxies to filter out Java and JavaScript code, even when it is embedded in an HTML file. I've also heard about software that attempts to prevent access to pornography by searching images for a high percentage of flesh-tone pixels (for example, see http://www.heartsoft.com, http://www.eye-t.com, and http://www.thebair.com).
Prefetching: Prefetching is the process of retrieving some data before it is actually requested. Disk and memory systems typically use prefetching, also known as "read ahead." For the Web, prefetching usually involves requesting images and hyperlinked pages referenced in an HTML file. Prefetching represents a tradeoff between latency and bandwidth. A caching proxy selects objects to prefetch, assuming that a client will request them. Correct predictions result in a latency reduction; incorrect predictions, however, result in wasted bandwidth. So the interesting question is, how accurate are prefetching predictions? Unfortunately, good measurements are hard to come by. Companies with caching products that use prefetching are secretive about their algorithms.
Translation and transcoding: Translation and transcoding both refer to processes that change the content of something without significantly changing its meaning or appearance. For example, you can imagine an application that translates text pages from English to German as they are downloaded. Transcoding usually refers to low-level changes in digital data rather than high-level human languages. Changing an image file format from GIF to JPEG is a good example. Since JPEG typically produces a smaller file than GIF for photographic images, the converted image can be transferred faster. Applying general-purpose compression is another way to reduce transfer times. A pair of cooperating proxies can compress all transfers between them and uncompress the data before it reaches the clients.
Traffic shaping: A significant number of organizations use application layer proxies to control bandwidth utilization. In some sense, this functionality really belongs at the network layer, where it's possible to control the flow of individual packets. However, the application layer provides extra information that network administrators find useful. For example, the level of service for a particular request can be based on the user's identification, the agent making the request, or the type of data being requested (e.g., HTML, PostScript, MP3).
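To make the prefetching feature above a little more concrete, here is a minimal sketch of the first step: scanning an HTML page for the images and hyperlinked pages a proxy might consider fetching ahead of time. It uses only Python's standard library. The class and function names are hypothetical, and a real product would add its own prediction logic to decide which candidates are actually worth the bandwidth.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple
from urllib.parse import urljoin


class PrefetchCandidates(HTMLParser):
    """Collect absolute URLs of <img src> and <a href> references."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.urls: List[str] = []

    def handle_starttag(
        self, tag: str, attrs: List[Tuple[str, Optional[str]]]
    ) -> None:
        attr_map = dict(attrs)
        if tag == "img" and attr_map.get("src"):
            self.urls.append(urljoin(self.base_url, attr_map["src"]))
        elif tag == "a" and attr_map.get("href"):
            self.urls.append(urljoin(self.base_url, attr_map["href"]))


def prefetch_candidates(base_url: str, html: str) -> List[str]:
    """Return the URLs referenced by a page, in document order."""
    parser = PrefetchCandidates(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls


page = '<html><body><img src="/logo.gif"><a href="news.html">News</a></body></html>'
print(prefetch_candidates("http://example.com/index.html", page))
```

Every URL in the returned list is a prediction. Fetching one that the client later requests reduces latency, while fetching one that is never requested wastes bandwidth.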

Meshes, Clusters, and Hierarchies

There are a number of situations where it's beneficial for caching proxies to talk to each other. Different configurations go by different names. A cluster is a tightly coupled collection of caches, usually designed to appear as a single service. That is, even if there are seven systems in a cluster, to the outside world it looks like just one system. The members of a cluster are normally located together, both physically and topologically.

A loosely coupled collection of caches is called a hierarchy or mesh. If the arrangement is tree-like, with a clear distinction between upper- and lower-layer nodes, it is called a hierarchy. If the topology is flat or ill-defined, it is called a mesh. A hierarchy of caches makes sense because the Internet itself is hierarchical. However, when a mesh or hierarchy spans multiple organizations, a number of issues arise.
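A two-level hierarchy can be sketched as caches that forward their misses to a parent, with only the top of the tree talking to the origin server. This is a toy model under my own assumptions: the names `CacheNode` and `fake_origin` are hypothetical, the origin is simulated, and real caches typically coordinate through dedicated inter-cache protocols that are not shown here.

```python
from typing import Callable, Dict, Optional, Tuple


class CacheNode:
    """A cache that forwards misses to its parent, or to the origin at the root."""

    def __init__(
        self,
        name: str,
        parent: Optional["CacheNode"] = None,
        origin: Optional[Callable[[str], bytes]] = None,
    ) -> None:
        if parent is None and origin is None:
            raise ValueError("a root cache needs an origin fetch function")
        self.name = name
        self.parent = parent
        self.origin = origin
        self.store: Dict[str, bytes] = {}

    def get(self, url: str) -> Tuple[bytes, str]:
        """Return (body, where the body was found)."""
        if url in self.store:
            return self.store[url], self.name
        if self.parent is not None:
            body, source = self.parent.get(url)      # ask the upper layer
        else:
            body, source = self.origin(url), "origin"  # top of the tree
        self.store[url] = body                        # keep a copy on the way down
        return body, source


origin_calls = []


def fake_origin(url: str) -> bytes:
    origin_calls.append(url)
    return b"page body"


parent = CacheNode("parent", origin=fake_origin)
child_a = CacheNode("child-a", parent=parent)
child_b = CacheNode("child-b", parent=parent)

url = "http://example.com/"
print(child_a.get(url)[1])  # first request anywhere in the hierarchy
print(child_b.get(url)[1])  # a sibling's miss is absorbed by the parent
print(child_a.get(url)[1])  # a repeat request is served locally
```

The second child never contacts the origin: its miss is satisfied by the parent, which is the benefit a hierarchy adds on top of each individual cache.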

List of caching products

Squid

http://www.squid-cache.org

Squid is an open source software package that runs on a wide range of Unix platforms. There has also been some recent success in porting Squid to Windows NT. As with most free software, users receive technical support from a public mailing list. Squid was originally derived from the Harvest project in 1996.

Netscape Proxy Server

http://home.netscape.com/proxy/v3.5/index.html

The Netscape Proxy Server was the first caching proxy product available. The lead developer, Ari Luotonen, also worked extensively on the CERN HTTP server during the Web's formative years in 1993 and 1994. Netscape's Proxy runs on a handful of Unix systems, as well as Windows NT.

Microsoft Internet Security and Acceleration Server

http://www.microsoft.com/isaserver/

Microsoft currently has two caching proxy products available. The older Proxy Server runs on Windows NT, while the newer ISA product requires Windows 2000.

Volera

http://www.volera.com

Volera is a recent spin-off of Novell. The product formerly known as Internet Caching System (ICS) is now called Excelerator. Volera does not sell this product directly. Rather, it is bundled on hardware appliances available from a number of OEM partners.

Network Appliance Netcache

http://www.netapp.com/products/netcache/

Network Appliance was the second company to sell a caching proxy, and the first to sell an appliance. The Netcache products also have roots in the Harvest project.

Inktomi Traffic Server

http://www.inktomi.com/products/network/traffic/

Inktomi boasts some of the largest customer installations, such as America Online and Exodus. Their Traffic Server product has been available since 1997.

CacheFlow

http://www.cacheflow.com

Intelligent prefetching and refreshing features distinguish CacheFlow from its competitors.

InfoLibria

http://www.infolibria.com

InfoLibria's products are designed for high reliability and fault tolerance.

Cisco Cache Engine

http://www.cisco.com/go/cache/

The Cisco 500 series Cache Engine is a small, low-profile system designed to work with Cisco's Web Cache Communication Protocol (WCCP). As your demand for capacity increases, you can easily add more units.

Lucent imminet WebCache

http://www.lucent.com/serviceprovider/imminet/

Lucent's products offer carrier-grade reliability and active refresh features.

iMimic DataReactor

http://www.imimic.com

iMimic is a relative newcomer to this market. However, its DataReactor product is already licensed to a number of OEM partners. iMimic also sells the product directly.
