Oct 22, 2009

Fighting the web robots

Internet bots have become a major problem. These bots are used to register for e-mail addresses that are later used to send unwanted ads, or spam, to e-mail users. CAPTCHA is a standard security technology for keeping them out. The most widely used CAPTCHAs rely on sophisticated distortion of text images, rendering them unrecognizable to state-of-the-art pattern recognition techniques, and these text-based schemes are widely deployed on commercial websites.


The term "CAPTCHA" (based upon the word "capture") was coined in 2000 by Luis von Ahn, Manuel Blum, Nicholas J. Hopper and John Langford. It is a contrived acronym for "Completely Automated Public Turing test to tell Computers and Humans Apart."

A CAPTCHA system is a means of automatically generating new challenges which:

  • Current software is unable to solve accurately.
  • Most humans can solve.

Note: Visually disabled users who rely on screen-reading technology cannot solve a visual CAPTCHA, which limits or prevents their access to some sites.

This paper describes how a website can be secured with CAPTCHA, which asks visitors to prove that they are real people and not spambots or other computerized agents trawling the Web for exploits.


In 1997 AltaVista sought ways to block or discourage the automatic submission of URLs to its search engine. This free "add-URL" service was important to AltaVista since it broadened its search coverage. Yet some users were abusing the service by automating the submission of large numbers of URLs, in an effort to skew AltaVista's importance ranking algorithms.

Andrei Broder, Chief Scientist of AltaVista, and his colleagues developed a filter. Their method was to randomly generate an image of printed text that machine vision (OCR) systems could not read but humans still could. A U.S. patent was issued in April 2001. In January 2002 Broder stated that the system had been in use for "over a year" and had reduced the number of "spam add-URL" submissions by "over 95%."

Yahoo's Chat Room Problem

In September 2000, Udi Manber of Yahoo described this "chat room problem" to researchers at CMU: 'bots' were joining on-line chat rooms and irritating the people there, by pointing them to advertising sites. How could all 'bots' be refused entry to chat rooms?

CMU's Prof. Manuel Blum, Luis A. von Ahn, and John Langford articulated some desirable properties of a test, including:

  • the test's challenges can be automatically generated and graded 
  • the test can be taken quickly and easily by human users
  • the test will accept virtually all human users with high reliability while rejecting very few
  • the test will reject virtually all machine users
  • the test will resist automatic attack for many years even as technology advances
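The first property, automatic generation and grading, can be illustrated with a minimal Python sketch. It is only an assumption about how such a test might be wired up, not the CMU design: the names `generate_challenge`, `grade_response`, `SECRET_KEY` and the 120-second lifetime are all hypothetical. The answer would be shown to the visitor only as a distorted image, while the signed token travels with the form so the server can grade the reply without storing anything.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional

# Hypothetical server-side secret; a real site would load it from configuration.
SECRET_KEY = secrets.token_bytes(32)
# Hypothetical alphabet that leaves out look-alike characters such as O/0 and I/1.
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"


def _sign(answer: str, issued_at: int) -> str:
    """Bind an answer to its issue time with an HMAC only the server can compute."""
    message = f"{answer.upper()}|{issued_at}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def generate_challenge(length: int = 6, now: Optional[int] = None) -> dict:
    """Generate a random answer (to be rendered as a distorted image) and its token."""
    issued_at = int(time.time()) if now is None else now
    answer = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return {"answer": answer, "issued_at": issued_at, "token": _sign(answer, issued_at)}


def grade_response(response: str, issued_at: int, token: str,
                   max_age: int = 120, now: Optional[int] = None) -> bool:
    """Grade a reply automatically: it must be recent and match the signed answer."""
    now = int(time.time()) if now is None else now
    if now - issued_at > max_age:
        return False  # challenge expired
    expected = _sign(response.strip(), issued_at)
    return hmac.compare_digest(expected, token)
```

A stateless token like this does not by itself stop a solved challenge from being replayed within its lifetime; a real deployment would also need to remember which tokens were already used.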

CMU's CAPTCHA Research

The CMU team developed a 'hard' GIMPY CAPTCHA which picked English words at random and rendered them as images of printed text under a wide variety of shape deformations and image occlusions, the word images often overlapping. The user was asked to transcribe some number of the words correctly.

A simplified version of GIMPY (EZ GIMPY), using only one word-image at a time, was installed by Yahoo, and is currently in use in their chat rooms to restrict access to only human users.
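The selection and grading logic described for GIMPY and EZ GIMPY can be sketched in Python as follows. This is a rough illustration only: the word list, the function names, and the numbers (seven words shown, three required) are assumptions, and the image rendering with deformations and occlusions is left out entirely.

```python
import random

# Hypothetical stand-in for the English dictionary the test would draw from.
WORDS = ["river", "candle", "mirror", "garden", "planet",
         "window", "basket", "thunder", "pencil", "orange"]


def make_gimpy_challenge(num_words=7, rng=None):
    """Pick distinct words at random; each would be rendered as a distorted image."""
    rng = rng or random.Random()
    return rng.sample(WORDS, num_words)


def grade_gimpy(shown_words, typed_words, required=3):
    """Pass if the user transcribed at least `required` of the shown words."""
    shown = {w.lower() for w in shown_words}
    typed = {w.strip().lower() for w in typed_words}
    return len(shown & typed) >= required


def make_ez_gimpy_challenge(rng=None):
    """The simplified variant: a single word-image at a time."""
    return make_gimpy_challenge(num_words=1, rng=rng)
```

For the single-word variant, grading would be called with `required=1`.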


Why would anyone need to create a test that can tell humans and computers apart?

  • Let’s consider a scenario where a program is developed to create several free e-mail accounts.

Such a program can affect millions of people across the world. The e-mail accounts can be used to spam millions of genuine users, send links to advertisements, and so on. If a CAPTCHA is present when registering for a free e-mail account, it becomes possible to distinguish whether the user is a human or a bot. The same applies to many other Internet registrations.

  • Even bookings for concerts can be made to the tune of thousands with the help of computer programs.

CAPTCHAs make this kind of bulk automation much harder.

Sometimes a CAPTCHA will fail to recognise these bots, but every CAPTCHA failure is really an advance in artificial intelligence.

  • Spammers post links back to their sites in order to increase the number of inbound links and make the sites rank higher in search engines. This is called comment spam and can be reduced through the use of a CAPTCHA.





  • Visual Types

The CAPTCHA image can also be customized with our own color scheme, image style (text alignment and decoration) and text font.







Example styles (the images are not reproduced here): Inclined & Dashed, Horizontal Line, and Mixed & Dashed. Choosing different colors or different fonts produces further variations.


  • Audio Types

This type is intended for the visually impaired and is commonly offered alongside a visual CAPTCHA.

  • Problem solving Types

These require you to solve a problem that should be easy for a person but very hard for a computer, such as choosing which item in a list is not a bird. The drawback is that you need a large number of questions before this approach really becomes effective.
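A small Python sketch makes the size problem concrete. The category bank, the function names and the list size are all hypothetical; the sketch builds "which item is not a bird" style puzzles and counts how many distinct puzzles a given bank can produce.

```python
import math
import random

# Hypothetical question bank; a real one would need to be far larger.
CATEGORIES = {
    "bird": ["sparrow", "eagle", "owl", "pigeon", "parrot"],
    "fruit": ["apple", "banana", "cherry", "mango", "grape"],
    "tool": ["hammer", "wrench", "saw", "drill", "chisel"],
}


def make_odd_one_out(rng=None, list_size=4):
    """Build a puzzle: list_size - 1 items from one category plus one outsider."""
    rng = rng or random.Random()
    category, outsider_category = rng.sample(sorted(CATEGORIES), 2)
    members = rng.sample(CATEGORIES[category], list_size - 1)
    outsider = rng.choice(CATEGORIES[outsider_category])
    options = members + [outsider]
    rng.shuffle(options)
    return {
        "question": f"Which item is not a {category}?",
        "category": category,
        "options": options,
        "answer": outsider,
    }


def check_answer(challenge, reply):
    """Grade the reply, ignoring case and surrounding whitespace."""
    return reply.strip().lower() == challenge["answer"]


def count_distinct_challenges(list_size=4):
    """Count distinct puzzles (ignoring option order) the bank can generate."""
    total = 0
    for category, items in CATEGORIES.items():
        for other, other_items in CATEGORIES.items():
            if other != category:
                total += math.comb(len(items), list_size - 1) * len(other_items)
    return total
```

With this three-category bank, `count_distinct_challenges()` gives only 300 puzzles, few enough that an attacker could collect and memorize every one.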


  • Registration forms on Web sites often use CAPTCHAs. For example, free Web-based e-mail services like Hotmail, Yahoo! Mail or Gmail allow people to create an e-mail account free of charge. Usually, users must provide some personal information when creating an account, but the services typically don't verify this information. They use CAPTCHAs to try to prevent spammers from using bots to generate hundreds of spam mail accounts.
  • Ticket brokers like TicketMaster also use CAPTCHA applications. These applications help prevent ticket scalpers from bombarding the service with massive ticket purchases for big events. Without some sort of filter, it's possible for a scalper to use a bot to place hundreds or thousands of ticket orders in a matter of seconds. Legitimate customers become victims as events sell out minutes after tickets become available. Scalpers then try to sell the tickets above face value. While CAPTCHA applications don't prevent scalping, they do make it more difficult to scalp tickets on a large scale.
  • Some Web pages have message boards or contact forms that allow visitors to either post messages to the site or send them directly to the Web administrators. To prevent an avalanche of spam, many of these sites have a CAPTCHA program to filter out the noise. It will help prevent bots from posting messages automatically.
  • The most common form of CAPTCHA requires visitors to type in a word or series of letters and numbers that the application has distorted in some way. Some CAPTCHA creators came up with a way to increase the value of such an application: digitizing books. An application called reCAPTCHA harnesses users' responses in CAPTCHA fields to verify the contents of a scanned piece of paper. Because computers aren't always able to identify words from a digital scan, humans have to verify what a printed page says. Then it's possible for search engines to search and index the contents of a scanned document.
  • Here's how it works: First, the administrator of the reCAPTCHA program digitally scans a book. Then, the reCAPTCHA program selects two words from the digitized image. The application already recognizes one of the words. If the visitor types that word into a field correctly, the application assumes the second word the user types is also correct. That second word goes into a pool of words that the application will present to other users. As each user types in a word, the application compares the word to the original answer. Eventually, the application receives enough responses to verify the word with a high degree of certainty. That word can then go into the verified pool.
  • It sounds time consuming, but in this case the CAPTCHA is pulling double duty. It is not only verifying the contents of a digitized book, it's also verifying that the people filling out the form are actually people. In turn, those people gain access to the service they want to use.
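The two-word verification loop described above can be sketched roughly in Python. This is not reCAPTCHA's actual code or API: the class `WordVerifier`, its method names and the `votes_needed` threshold are assumptions made for illustration.

```python
from collections import Counter, defaultdict


class WordVerifier:
    """Toy model of the two-word scheme: one known control word, one unknown word."""

    def __init__(self, known_words, votes_needed=3):
        # known_words maps a scanned-word id to its already recognized text.
        self.known = dict(known_words)
        # For each unknown word id, count how often each transcription was typed.
        self.votes = defaultdict(Counter)
        # Unknown words that have collected enough matching transcriptions.
        self.verified = {}
        self.votes_needed = votes_needed

    def submit(self, known_id, known_typed, unknown_id, unknown_typed):
        """Return True if the visitor passes; record their guess for the unknown word."""
        if known_typed.strip().lower() != self.known[known_id].lower():
            return False  # control word wrong: reject and ignore the second word
        guess = unknown_typed.strip().lower()
        self.votes[unknown_id][guess] += 1
        if (unknown_id not in self.verified
                and self.votes[unknown_id][guess] >= self.votes_needed):
            self.verified[unknown_id] = guess  # enough agreement: move to verified pool
        return True
```

The threshold trades speed for certainty: a higher `votes_needed` means more visitors must agree on a transcription before it is accepted.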


  • One of the biggest drawbacks of CAPTCHA is that it relies on visual perception.

Users who cannot view a CAPTCHA, whether because of a disability or because the words are too hard to read, will struggle with it and may find it hard to access the websites that use CAPTCHAs for authentication.

Therefore it is suggested that sites using visual CAPTCHAs should also implement audio CAPTCHAs.

  • However even with audio and visual CAPTCHAs some users may require help (users with both hearing and visual disabilities).

There have been attempts at creating CAPTCHAs that are more accessible including mathematical questions, general questions etc.

  • However, none of these attempts meet both the criteria of being able to be automatically generated and not relying on the type of CAPTCHA being new to the attacker. Therefore, they are not CAPTCHAs and do not provide the protection that true CAPTCHAs provide.


CAPTCHA is a suitable technique to provide security and authenticate real users. However, we do not live in a perfect world where all users are capable of handling these CAPTCHAs. Some users have disabilities and may be troubled by their presence.

Sites with attractive resources and millions of users will always have a need for access control systems that limit widespread abuse. At that level, it is reasonable to employ many concurrent approaches, including audio and visual CAPTCHA, to do so. However, it must be noted that human users will fall through the cracks in these systems, and it will be necessary for sites like these to ensure that users with disabilities will have some human-operated means of interacting with a given resource in a reasonable amount of time.

An explicitly inaccessible access control mechanism should not be promoted as a solution, especially when other systems exist that are not only more accessible, but may be more effective, as well. It is strongly recommended that smaller sites adopt spam filtering and/or heuristic checks in place of CAPTCHA.
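As a rough idea of what such heuristic checks could look like on a small site, here is a Python sketch. Everything in it is an assumption for illustration: the hidden "website" honeypot field, the `comment` field name, and the thresholds for fill time and link count are hypothetical and would need tuning.

```python
import re

# Matches the start of an http or https link inside submitted text.
URL_PATTERN = re.compile(r"https?://", re.IGNORECASE)


def looks_like_bot(form, seconds_to_submit, min_seconds=3.0, max_links=2):
    """Apply a few cheap heuristics to a submitted form (a dict of field values)."""
    # Honeypot: "website" is a hypothetical field hidden from humans with CSS,
    # so any value in it was probably filled in by an automated script.
    if form.get("website", "").strip():
        return True
    # Humans rarely complete and submit a form within a couple of seconds.
    if seconds_to_submit < min_seconds:
        return True
    # Comment spam tends to carry many links.
    if len(URL_PATTERN.findall(form.get("comment", ""))) > max_links:
        return True
    return False
```

Unlike a CAPTCHA, none of these checks asks the visitor to do anything, so they pose no extra barrier to users with disabilities. A flagged submission is better held for review than silently discarded.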

Lastly, new approaches should be found where the human users with disabilities can authenticate themselves. A short-term security benefit is not worth threatening a person's autonomy by denying them access to such important data as their finances.





Bots: Web robots (WWW robots, or bots) are software applications that run automated tasks over the Internet.

Internet: A global network of interconnected computers, enabling users to share information along multiple channels.

reCAPTCHA: An application which harnesses users' responses in CAPTCHA fields to verify the contents of a scanned piece of paper.

CAPTCHA: Completely Automated Public Turing test to tell Computers and Humans Apart.

OCR: Optical Character Recognition.

URL: Uniform Resource Locator.



Copyright © Vinay's Blog | Powered by Blogger
