Web Bot Example
Overview
ScriptQ was designed to handle large numbers of jobs, and one of the most demanding tasks it was built to tackle is running bots. Derived from the word "robot," a bot is a software program that performs repetitive functions, such as indexing information on the Internet. Web bots are used to check for broken links, find missing images, verify accessibility compliance, manage Web site inventory and much more.
The broken link checker bot is written in VBScript and uses the HTTP component that ships with ScriptQ. To start crawling a Web site for broken links, download the VBScript bot. Open ScriptQ Monitor, select Tools > Add Job, paste the VBScript into the script field, modify the domain to crawl and submit the job as shown in the screenshot below.
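The only edit normally required before submitting is the start URL near the top of the script. The variable name shown here is an illustrative assumption; check the downloaded script for the actual line:

```vbscript
' Illustrative: the downloaded script hard-codes the site to crawl near the top.
startUrl = "http://www.example.com/"   ' change this to the domain you want to check
```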
The log file displays the HTTP status code for each link, the link URL, the referrer URL (the page where the link was found) and the link text (to help you find it on the page), as seen in the screenshot below. A status code of 200 means the link was retrieved successfully; any other code indicates a redirect or an error.
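For example, a log entry for a broken link might look like the following. The columns follow the order described above; the exact delimiters are illustrative, as the downloaded script defines the actual format:

```
404  http://www.example.com/old-page.html  http://www.example.com/index.html  Old Page
```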
This table describes the most common HTTP status codes. For a complete list, see http://www.w3.org/Protocols/rfc2616/rfc2616.html
Most Common HTTP Status Codes

Code | Name | Description
--- | --- | ---
200 | OK | The request has succeeded.
301 | Moved Permanently | The requested resource has been assigned a new permanent URL.
302 | Found | The requested resource resides temporarily under a different URL.
400 | Bad Request | The request could not be understood by the server due to malformed syntax.
401 | Unauthorized | The request requires user authentication.
403 | Forbidden | The server understood the request, but is refusing to fulfill it.
404 | Not Found | The server has not found anything matching the request URL.
408 | Request Timeout | The client did not produce a request within the time that the server was prepared to wait.
500 | Internal Server Error | The server encountered an unexpected condition which prevented it from fulfilling the request.
Inner Workings
When the first job is submitted, the script uses the hard-coded URL to start crawling. The script fetches a Web page and records the status in the log file. If the page was retrieved successfully (HTTP status code 200), the script creates a new job for each link on the page, passing the URL to crawl as an argument to the job. If a link points to a different domain, the script checks that link but does not follow any of its child links.
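To make the flow concrete, here is a minimal sketch of that crawl step. It substitutes the standard MSXML2.ServerXMLHTTP object for ScriptQ's bundled HTTP component, and the job-submission call is left as a commented placeholder, since the names of ScriptQ's scripting objects are not shown here; treat everything ScriptQ-specific as an assumption and consult the downloaded script for the real calls.

```vbscript
Option Explicit

' Determine the URL to crawl: the first job uses the hard-coded URL,
' child jobs receive theirs as a job argument.
Dim startUrl
If WScript.Arguments.Count > 0 Then
    startUrl = WScript.Arguments(0)
Else
    startUrl = "http://www.example.com/"   ' hard-coded start URL (illustrative)
End If

' Fetch the page. The real script uses ScriptQ's HTTP component; this
' sketch uses the standard MSXML object instead.
Dim http
Set http = CreateObject("MSXML2.ServerXMLHTTP")
http.Open "GET", startUrl, False
http.Send

' Record the status code and URL (the real script also logs the referrer
' and the link text).
WScript.Echo http.Status & " " & startUrl

If http.Status = 200 Then
    ' Extract href targets and queue a child job for each one.
    Dim re, matches, m
    Set re = New RegExp
    re.Pattern = "href\s*=\s*""([^""]+)"""
    re.IgnoreCase = True
    re.Global = True
    Set matches = re.Execute(http.responseText)
    For Each m In matches
        ' Hypothetical placeholder: submit a new job with the link as its
        ' argument. See the downloaded script for the actual call.
        ' SubmitJob m.SubMatches(0)
    Next
End If
```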
Depending on the size of the Web site and the number of threads ScriptQ is working with, the number of jobs can be in the thousands. This is normal and ScriptQ was designed for exactly this type of demanding situation.
Customization
The script is provided as an example of how to use ScriptQ to build a Web bot. You may need to modify the script to meet your specific needs.
Known Limitations
The HTTP component that ships with ScriptQ does not support SSL, so the link checker bot skips URLs that begin with https://.
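The skip amounts to a simple prefix test on each URL. A minimal sketch of such a check (the function name is an assumption, not taken from the shipped script):

```vbscript
' Skip secure URLs, since the bundled HTTP component cannot fetch them.
Function ShouldSkip(url)
    ShouldSkip = (LCase(Left(url, 8)) = "https://")
End Function
```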
If you want to crawl multiple domains at the same time, modify the script to store each processed URL in a database instead of Shared Memory. The broken link checker script clears everything from Shared Memory whenever a job is submitted without arguments, so concurrent crawls would wipe out each other's list of processed URLs.
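As a starting point, the Shared Memory bookkeeping could be replaced with a small database table. The sketch below uses ADO; the connection string and the ProcessedUrls table and column names are assumptions for illustration, not part of the shipped script.

```vbscript
' Hedged sketch: tracking processed URLs in a database instead of Shared
' Memory. Adapt the connection string and table name to your environment.
Dim conn
Set conn = CreateObject("ADODB.Connection")
conn.Open "Provider=SQLOLEDB;Data Source=.;Initial Catalog=Crawler;Integrated Security=SSPI;"

' Returns True if the URL has already been checked by any crawl.
Function AlreadyProcessed(url)
    Dim rs
    Set rs = conn.Execute("SELECT 1 FROM ProcessedUrls WHERE Url = '" & _
        Replace(url, "'", "''") & "'")
    AlreadyProcessed = Not rs.EOF
    rs.Close
End Function

' Records a URL so that other jobs (and other crawls) will skip it.
Sub MarkProcessed(url)
    conn.Execute "INSERT INTO ProcessedUrls (Url) VALUES ('" & _
        Replace(url, "'", "''") & "')"
End Sub
```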
FAQs
How Do I Stop The Crawl?
Stop the ScriptQ service and delete all the jobs from the "Queue" and "Pickup" folders.