Web Bot Example
Overview
ScriptQ was designed to handle large numbers of jobs, and one of the most demanding tasks it was built to tackle is running bots. Derived from the word "robot," a bot is a software program that performs repetitive functions, such as indexing information on the Internet. Web bots are used to check for broken links, find missing images, verify accessibility compliance, manage Web site inventory and much more.
The broken link checker bot is written in VBScript and uses the HTTP component that ships with ScriptQ. To start crawling a Web site for broken links, download the VBScript bot. Open ScriptQ Monitor, select Tools > Add Job, paste the VBScript into the script field, modify the domain to crawl and submit the job as shown in the screenshot below.
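The only edit normally required before submitting is the start URL near the top of the script. The variable name shown here is an illustrative assumption; check the downloaded script for the actual line:

```vbscript
' Illustrative: the downloaded script hard-codes the site to crawl near the top.
startUrl = "http://www.example.com/"   ' change this to the domain you want to check
```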
The log file displays the HTTP status code for each link, the link URL, the referrer URL (the page where the link was found) and the link text (to help you find it on the page), as seen in the screenshot below. A status code of 200 means the link was retrieved successfully; any other code indicates a redirect or an error.
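For example, a log entry for a broken link might look like the following. The columns follow the order described above; the exact delimiters are illustrative, as the downloaded script defines the actual format:

```
404  http://www.example.com/old-page.html  http://www.example.com/index.html  Old Page
```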
This table describes the most common HTTP status codes. For a complete list, see http://www.w3.org/Protocols/rfc2616/rfc2616.html
Most Common HTTP Status Codes

Code | Name | Description
--- | --- | ---
200 | OK | The request has succeeded.
301 | Moved Permanently | The requested resource has been assigned a new permanent URL.
302 | Found | The requested resource resides temporarily under a different URL.
400 | Bad Request | The request could not be understood by the server due to malformed syntax.
401 | Unauthorized | The request requires user authentication.
403 | Forbidden | The server understood the request, but is refusing to fulfill it.
404 | Not Found | The server has not found anything matching the request URL.
408 | Request Timeout | The client did not produce a request within the time that the server was prepared to wait.
500 | Internal Server Error | The server encountered an unexpected condition which prevented it from fulfilling the request.
Inner Workings
When the first job is submitted, the script uses the hard-coded URL to start crawling. The script fetches a Web page and records the status in the log file. If the page was retrieved successfully (HTTP status code 200), the script creates a new job for each link on the page, passing the URL to crawl as an argument to the job. If a link points to a different domain, the script checks that link but does not follow any of its child links.
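To make the flow concrete, here is a minimal sketch of that crawl step. It substitutes the standard MSXML2.ServerXMLHTTP object for ScriptQ's bundled HTTP component, and the job-submission call is left as a commented placeholder, since the names of ScriptQ's scripting objects are not shown here; treat everything ScriptQ-specific as an assumption and consult the downloaded script for the real calls.

```vbscript
Option Explicit

' Determine the URL to crawl: the first job uses the hard-coded URL,
' child jobs receive theirs as a job argument.
Dim startUrl
If WScript.Arguments.Count > 0 Then
    startUrl = WScript.Arguments(0)
Else
    startUrl = "http://www.example.com/"   ' hard-coded start URL (illustrative)
End If

' Fetch the page. The real script uses ScriptQ's HTTP component; this
' sketch uses the standard MSXML object instead.
Dim http
Set http = CreateObject("MSXML2.ServerXMLHTTP")
http.Open "GET", startUrl, False
http.Send

' Record the status code and URL (the real script also logs the referrer
' and the link text).
WScript.Echo http.Status & " " & startUrl

If http.Status = 200 Then
    ' Extract href targets and queue a child job for each one.
    Dim re, matches, m
    Set re = New RegExp
    re.Pattern = "href\s*=\s*""([^""]+)"""
    re.IgnoreCase = True
    re.Global = True
    Set matches = re.Execute(http.responseText)
    For Each m In matches
        ' Hypothetical placeholder: submit a new job with the link as its
        ' argument. See the downloaded script for the actual call.
        ' SubmitJob m.SubMatches(0)
    Next
End If
```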
Depending on the size of the Web site and the number of threads ScriptQ is working with, the number of jobs can be in the thousands. This is normal and ScriptQ was designed for exactly this type of demanding situation.
Customization
The script is provided as an example of how to use ScriptQ to build a Web bot. You may need to modify the script to meet your specific needs.
Known Limitations
The HTTP component that ships with ScriptQ does not support SSL, so the link checker bot skips URLs that begin with https://.
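The skip amounts to a simple prefix test on each URL. A minimal sketch of such a check (the function name is an assumption, not taken from the shipped script):

```vbscript
' Skip secure URLs, since the bundled HTTP component cannot fetch them.
Function ShouldSkip(url)
    ShouldSkip = (LCase(Left(url, 8)) = "https://")
End Function
```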
If you want to crawl multiple domains at the same time, modify the script to store each processed URL in a database instead of Shared Memory. The broken link checker script clears everything from Shared Memory whenever a job is submitted without arguments, so concurrent crawls would wipe out each other's list of processed URLs.
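As a starting point, the Shared Memory bookkeeping could be replaced with a small database table. The sketch below uses ADO; the connection string and the ProcessedUrls table and column names are assumptions for illustration, not part of the shipped script.

```vbscript
' Hedged sketch: tracking processed URLs in a database instead of Shared
' Memory. Adapt the connection string and table name to your environment.
Dim conn
Set conn = CreateObject("ADODB.Connection")
conn.Open "Provider=SQLOLEDB;Data Source=.;Initial Catalog=Crawler;Integrated Security=SSPI;"

' Returns True if the URL has already been checked by any crawl.
Function AlreadyProcessed(url)
    Dim rs
    Set rs = conn.Execute("SELECT 1 FROM ProcessedUrls WHERE Url = '" & _
        Replace(url, "'", "''") & "'")
    AlreadyProcessed = Not rs.EOF
    rs.Close
End Function

' Records a URL so that other jobs (and other crawls) will skip it.
Sub MarkProcessed(url)
    conn.Execute "INSERT INTO ProcessedUrls (Url) VALUES ('" & _
        Replace(url, "'", "''") & "')"
End Sub
```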
FAQs
How Do I Stop The Crawl?
Stop the ScriptQ service and delete all the jobs from the "Queue" and "Pickup" folders.