Improving the ability of archival crawlers to capture your website will also tend to improve discovery of your content by search engine crawlers, enhance website performance by decreasing the load caused by robots and other clients, and save storage space.
Make links transparent
Represent web app states with links
The breadth and complexity of a modern web application is often belied by the paucity of unique web addresses that it presents in the course of its use. Building the application in such a way that distinct states are represented by distinct and fixed web addresses permits users with a shared link to bypass arbitrary interactions to get to the desired destination, facilitates more precise citation, and provides a more granular target for annotation. Critically, it also makes the website more accessible to both search engine and archival crawlers.
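As a minimal sketch of the idea, an application can serialize its view state into the query string and restore it from there. The base address and the state keys (`view`, `page`) are hypothetical and stand in for whatever your application tracks.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical base address for the application; not from any real site.
BASE_URL = "https://example.org/catalog"


def state_to_url(state: dict) -> str:
    """Serialize application state into a stable, shareable address."""
    # Sorting the keys means the same state always yields the same address.
    query = urlencode(sorted(state.items()))
    return f"{BASE_URL}?{query}" if query else BASE_URL


def url_to_state(url: str) -> dict:
    """Recover the application state from an address."""
    return dict(parse_qsl(urlsplit(url).query))
```

With this in place, `state_to_url({"view": "list", "page": "2"})` yields an address that a user, a citation, or a crawler can request directly, and `url_to_state` rebuilds the same state on arrival.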
Use one link for each resource
Every web resource is available through at least one web address. For archiving, it is additionally preferable that every web resource be available through no more than one web address. Archival crawlers often de-duplicate captured content based on a combination of web address and checksum. When either of those values varies from what was recorded in a previous crawl, the resource is considered new. Some content management systems allow for the same resource to be served using different web addresses, which will result in superfluous requests from crawlers and increased archival storage requirements.
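The effect of duplicate addresses can be illustrated with a simplified model of de-duplication. This is a sketch under the assumption that the key is the pair of web address and SHA-1 digest; it does not reproduce any particular crawler's implementation, and the addresses are hypothetical.

```python
import hashlib


def is_new_capture(seen: set, url: str, body: bytes) -> bool:
    """Return True if this (address, checksum) pair has not been recorded yet."""
    key = (url, hashlib.sha1(body).hexdigest())
    if key in seen:
        return False  # same address, same content: nothing new to store
    seen.add(key)
    return True


seen = set()
page = b"<html>About us</html>"

# First capture is stored; an unchanged re-crawl of the same address is not.
first = is_new_capture(seen, "https://example.org/about", page)
repeat = is_new_capture(seen, "https://example.org/about", page)

# Identical content under a second address counts as a new resource.
alias = is_new_capture(seen, "https://example.org/index.php?page=about", page)
```

Here `first` and `alias` are both true while `repeat` is false: the alias address costs an extra request and an extra stored copy even though the content is identical.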
Be careful with robots directives
You may already use the robots exclusion standard to convey machine-readable preferences to search engine crawlers. Most web archiving initiatives obey these instructions, at least conditionally. Directives that have historically been appropriate for search engine crawlers - e.g., excluding directories that contain scripts, styles, and layout instructions - are becoming less so. These exclusions have long been problematic in the archiving context, as they may prevent the capture of assets that are essential to faithfully re-presenting the archived website.
Beyond not discouraging crawlers from visiting vital resources, you can employ the robots exclusion standard affirmatively to improve archiving efforts. Use a site-level robots.txt file to link to an XML sitemap or to specify a sustainable crawler request interval. Use a site-level robots.txt file, a page-level <meta> tag, or rel="nofollow" link attributes liberally to ward crawlers away from website sections that may programmatically generate an arbitrary number of links.
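A robots.txt file along these lines might look like the sketch below, checked here with Python's standard `urllib.robotparser`. The host name, the `/calendar/` path, and the five-second interval are hypothetical. Note that `Crawl-delay` and `Sitemap` are widely used extensions rather than part of the core exclusion protocol, so individual crawlers may or may not honor them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: scripts and styles stay crawlable, a section that
# generates links without end is excluded, and a sitemap is advertised.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /calendar/

Sitemap: https://example.org/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Presentation assets remain available to crawlers.
assets_allowed = parser.can_fetch("*", "https://example.org/assets/site.css")

# The programmatically generated section is off limits.
calendar_allowed = parser.can_fetch("*", "https://example.org/calendar/2031/01/")

# Requested interval between requests, in seconds.
delay = parser.crawl_delay("*")

# Sitemap addresses declared in the file (requires Python 3.8 or later).
sitemaps = parser.site_maps()
```

Parsing the file the way a crawler would is a quick check that a directive meant to block one section has not also blocked the scripts and stylesheets needed to re-present the site.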
Mind content license terms
Return reliable response codes
Implement caching enhancements
Web clients, including archival crawlers, take advantage of several HTTP response headers (Content-Length, Last-Modified, and ETag) to minimize requests for content that hasn't changed since it was last cached. Research suggests that server responses related to caching are not always reliable, yet the correct implementation of these HTTP headers will reduce superfluous requests from all types of clients.
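The server side of this exchange can be sketched as follows. This is a simplified illustration, not a complete implementation: the function names are hypothetical, the ETag is assumed to be a truncated SHA-256 digest of the body, and the If-None-Match comparison is an exact string match that ignores weak validators, lists of tags, and the `*` wildcard.

```python
import hashlib
from datetime import datetime, timezone
from email.utils import formatdate, parsedate_to_datetime


def make_validators(body: bytes, modified: datetime) -> dict:
    """Build the response headers a client can use to revalidate its copy."""
    return {
        # Assumption: a content hash serves as the entity tag.
        "ETag": '"' + hashlib.sha256(body).hexdigest()[:16] + '"',
        "Last-Modified": formatdate(modified.timestamp(), usegmt=True),
        "Content-Length": str(len(body)),
    }


def respond(request_headers: dict, body: bytes, modified: datetime):
    """Return (status, headers, payload), answering 304 when the copy is current.

    `modified` must be a timezone-aware datetime.
    """
    headers = make_validators(body, modified)
    if_none_match = request_headers.get("If-None-Match")
    if_modified_since = request_headers.get("If-Modified-Since")

    if if_none_match is not None:
        # When both are sent, If-None-Match takes precedence.
        unchanged = if_none_match == headers["ETag"]
    elif if_modified_since is not None:
        # HTTP dates have one-second resolution, so drop microseconds.
        since = parsedate_to_datetime(if_modified_since)
        unchanged = modified.replace(microsecond=0) <= since
    else:
        unchanged = False

    if unchanged:
        headers.pop("Content-Length")  # no payload accompanies a 304
        return 304, headers, b""
    return 200, headers, body
```

A crawler that stored the ETag or Last-Modified value from an earlier capture can send it back on the next visit; when the resource is unchanged, the 304 response spares both sides the transfer of the body.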
Minimize reliance on external assets necessary for presentation
Serve reusable assets from a common location
The key rationale for hosting some resources externally - performance - should also motivate serving reusable local assets from a single location. Content management systems sometimes instantiate each new sub-site with its own complement of the standard theme assets. Storing these in a common location referenced by each of the sub-sites allows for more efficient client caching, simultaneously improving website performance and archivability.