With the widespread use of content management and e-commerce platforms for building websites, it’s common for the same page content to be reachable via multiple distinct URLs. A standard SEO practice for ensuring that search engines index only the preferred page URL is the use of canonical URLs. A canonical URL can be considered the “source of truth” URL: no matter which variation of the content URL a web crawler discovers, the crawler is pointed towards a single, representative URL to be indexed.
When creating your custom search engine with Swiftype, you’ll find that our web crawler, Swiftbot, supports and adheres to canonical URL tag configurations.
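To make the idea of a canonical URL tag concrete, here is a minimal Python sketch of how a crawler might read one from a page's HTML. It is only an illustration and not Swiftbot's actual implementation; the names CanonicalParser and find_canonical, and the sample markup, are assumptions made for this example.

```python
from html.parser import HTMLParser


class CanonicalParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Only <link> tags matter, and only the first canonical one is kept.
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated values, or no value at all.
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr_map.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None if absent."""
    parser = CanonicalParser()
    parser.feed(html)
    return parser.canonical


# Hypothetical page markup, used for illustration only.
page = """
<html><head>
<link rel="canonical" href="https://www.example.com/automobiles/bmw/2016x3/">
</head><body>...</body></html>
"""
print(find_canonical(page))
```

A page with no canonical tag yields None, which a crawler would typically treat as "index this URL as-is".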
If you notice certain pages on your site are not being indexed, it could be an issue with the canonical URL configuration on that page. Here’s a general example:
Swiftbot visits the URL https://www.example.com/automobiles/bmw/2016x3/, but sees that the canonical URL set for that page (and every other page on the site) is simply the homepage URL, https://www.example.com/. Since the canonical URL is treated as the source of truth, Swiftbot follows the instruction literally and assumes that every page on the site is a copy of the homepage. The end result is that all pages on the site are seen as being intended for consolidation into a single search result.
A straightforward fix for this scenario would be to update the canonical URL tags in your site template to correctly reflect the current page URL, or to remove the canonical URL instruction altogether. Doing so will allow Swiftbot to index your site as intended.
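If you want to check your own site for the scenario above, one possible approach is to collect each page's canonical URL and group pages by the canonical they point to. The sketch below assumes you already have that mapping; the function name group_by_canonical, the crawl dictionary, and the second product URL are hypothetical and exist only for this example.

```python
from collections import defaultdict


def group_by_canonical(canonicals):
    """Group page URLs by the canonical URL they declare.

    `canonicals` maps page URL -> declared canonical URL (or None).
    Pages whose canonical is missing or self-referencing are skipped,
    since they would be indexed under their own URL.
    """
    groups = defaultdict(list)
    for page_url, canonical_url in canonicals.items():
        if canonical_url and canonical_url != page_url:
            groups[canonical_url].append(page_url)
    return dict(groups)


# Hypothetical crawl data: every page declares the homepage as canonical.
crawl = {
    "https://www.example.com/": "https://www.example.com/",
    "https://www.example.com/automobiles/bmw/2016x3/": "https://www.example.com/",
    "https://www.example.com/automobiles/audi/2016q5/": "https://www.example.com/",
}

for target, pages in group_by_canonical(crawl).items():
    print(f"{len(pages)} page(s) would be consolidated into {target}")
```

A single canonical target that absorbs most of a site's pages is a strong hint that the tag is hard-coded in a template rather than generated per page.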
A more nuanced variation of this same issue occurs when Swiftbot visits our example URL https://www.example.com/automobiles/bmw/2016x3/ and the canonical URL tag on that page is set to https://www.example.com/automobiles/bmw/2016x3 (with no forward slash at the end). However, when Swiftbot visits the defined canonical URL, it encounters a 301 redirect from the web server pointing it back to the original URL visited, https://www.example.com/automobiles/bmw/2016x3/.
Once back on this page, Swiftbot restarts the process of checking for canonical URL instructions and, as you can imagine, becomes trapped in a loop until the indexing attempt for that page eventually quits due to failure.
When this occurs, the page cannot be indexed.
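The loop described above can be reproduced offline with a small simulation. The following Python sketch is not how Swiftbot works internally; it assumes two plain dictionaries standing in for the site (one for canonical tags, one for 301 redirects), and the function name resolve_canonical is hypothetical.

```python
def resolve_canonical(start_url, canonicals, redirects):
    """Follow redirects and canonical tags starting from start_url.

    `canonicals` maps page URL -> canonical URL declared on that page.
    `redirects` maps URL -> URL the server redirects it to.
    Returns (final_url, chain); final_url is None when a loop is found.
    """
    chain = []
    url = start_url
    while url not in chain:
        chain.append(url)
        if url in redirects:
            # The server redirects this URL, so move on to its target.
            url = redirects[url]
            continue
        canonical = canonicals.get(url)
        if canonical is None or canonical == url:
            # No canonical tag, or a self-referencing one: index this URL.
            return url, chain
        url = canonical
    # The next URL was already visited, so the crawl would never settle.
    return None, chain + [url]


with_slash = "https://www.example.com/automobiles/bmw/2016x3/"
without_slash = "https://www.example.com/automobiles/bmw/2016x3"

# The page declares the slash-less URL as canonical...
canonicals = {with_slash: without_slash}
# ...but the server redirects the slash-less URL back to the original.
redirects = {without_slash: with_slash}

final_url, chain = resolve_canonical(with_slash, canonicals, redirects)
if final_url is None:
    print("Loop detected:", " -> ".join(chain))
```

Making the canonical tag match the URL the server actually serves (here, the version with the trailing slash) removes the loop, and the same function then returns that URL as the one to index.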