Generating a Sitemap for your Rails sites

According to Wikipedia, a sitemap is "a list of pages of a web site accessible to crawlers or users." Sitemaps are completely optional, but Google uses the one on your site to learn about its structure. This can increase how much of your site Google and other search engines crawl.

SitemapGenerator

While you can build a sitemap yourself with XML Builder or by handcrafting an XML file, I prefer using the sitemap_generator gem. Its greatest benefit is that it is built to adhere to the Sitemap 0.9 protocol. It handles regular links and also supports news, video, image, mobile and geo sitemaps. SitemapGenerator also provides Ruby on Rails integration out of the box.

To get started, add the following to your Gemfile:

gem 'sitemap_generator'

After running bundle install, generate the starter configuration file in your Ruby on Rails project with the following rake task:

bundle exec rake sitemap:install

Creating your Sitemap configuration

SitemapGenerator reads its configuration from config/sitemap.rb. Here is a breakdown of the settings:

The search engines reading your sitemap need to know what website they are dealing with. Set default_host to your root website URL.

SitemapGenerator::Sitemap.default_host = 'http://www.yoursite.com'

SitemapGenerator comes with multiple adapters that will more than likely suit your needs. If you already have CarrierWave set up in your project, the SitemapGenerator::WaveAdapter uses your existing settings. If CarrierWave is not being used, you can always fall back to the SitemapGenerator::S3Adapter. Set your adapter through the adapter configuration setting.

SitemapGenerator::Sitemap.adapter = SitemapGenerator::WaveAdapter.new

Since we are hosting our sitemap remotely, we need to set sitemaps_host. An example of this would be "http://YOUR_BUCKET.s3.amazonaws.com/". I personally set this to an environment variable SITEMAP_HOST.

SitemapGenerator::Sitemap.sitemaps_host = ENV['SITEMAP_HOST']
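
Because the host now comes from the environment, a missing or malformed SITEMAP_HOST would otherwise only surface when the sitemap is uploaded. Here is a minimal sketch of a fail-fast check, using only the Ruby standard library. The helper name sitemap_host_from_env and the example bucket URL are hypothetical and not part of SitemapGenerator.

```ruby
require 'uri'

# Hypothetical helper (not part of sitemap_generator): reads the sitemap host
# from the environment and fails fast when it is missing or is not an
# absolute http(s) URL.
def sitemap_host_from_env(env = ENV, key = 'SITEMAP_HOST')
  host = env.fetch(key) { raise KeyError, "#{key} is not set" }
  uri = URI.parse(host)
  # URI::HTTPS is a subclass of URI::HTTP, so both schemes pass this check.
  unless uri.is_a?(URI::HTTP) && uri.host
    raise ArgumentError, "#{key} must be an absolute http(s) URL, got #{host.inspect}"
  end
  host
end

# Example with a placeholder bucket name passed in as a plain Hash.
puts sitemap_host_from_env('SITEMAP_HOST' => 'https://example-bucket.s3.amazonaws.com/')
```

In config/sitemap.rb you could then assign the result of sitemap_host_from_env to sitemaps_host instead of reading ENV directly.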

Set public_path to tmp/ so the sitemap files are written there before they are uploaded. This example assumes you are deploying to Heroku.

SitemapGenerator::Sitemap.public_path = 'tmp/'

To store your sitemaps in a particular directory on the remote host, set sitemaps_path.

SitemapGenerator::Sitemap.sitemaps_path = 'sitemaps/'

Once that is set up, you need to describe the structure of your site. The following example demonstrates a couple of options: specifying how often a page changes and indicating when a page was last modified.

SitemapGenerator::Sitemap.create do
  add '/contact_us', changefreq: 'weekly'
  Article.find_each do |article|
    add article_path(article), lastmod: article.updated_at
  end
end
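
To make the options above more concrete, here is a rough sketch of the kind of XML the Sitemap 0.9 protocol expects for those two entries. It is a hand-rolled illustration using only core Ruby, not how SitemapGenerator is implemented. The helper names sitemap_url_entry and sitemap_xml, the /articles/1 path and the timestamp are all made up for the example.

```ruby
require 'time' # adds Time#iso8601

# Hypothetical helper (not part of sitemap_generator): renders one <url>
# element. Only <loc> is required by the protocol; the rest is optional.
def sitemap_url_entry(host, path, changefreq: nil, lastmod: nil)
  lines = ['  <url>']
  # String#encode(xml: :text) escapes &, < and > for use inside XML text.
  lines << "    <loc>#{(host + path).encode(xml: :text)}</loc>"
  lines << "    <lastmod>#{lastmod.utc.iso8601}</lastmod>" if lastmod
  lines << "    <changefreq>#{changefreq}</changefreq>" if changefreq
  lines << '  </url>'
  lines.join("\n")
end

# Hypothetical helper: wraps the entries in the <urlset> root element.
def sitemap_xml(host, entries)
  body = entries.map { |path, opts| sitemap_url_entry(host, path, **opts) }
  [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    *body,
    '</urlset>'
  ].join("\n")
end

xml = sitemap_xml('http://www.yoursite.com', [
  ['/contact_us', { changefreq: 'weekly' }],
  ['/articles/1', { lastmod: Time.utc(2020, 1, 15, 12, 0, 0) }]
])
puts xml
```

The gem takes care of this for you, along with details such as splitting large sitemaps across files and compressing them.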

Finally, calling SitemapGenerator::Sitemap.ping_search_engines notifies search engines that they should crawl the site again.

Here is the completed config/sitemap.rb file:

# config/sitemap.rb
SitemapGenerator::Sitemap.default_host = 'http://www.yoursite.com'
SitemapGenerator::Sitemap.adapter = SitemapGenerator::WaveAdapter.new
SitemapGenerator::Sitemap.sitemaps_host = ENV['SITEMAP_HOST']
SitemapGenerator::Sitemap.public_path = 'tmp/'
SitemapGenerator::Sitemap.sitemaps_path = 'sitemaps/'


SitemapGenerator::Sitemap.create do
  add '/contact_us', changefreq: 'weekly'
  Article.find_each do |article|
    add article_path(article), lastmod: article.updated_at
  end
end

SitemapGenerator::Sitemap.ping_search_engines

robots.txt

In your public/robots.txt, set Sitemap to the URL of your remote sitemap index:

Sitemap: https://YOUR_BUCKET.s3.amazonaws.com/sitemaps/sitemap_index.xml.gz
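
That URL is simply sitemaps_host, sitemaps_path and the index filename joined together, so it is easy to derive rather than type by hand. Here is a small sketch using only core Ruby. The helper name remote_sitemap_url is hypothetical, and the filename sitemap_index.xml.gz is taken from the line above; the name your installed version of the gem actually writes may differ, so check the generated files.

```ruby
# Hypothetical helper (not part of sitemap_generator): joins the host, path
# and filename with exactly one slash between each part.
def remote_sitemap_url(sitemaps_host, sitemaps_path, filename)
  [sitemaps_host, sitemaps_path, filename]
    .map { |part| part.gsub(%r{\A/+|/+\z}, '') } # trim leading/trailing slashes
    .reject(&:empty?)
    .join('/')
end

url = remote_sitemap_url(
  'https://YOUR_BUCKET.s3.amazonaws.com/', # value of sitemaps_host
  'sitemaps/',                             # value of sitemaps_path
  'sitemap_index.xml.gz'                   # assumed index filename
)
robots_line = "Sitemap: #{url}"
puts robots_line
# prints "Sitemap: https://YOUR_BUCKET.s3.amazonaws.com/sitemaps/sitemap_index.xml.gz"
```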

Schedule Refresh

Once a day during your slowest traffic period, trigger a refresh via the included rake task:

bundle exec rake sitemap:refresh
