If you're translating your website to other languages, you'll likely want those pages to rank on global search engines too. This guide will walk you through a few things you'll need to consider regarding URL structures and publishing.
URL structure
Implementing a multi-language SEO strategy takes a significant amount of time and effort. Changing your URL structure after the initial implementation requires more effort and may negatively affect your current rankings. So it's important to pick a URL structure that can be kept in place for quite some time.
A quick overview of codes
Be sure to understand the language and country locale codes you will use on your site. Many people use these identifiers interchangeably, but to ensure strong performance, care should be taken.
Language codes are the common identifiers used to designate languages. A few specifications for them can be found in the footnotes. Generally, they are two-letter codes, sometimes followed by a regional designation. For example, 'ja' is the code for the Japanese language, while 'jp' refers to Japan as a country.
Country codes are more standardized from a web perspective since Internet registrars work on a country-by-country basis. The same codes used in top-level domain suffixes are often used as a subdomain prefix for localized instances of a website.
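To make the distinction between language and country concrete, here is a minimal sketch that splits a simple locale tag into its two parts. The `Locale` type and the `parse_locale` helper are names invented for this illustration, and the sketch assumes tags of the form 'ja' or 'ja-JP' only.

```python
from typing import NamedTuple, Optional


class Locale(NamedTuple):
    language: str          # e.g. "ja", lowercase by convention
    region: Optional[str]  # e.g. "JP", uppercase by convention, or None


def parse_locale(tag: str) -> Locale:
    """Split a 'language' or 'language-REGION' tag into its parts.

    Only the two simple shapes discussed above are handled; script
    subtags and other extensions are out of scope for this sketch.
    """
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    region = parts[1].upper() if len(parts) > 1 else None
    return Locale(language, region)
```

For example, `parse_locale("ja")` yields a language with no region, while `parse_locale("ja-JP")` carries both the language and the country.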
Common approaches to URL structure
There are three common ways you can structure URLs for your multilingual website.
Top-Level Domains
Top-level domains are the least flexible and the most costly from a domain registration perspective. However, they are often considered the easiest to implement. Because you are registering a whole new domain, the locale code you use must be the country code set by Internet registrars.
Since this will be country-specific, you will need to make decisions as to what language should be set as the default on the site for countries that may have a number of language choices.
Subdomains
Subdomains give greater flexibility because you won't need to register a new domain. However, you'll still need to modify your domain's DNS records in order to make the subdomain available.
Additionally, you can use the locale designation for language instead of a country identifier. This allows for multiple languages with different region codes and not just the country code.
Subdirectories
Subdirectories offer even more flexibility over subdomains. A subdirectory simply allows you to add the locale code within the URL path of your site. Similar to subdomains, these can be either country specific, language specific, or a mixture of both.
Also, with subdirectories you don't have to modify your DNS records, so it's a great solution for some hosting solutions where your control over the site is limited.
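The three structures can be compared side by side with a small sketch. The `localized_urls` function and the example.com domain are placeholders for illustration, and the sketch assumes a simple two-part domain such as example.com.

```python
from typing import Dict


def localized_urls(domain: str, path: str, language: str, country: str) -> Dict[str, str]:
    """Return the same page under each of the three URL structures.

    `domain` is assumed to be a simple two-part name like "example.com".
    """
    name = domain.rsplit(".", 1)[0]  # "example.com" -> "example"
    return {
        # A country-code top-level domain must use the registrar's country code.
        "top_level_domain": f"https://{name}.{country}{path}",
        # Subdomains and subdirectories may use a language code instead.
        "subdomain": f"https://{language}.{domain}{path}",
        "subdirectory": f"https://{domain}/{language}{path}",
    }
```

Calling `localized_urls("example.com", "/pricing", "ja", "jp")` gives one URL per structure, which makes it easy to see that only the top-level domain option is tied to the country code.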
Measurement and tracking
Another concern is how you are going to measure the traffic to your localized sites and pages.
One common way to measure traffic based on different locales is to use country information. This information can be determined in a number of ways. First, in most cases, the browser is aware of the country. Often this is set when the browser is first set up or installed. The downside to this metric is that it doesn't tell us where the visitor currently is. Therefore many businesses choose to augment this data with GeoIP lookups. However, these lookups can still return some inaccurate data.
Measuring traffic based on the country is very easy and will often be a part of your analytics package. Additionally, designing a URL structure that complements this approach is simple and straightforward. However, because country data is imprecise, your accuracy and confidence in the results might suffer.
Another way to measure your traffic is to use language. Language tends to be a more accurate approach since users tend only to use languages they understand. Since this is a setting that directly affects how the browser works, it's a safer assumption that most visitors have their language settings 'correct'. (Remember that 'correctness' is relative to the user). Most analytics packages also include statistics based on language as a default.
A more complex approach to international traffic measurement is a hybrid of language AND country. In this approach, traffic is counted only where the data points agree (for example, a visitor's browser is set to US and English, and GeoIP shows they are likely in the US). While this approach raises confidence in our measurements, we might also be ignoring important data points.
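A rough sketch of the hybrid approach follows. It assumes each visit is recorded as a pair of the browser locale (such as 'en-US') and the country code returned by a GeoIP lookup; the `tally_confident_visits` name and that record shape are assumptions made for this example, not part of any analytics package.

```python
from collections import Counter
from typing import Iterable, Tuple


def tally_confident_visits(visits: Iterable[Tuple[str, str]]) -> Tuple[Counter, int]:
    """Count visits whose browser region agrees with the GeoIP country.

    Each visit is (browser_locale, geoip_country), e.g. ("en-US", "US").
    Returns the per-(language, country) counts and the number of visits
    skipped because the signals disagreed or no region was set.
    """
    counts: Counter = Counter()
    skipped = 0
    for browser_locale, geoip_country in visits:
        language, _, region = browser_locale.replace("_", "-").partition("-")
        if region and region.upper() == geoip_country.upper():
            counts[(language.lower(), region.upper())] += 1
        else:
            skipped += 1  # these are the data points the hybrid approach ignores
    return counts, skipped
```

The skipped count is worth reporting alongside the results, since it shows how much traffic the stricter measurement leaves out.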
User experience
Another consideration when designing your internationalized structure is the experience your users see. If a user's browser is set to a specific language, they should see your site in that language. While this seems straightforward, it's highly dependent on your URL structure.
The most straightforward solution is to detect the browser's language setting and then redirect the visitor to a predefined URL. However, we still need to give visitors the ability to choose a different language themselves, so we must also implement a UI language picker. In the case where we are using asynchronous JavaScript to do our translations, we can change the language immediately without needing the browser to reload the page. If instead we are using the visitor's country to control the user experience, things get much trickier and raise quite a few questions.
Does it make sense to redirect a user immediately to another page?
How do we match the country to the language?
Is this type of functionality desirable? (One example where it might be is an e-commerce site where visitors cannot place international orders.)
Answers to these questions can really help to guide our decisions around URL structure.
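The browser-language detection described above usually relies on the Accept-Language request header. The following is a minimal sketch of choosing a supported language from that header; `pick_language` is a hypothetical helper, the supported list is assumed to contain lowercase codes, and a real site would let the visitor's own choice from the language picker override the result.

```python
from typing import Sequence


def pick_language(accept_language: str, supported: Sequence[str], default: str) -> str:
    """Pick the best supported language from an Accept-Language header value."""
    candidates = []
    for item in accept_language.split(","):
        piece = item.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # a missing q-value means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed weight as "not acceptable"
        candidates.append((quality, tag.strip().lower()))

    # Highest preference first; the sort is stable, so ties keep header order.
    candidates.sort(key=lambda c: c[0], reverse=True)
    for quality, tag in candidates:
        if quality <= 0:
            continue
        if tag in supported:
            return tag
        primary = tag.split("-")[0]  # "fr-ca" falls back to "fr"
        if primary in supported:
            return primary
    return default
```

With a subdirectory structure, the returned code can then be used to build the redirect target, for example by prefixing the requested path with the language code.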
Publish your languages
Besides making it clear to users how localized content is organized, we need to tell search engines as well.
Using sitemaps
The easiest way to show indexers where your localized pages reside is to use a sitemap. Sitemaps list the link hierarchy of your website, so they can be used to indicate your localized pages. Sitemaps are easy to generate since many content management systems will build them automatically for you. If you have a static site, there are also desktop tools that aid in building and maintaining them.
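If your CMS does not generate one for you, a sitemap that lists localized alternates can be produced with a short script. The sketch below assumes the commonly used convention of adding xhtml:link elements with rel="alternate" and hreflang to each url entry; the `build_sitemap` function and the input shape (one dictionary of language code to URL per page) are choices made for this example.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_sitemap(pages: List[Dict[str, str]]) -> str:
    """Build a sitemap where every localized URL lists all of its alternates.

    Each item in `pages` maps a language code to the URL of that version.
    """
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("xhtml", XHTML_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for alternates in pages:
        for url in alternates.values():
            entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
            ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
            # Every version lists the full set of alternates, itself included.
            for lang, href in alternates.items():
                ET.SubElement(
                    entry,
                    f"{{{XHTML_NS}}}link",
                    {"rel": "alternate", "hreflang": lang, "href": href},
                )
    return ET.tostring(urlset, encoding="unicode")
```

The returned string is the XML body of the sitemap; check the output against the documentation of the search engines you care about before submitting it.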
Using HREFLANGs
Another approach is to use HREFLANG tags in the actual pages to tell the indexers where our localized pages can be found. It's important that all localized pages are listed because an indexer could reach our site through an inbound link to a localized page. For static sites, adding these tags is very easy because the languages and URLs do not change. However, if your site is hosted on a content management system, this can be more challenging, especially if the system does not support internationalization.
If you are using a content management system with no internationalization support, there are a couple of ways to address this issue. The first approach is to store the hreflang tags as data within the CMS on each page. This approach allows a great deal of flexibility since URLs can be altered on an individual level. The downside is that this set of tags needs to be managed by hand.
A better, more automated solution can exist if your content management solution is also handling the set of rules regarding how localized URLs are built. If you have access to this set of rules, then all you need is to maintain a list of the language codes. Also, if you wish to avoid adding the source language in these tags, you will need to have a way to identify when the user is on a 'source' page.
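As an illustration of the rule-based approach, the sketch below generates the tags from a list of language codes. It assumes one particular rule, a subdirectory per language with the source language served from the unprefixed path; the `hreflang_tags` function and that rule are assumptions for this example, so substitute whatever rules your CMS actually applies.

```python
from html import escape
from typing import List, Sequence


def hreflang_tags(base_url: str, path: str, languages: Sequence[str], source_language: str) -> List[str]:
    """Generate one hreflang link tag per language for a given page path.

    Assumed rule: the source language lives at the unprefixed path and
    every other language lives under a "/<language>" subdirectory.
    """
    tags = []
    for lang in languages:
        if lang == source_language:
            href = f"{base_url}{path}"
        else:
            href = f"{base_url}/{lang}{path}"
        tags.append(
            f'<link rel="alternate" hreflang="{escape(lang)}" href="{escape(href)}" />'
        )
    return tags
```

Because the tags are derived from the path and the language list alone, adding a language means updating one list rather than editing every page.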
Using HTTP headers
The last option is to use HTTP headers to provide the hreflang information. This approach is the trickiest and the least reliable, because server response headers are often manipulated by downstream network filters. If you are taking this approach, be sure to test your setup with search engine bots.
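For reference, the header used for this is typically a Link response header that carries one entry per language. The sketch below only builds the header value; the `hreflang_link_header` name is made up for this example, and how the header is attached to a response depends on your server or framework.

```python
from typing import Dict


def hreflang_link_header(alternates: Dict[str, str]) -> str:
    """Build a Link header value from a mapping of language code to URL."""
    return ", ".join(
        f'<{href}>; rel="alternate"; hreflang="{lang}"'
        for lang, href in alternates.items()
    )
```

The resulting string would be sent as the value of the Link header on every localized version of the page.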
Testing your implementation
Don't just implement a solution and hope it works. Test it out. Most search engines provide an interface that lets you make sure your localized pages have been properly detected. And if you submitted a sitemap, the search engine should warn you of any errors.
Advanced integrations
The last aspect of international SEO involves advanced integrations that enable the translated language to be detected by search engines and other online content indexers.
Content management without internationalization
People often find themselves using a content management solution that does not offer internationalization support. Without this support built into the content management system, they frequently resort to 'hack' solutions. These solutions usually replicate the translated content in some way, often as multiple posts or even multiple sites. This is one case where a JavaScript-based solution like Transifex Live is very helpful. Since translations are stored remotely, no 'hacking' is required.
JavaScript and SEO concerns
There is quite a bit of evidence that Googlebot specifically does a good job of indexing JavaScript content, and even Google seems to indicate this is being done.
However, if you want to support all other indexers and bots as well, there is currently only one option that ensures all of your content can be indexed everywhere. This solution uses the 'prerendered' approach to display all content as 'source' content. It is the most direct approach because it does not rely on any specific indexer technology. The goal is to present source content to the browser in the way you would want the search engine to see it. If your content comes from JavaScript, you will need a separate process that 'prerenders' the JavaScript and produces static HTML. The static HTML is then served directly to the search engine, which does not need to be aware of the JavaScript implementation.
Please note that a previous specification for indexing JavaScript, often referred to as '_escaped_fragment_' (using '#!' in your URL), was part of a Google specification for Ajax crawling. Google officially deprecated it in October 2015, so you should choose a different approach for new sites.
Further reading