Issues to consider when localising software
Have you ever landed on a website where the target audience isn’t people from your country or culture? And how long did it take you to realise this? In all likelihood, you will probably have noticed it almost instantaneously because each region has its own ways of doing things.
A Spanish person will notice that a website wasn’t made by someone Hispanic if the month is capitalised in the date e.g. 10 Agosto 2020 vs 10 agosto 2020. Similarly, a British person will realise a website is American when they read a spelling that omits the letter ‘u’ where they would usually use one e.g. color vs colour. These are both examples of localisation.
Localisation, according to the Oxford English Dictionary, is the adaptation of something for a local audience or market. It’s a big part of everyday life for most people and is hugely important for companies and organisations alike. Localised content is more appealing to customers, and opens companies up to new markets.
Translation is closely related to localisation, and refers specifically to the transcription of content from one language to another. Localisation projects may involve translation, but it’s not always required, such as when localising English or Spanish sites for different English or Spanish-speaking markets.
Dates and Numbers
Localising websites can be a fascinating process, full of interesting edge-cases to consider. Dates are a good example of this: think about the date 10/12/09 – what does this pattern of numbers mean to you? If you’re American, you’ll probably read this as the October 12th 2009, but as Brits, we would see this as being 10th December. A Hungarian might consider it to mean 9th December 2010 (although they would use dots instead of slashes to separate years, months and days).
A naive site owner may attempt to circumvent this ambiguity by defaulting to written date formats, but this quickly becomes challenging when you realise that the “th” (Ordinal indicator) and month needs to be translated. For example, 10th December would translate to “10 dicembre” in Italian.
Translating month names is still an achievable problem, but things become more challenging still, once we start considering languages like Chinese, Korean or Vietnamese. For example, did you know that in Vietnamese, months are numbered rather than named? If it’s 10th August, it would be written ‘Ngày 10 tháng 8’, which literally translates to day 10 of month 8.
Luckily for developers, the Unicode organisation have collated a resource of localisation data for every locale on the planet. This Unicode CLDR Project (Common Locale Data Repository), contains the most common short, medium, long and full date formats for each locale, in either XML or JSON format.
The CLDR data also contains data around number and currency formatting. If you’re British, have you ever gone to a Supermarket in Germany and seen that something costs €1,99 instead of €1.99? This is because in Germany, a comma represents the distinction between whole numbers and decimals, or in this case, euros and cents. Number formatting is another case where localisation is absolutely necessary, especially considering that there are so many different formats used worldwide. Take the number one million, two hundred thirty four thousand, five hundred sixty seven, point eight nine. In British English we would write:
But as was mentioned earlier, in Germany it would be:
In French, you would see:
1 234 567,89
Even further afield, in Indian English, the number would be presented as:
Now, we can start to see how localisation is not only crucial in making content appealing, but also in making it comprehensible. A British person might misinterpret a number written in the format of Indian English if they aren’t familiar with the different numbering system.
CLDR data is a great resource for things like date and number formats, as well as translations of very standardised words, such as place names, time zones and date-related wording. For actual translation though, the content in a site or application needs to be made available to translators.
The three-step-translation process
Software translation can be seen as a three-step-process:
- Extracting strings from the codebase into a format that the translators can work with
This could be a manual process where developers manually extract translation keys into a structured file format, or the translatable strings (words) could be extracted from the code by a build process.
This is where a bilingual speaker actually does the translation – this is likely to be done via a dedicated translation interface, such as the Poedit desktop app. This step may also involve machine-translation tech to avoid needing real translators.
- Translation import
At this step, the translated strings are saved into a format the codebase can use, and imported back into the application.
For some open source projects, like WordPress, the actual translation process is outsourced to the community and anyone can contribute at https://translate.wordpress.org/. This can be a great way to get familiar with the challenges translators face (and if you’re not bilingual you can still translate WordPress into your local flavour of English!).
In English we have just one plural form:
Singular: Second, Minute, Hour
Everything else: Seconds, Minutes, Hours
Therefore, if you have the strings “You have 1 new message” and “You have 2 new messages”, then you may be tempted to set these up as two standard strings in your translation files, and switch depending on if the number of messages is equal to exactly 1.
This only really works for the Germanic and Romance languages though. Take Polish as an example:
Singular: sekunda, minuta, godzina
Ends in 2-4, excluding 12-14: sekundy, minuty, godziny (2, 3, 4, 22, 23, 24, 32, 33, 34…)
Everything else: sekund, minut, godzin
Chinese, Japanese and Korean go the other way and only have a single form.
Any developer will tell you that writing this logic from scratch would be a painful, error-prone process, and there are at least 20 unique plural forms to consider. Thankfully the popular gettext translation system (used by the Linux OS and WordPress) handles this by default.
Beware of concatenation
As programmers, we can also make lives easier for translators. Imagine a company wants to translate their website into the number of languages corresponding to the audiences they want to reach. This is expensive, and so there is a natural inclination to want to reduce the number of required translations as far as possible. Unfortunately, this doesn’t work in practice.
Imagine you are developing a new content management system (CMS), where you have a number of editable entities such as pages, posts menus and users. Each of these will need labels such as “Edit page”, “New page”, “delete page”.
Many developers would be tempted to setup these three phrases as 4 distinct strings:
This is great, until you consider that some languages have an added part of the language to account for: the gender of nouns. “New” has two translations into Spanish: “Nuevo” or “Nueva”. The root of the adjective ‘new’ in Spanish is nuev-. The subsequent ending is dependent on the gender of the noun, so in the case of “New page” it has to be feminine: nueva instead of nuevo.
Another example is Mandarin, where some adjectives trigger an additional particle, 的 de, between the adjective and the noun. For example, the phrase ‘strange pattern’ would translate to:
Qíguài de tú’àn
Languages are a challenge
These are just some ways of showing how languages are different and don’t translate directly or follow the same patterns all the time. Languages are hard, and it’s unreasonable to expect that developers or site owners will be able to localise a website or other platform perfectly for every locale without incredible cost. However, we firmly believe that we have a responsibility to take as much care with any localisation work as we take with any English software we work on. And, it’s a new challenge that keeps us on our toes. What’s not to like?