Converting text to Basic Latin (aka removing accents) with JavaScript

I was recently working on a decision engine for a quiz site (KwizMi.com). The quizzes allow you to define a table of questions and answers, and users playing the quizzes have to try to guess all the answers in a given period of time.

The problem I stumbled upon is best illustrated by the following example:

  • Question: Which "GP" plays at centre-back for Barcelona?
  • Answer: Gerard [Piqué]

The square brackets in KwizMi syntax mean that the text "Piqué" need only appear as a substring of the answer for it to be marked as correct. The problem here is that an English user with an English keyboard will know the answer as "Pique" (and, without memorizing keyboard shortcuts, wouldn't even be able to type the é), and for the purposes of the quiz this is good enough. A Spanish user may be able to type the é correctly, and that should be marked as correct too.

The obvious solution is to build a regular expression that replaces accented characters with their unaccented counterparts, and that would work fine for most cases. On further inspection, however, the Unicode standard defines well over 1,000 characters under the name "LATIN", which is far too many to list by hand.

The Unicode standard defines normalization forms that decompose accented characters, but they don't decompose some ligatures such as Æ and Œ (AE / OE). Instead, I've used the Unicode character names to generate this table of mappings with a Perl script (credit: David Chan):
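To illustrate the limitation of the normalization approach, here is a minimal sketch. It assumes an engine that supports the ES2015 String.prototype.normalize method, and stripCombiningMarks is a hypothetical helper name that is not part of latinise.js. Accented letters decompose into a base letter plus a combining mark that can be stripped, but characters such as Æ and ß have no decomposition and pass through unchanged.

```javascript
// Hypothetical helper (not part of latinise.js): decompose the text with
// canonical decomposition (NFD), then remove the combining marks.
function stripCombiningMarks(text) {
  // U+0300..U+036F is the Combining Diacritical Marks block.
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

console.log(stripCombiningMarks("Piqué"));  // "Pique"
console.log(stripCombiningMarks("Æon"));    // "Æon" (no decomposition exists)
console.log(stripCombiningMarks("Straße")); // "Straße" (no decomposition exists)
```

This is why a lookup table keyed on individual characters can cover cases that normalization alone does not.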

var latin_map = {
  'Á': 'A', // LATIN CAPITAL LETTER A WITH ACUTE
  'Ă': 'A', // LATIN CAPITAL LETTER A WITH BREVE
...
  'ᵥ': 'v', // LATIN SUBSCRIPT SMALL LETTER V
  'ₓ': 'x', // LATIN SUBSCRIPT SMALL LETTER X
};

Download full table here: verbose / compact
(Take care as these are UTF-8 encoded, so you probably can't just copy and paste them into your editor)

And the following extensions to the String object:

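// Replace every character outside A-Z, a-z and 0-9 with its mapping,
// leaving characters that have no entry in latin_map unchanged.
String.prototype.latinise = function() {
	return this.replace(/[^A-Za-z0-9]/g, function(x) { return latin_map[x] || x; });
};

// American English spelling :)
String.prototype.latinize = String.prototype.latinise;

// True if latinising the string would not change it.
String.prototype.isLatin = function() {
	return this == this.latinise();
};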

Here are some examples:

> "Piqué".latinise();
"Pique"
> "Piqué".isLatin();
false
> "Pique".isLatin();
true
> "Piqué".latinise().isLatin();
true
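Tying this back to the quiz example, a substring check could be sketched roughly as follows. This is only an illustration: containsAnswer is a hypothetical helper, the three-entry latin_map is a tiny excerpt standing in for the full table, and ignoring letter case is an assumption of this sketch rather than something the quiz engine is known to do.

```javascript
// Tiny excerpt of the mapping table, just enough for this example.
var latin_map = { 'é': 'e', 'É': 'E', 'ß': 'ss' };

String.prototype.latinise = function() {
  return this.replace(/[^A-Za-z0-9]/g, function(x) { return latin_map[x] || x; });
};

// Hypothetical helper: true if the guess contains the required text,
// ignoring accents and (by assumption) letter case.
function containsAnswer(guess, required) {
  var g = guess.latinise().toLowerCase();
  var r = required.latinise().toLowerCase();
  return g.indexOf(r) !== -1;
}

console.log(containsAnswer("Gerard Pique", "Piqué")); // true
console.log(containsAnswer("gerard piqué", "Piqué")); // true
console.log(containsAnswer("Gerard Puyol", "Piqué")); // false
```

Latinising both the guess and the stored answer means neither side has to be typed with the accent for the match to succeed.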

All scripts are Public Domain.

Attachments:

  • latinise.js (Source), 54.18 KB
  • latinise.min.js (Minified), 7.98 KB
  • latin_map.pl, 37.01 KB

Comments

Thanks! Very good solution!

'ß' : 'ss', // was missing..

Good point.

The Verbose version of the file has got => rather than : against the added ß

Excellent script, thank you.

Well done. I stumbled into this by way of StackOverflow while searching for a way to test, with Javascript, if strings are "equal" in the same sense as MySQL, where SELECT "José López" = 'jose lopez' returns true, and would give you a duplicate entry error if you had unique key constraints on your given_names and surnames field, and tried to insert the one string when the other already exists. Loading the whole Unicode table into memory seems like a bit much, but hitting the server via xhr to test it seems a bit much too. As often happens, there's no free lunch or perfect solution.

Thanks again.

It's a small subset of the Unicode table. In terms of memory it's a few hundred bytes, so less than a JPEG thumbnail and not much of an issue.

Haha I'm working on my own quiz site and have exactly the same problem.

What is the license for the code? Please consider publishing it on Github under an MIT license.

It's Public Domain as stated above.