Sunday, October 26, 2014

Encoding / decoding UTF8 in javascript

From time to time it has somewhat annoyed me that UTF8 (today's most common Unicode transport encoding, recommended by the IETF) conversion is not readily available in browser javascript. It really is, though, I realized today:
function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}
2012 Update: Monsur Hossain took a moment to explain how and why this works. It's a good, in-depth post citing all standards in play so you need not bring a wizard's beard to know why it works. Executive summary: escape and unescape operate solely on octets, encoding/decoding %XXs only, while encodeURIComponent and decodeURIComponentencode/decode to/from UTF-8, and in addition encode/decode %XXs. The hack above combines both tools to cancel out all but the UTF-8 encoding/decoding parts, which happen inside the heavily optimized browser native code, instead of you pulling the weight in javascript. 

Tested and working like a charm in these browsers:
Win32
  • Firefox 1.5.0.6
  • Firefox 1.5.0.4
  • Internet Explorer 6.0.2900.2180
  • Opera 9.0.8502
MacOS
  • Camino 2006061318 (1.0.2)
  • Firefox 1.5.0.4
  • Safari 2.0.4 (419.3)
Any modern standards compliant browser should handle this code, though, so don't worry that it's a rather sparse test matrix. But feel free to use my test case: encoding anddecoding the word "räksmörgås". That's incidentally Swedish for a shrimp sandwich, by the way -- very good subject matter indeed.

And if you hand me your platform/browser combination and the its success/failure status for the tests, I'll try to update the post accordingly.
Categories:

No comments:

Post a Comment