Internationalization Guide for Java Web Applications

One World, One Character Set

Table of Contents

Introduction
Changes
Foreword
Editing Files
Editing .properties Files
Default File Encoding
Compiling Source Code
Configuring Application Server
Configuring the Web Server
Servlets
JSP Files
HTML
JavaScript
Cookies
HTTP GET
HTTP POST (aka HTML Forms)
Database Server
JDBC Driver
Collation
Determining language
Download
Tools
Success Stories
Troubleshooting
Applications
Future Plans / TODO
Feedback

Changes

Version 1.16 20120823: Revised version. PostgreSQL 9.1 collation support, mm... sweet!
Version 1.15.1 20100809: Revised version. Fixed a typo (missing parenthesis) from example code. Thanks Vesa.
Version 1.15 20100128: Revised version. Added ToProperties.java
Version 1.14 20091002: Revised version. Added clarification to jsp section. Requested by Brian.
Version 1.13 20071118: Revised version. Quick note: Apparently not all characters are allowed in the name of the cookie.
Version 1.12 20071022: Revised version. Added a success story by Sven S.
Version 1.11 20070924: Revised version. Moved things around. Added cookie UTF-8 test to Debug application. Added Language -section
Version 1.10 20070919: Revised version. Added Debug application. Inspired by Nelson Antony. Thanks to Oskari Vuori for testing help. He is a great guy despite his Microsoft connections.
Version 1.9 20070919: Revised version. Added JavaScript charset, changed 'language' attribute to 'type' (deprecated in HTML 4.01)
Version 1.8 20070919: Revised version. Fixed minor things. Expect more updates in the near future! (Thanks guys for the feedback!)
Version 1.7 20050829: Revised version. Updated collation section (thanks to Kuisma Lehtonen for MS SQL Server info)
Version 1.6 20050828: Revised version. Added collation section (inspired by mohsen)
Version 1.5 20050825: Revised version. Added explanation of Unicode/UTF-8 (req by Mike Miller)
Version 1.4 20050824: Revised version. Janne Hietala suggested adding 'Troubleshooting' -section
Version 1.3 20050824: Revised version. Contributors: Tomi Panula-Ontto (Added a section for Web Server configuration)
Version 1.2 20050824: Revised version. Contributors: Tomi Panula-Ontto (JavaScript info, revised the layout [Thanks to Nifty Corners])
Version 1.1 20050823: Revised version. Contributors: Vesa Hiltunen (reminded of servlets, request and response handling), Tomi Panula-Ontto (HTTP GET)
Version 1.0 20050823: Initial version. Contributors: Tomi Panula-Ontto

Foreword

I've spent more than enough time solving internationalization problems, which can be very time-consuming bugs to track down. If this guide helps you out, great, but even better if you have something more to share. Projects come and go, and every project has its own problems. Please send me more information on the subject! Also, send success stories if this FAQ helped you out.

If you have good material you'd like to share, please let me know and I'll add it to this document. Questions welcome, too!


Introduction

This short guide tries to cover all the details required to write web applications that are capable of handling the Unicode (UTF-8) character set at every step, back and forth. With Unicode, a single internationalization process can produce code that handles the requirements of all the world markets at the same time. I won't go through the benefits thoroughly, since the Unicode Consortium has better answers than I could ever produce.


	Unicode provides a unique number for every character,
	no matter what the platform,
	no matter what the program,
	no matter what the language.

You should know this is not a theoretical document. This is an "On your knees, codemonkey. Hands into the dirt! NOW!" type of document. You may also be interested in the official Java Internationalization FAQ or maybe this Wikipedia article. This document started as a technical look at internationalization (aka ensuring one character set everywhere), but I think it has to tell more about internationalization from the user's point of view, aka localization (like collation, timezones, currencies,...). Only time and feedback will tell..

Typical data flow in the web application:

	Browser <=> Web Server <=> Application Server <=> Your Application <=> JDBC Driver <=> Database

If you look at the data flow, you'll soon realize that one glitch anywhere along the way ruins the whole chain. This document aims to help YOU write (web) applications that work flawlessly when it comes to character encoding issues.

I am using the Resin application server by Caucho Technology in the tips, but I hope to support other application servers as well in the future. User feedback is more than welcome!

This material is copyrighted material of the author and all the contributors. All the credit to those who have contributed. Please do not copy this document, link to it instead. I'm aware that this document may have inaccuracies. If you take a copy of this I will never be able to fix the copied document. I recommend you report all inaccuracies to me.

Editing Files

You are probably using some text editor to write your code, XML configuration files, etc. I suggest you find a decent text editor which supports UTF-8 and from then on write all files in that encoding. Accept no other encoding from anyone. Make sure your text editor also reads and writes UTF-8 files correctly. Check out the tools section.

Editing .properties Files

You shouldn't edit property files with a normal text editor. Why's that? Because Java reads .properties files as iso-8859-1, and text outside plain ASCII is safest written as Unicode escapes (\uXXXX). Here is an example:

	key=This is a sample key.\u00F6\u00E4\u00E5

If your text editor cannot write these files correctly, then I suggest you change your text editor. Check out the tools section.
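To see what the escapes mean to Java, the example line can be fed to java.util.Properties and the decoded value inspected. This is only a minimal sketch; the class name EscapedPropertiesDemo is made up for illustration and is not part of this guide's downloads.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical demo class, for illustration only.
class EscapedPropertiesDemo {

    // Parses properties-formatted text and returns the decoded value of one key.
    static String valueOf(String propertiesText, String key) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(propertiesText));
        return props.getProperty(key);
    }

    public static void main(String[] args) throws IOException {
        // The doubled backslashes stop javac from translating the escapes itself,
        // so Properties.load sees the same characters as in the file on disk.
        String line = "key=This is a sample key.\\u00F6\\u00E4\\u00E5";
        String value = valueOf(line, "key");
        // Compare against the real characters (written as escapes so that this
        // source file compiles the same way in any encoding).
        System.out.println(value.equals("This is a sample key.\u00F6\u00E4\u00E5"));
    }
}
```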

PROPERTIES CONVERTER!

I got tired of trusting text editors. Even Eclipse doesn't always store properties files in the correct format. Maybe it's just a configuration error on my side or whatever, but I don't have time to care, so I wrote a tool which lets me write my properties file with the tools at hand and then convert it into a legitimate file. See the Download section.

Default File Encoding

VERY IMPORTANT!

Remember that the default file encoding also affects System.out and System.err, so if you are doing any output using them you might get incorrect results.

When a Java application opens a file, it assumes the file is encoded in the default file encoding, and that depends on the platform under the JVM. When you start editing files in UTF-8, you should also tell your application that all your files are UTF-8 encoded.

Here is how to do that:

	java -Dfile.encoding=UTF-8 MyGreatApp

Since you are using an application server, I suggest you put that property setting somewhere in the startup script.
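The JVM-wide setting can be backed up in code by naming the encoding every time a file is opened, so the result no longer depends on the platform default. The following is a minimal sketch under that assumption; the class ExplicitEncodingDemo and the temporary file are made up for illustration.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class ExplicitEncodingDemo {

    // Writes text as UTF-8 regardless of the file.encoding system property.
    static void writeUtf8(File file, String text) throws IOException {
        try (Writer writer = new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    // Reads the whole file as UTF-8 regardless of the platform default.
    static String readUtf8(File file) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)) {
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("utf8-demo", ".txt");
        file.deleteOnExit();
        String sample = "\u00F6\u00E4\u00E5"; // o-umlaut, a-umlaut, a-ring
        writeUtf8(file, sample);
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("round trip ok: " + sample.equals(readUtf8(file)));
    }
}
```

Code written this way keeps working even when someone forgets the -Dfile.encoding switch in a startup script.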

Compiling Source Code

You need to specify the encoding of your source files, because, remember, you are using UTF-8 and not the platform's default encoding!

	javac -encoding UTF-8 MyPreciousSource.java

Configuring Application Server

Let's take a look at this snippet from the resin.conf config file. The important parts are analyzed below.

 1 <caucho.com>
 2        <java compiler='C:\j2sdk1.4.2_03\bin\javac.exe' args="-g -encoding UTF-8"/>
 3        <http-server app-dir='c:\workspace\myproject\web' class-update-interval='2'
 4                        character-encoding='UTF-8'>
 5                <classpath encoding='UTF-8'/>
 6                <jsp static-encoding='false' fast-jstl='false'/>
 7                <http port='80'/>
 8                <servlet-mapping url-pattern='/servlet/*' servlet-name='invoker'/>
 9                <servlet-mapping url-pattern='*.jsp' servlet-name='com.caucho.jsp.JspServlet'/>
10        </http-server>
11 </caucho.com>

Analysis

  1. The 2nd line specifies the character encoding used by the source files. Okay, this may be redundant since it's mentioned in the classpath config too.. but I'm getting paranoid about this.
  2. Take a look at the 4th line. It tells the application server to use UTF-8, for example when reading parameters from the HTTP request.
  3. The 5th line defines the encoding of the source code that gets compiled in the classpath directory (auto compiling).
  4. The 6th line disables static encoding and fast-jstl. The fast-jstl implementation uses the iso-8859-1 character set, so naturally we wish to disable such behaviour. Disabling static encoding may or may not be useful. I'll have to verify this.

Configuring the Web Server

Check that your web server, for example Apache, does not add or override any headers that might conflict with your application.

For example, the AddDefaultCharset directive in the Apache configuration is really an override, not a default. If you leave it on, Apache will rewrite the Content-Type header with the iso-8859-1 encoding, which isn't really what you want. Now, be a good codemonkey, edit your httpd.conf or site-specific config and make sure it has a line which says:

	AddDefaultCharset Off

Servlets

You have to set the encoding for the request before reading request parameters or reading input using getReader(). Likewise you have to set the encoding for output before writing output using getWriter().

Here is an example:

import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class MyServlet extends HttpServlet {
	protected void doPost(HttpServletRequest request, HttpServletResponse response)
						throws ServletException, IOException {
		// set the encoding for the input parameters (before reading any of them)
		request.setCharacterEncoding("UTF-8");
		String test = request.getParameter("test");
		if( test==null ) test = "input parameter 'test' was null";

		// set the content type AND encoding for the output (before calling getWriter())
		ServletContext context = getServletContext();
		int major = context.getMajorVersion();
		int minor = context.getMinorVersion();
		// setCharacterEncoding() on the response exists since Servlet API 2.4;
		// compare major and minor separately so that e.g. 3.0 passes the check too
		if( major>2 || (major==2 && minor>=4) ) {
			response.setContentType("text/html");
			response.setCharacterEncoding("UTF-8");
		} else {
			response.setContentType("text/html;charset=UTF-8");
		}
		PrintWriter out = response.getWriter();
		out.print( test );
	}
}

Thanks to Vesa Hiltunen for bringing this up. I forgot this in the initial version. Note that the API changed slightly in Servlet API version 2.4.

JSP Files

VERY IMPORTANT!

Never use <%@ include file="..." %> or <jsp:include page=".."/> to share the same settings between pages, because they don't work the way you think they do. Always have these settings in every JSP file.

Updated 2009-10-02: It seems that <%@include file="..." %> works ok now, but check the HTTP headers, just to make sure your appserver isn't playing you.

Let's take a look at a JSP file fragment.

 1 <%@page 
 2        contentType="text/html; charset=UTF-8"
 3        pageEncoding="UTF-8" 
 4 %>

Analysis

  1. Second line defines the content type and encoding for the output.
  2. Third line defines the encoding for the JSP file, which naturally should be UTF-8

Updated 2009-10-02
Brian asked about my warning:
"Can you please explain why this doesn't work and also what the right way is to achieve this?"

Suppose you have two JSP files, page.jsp and header.jsp. Your plan is to keep the configuration in one file so that you don't need to modify multiple files if anything changes. Let's assume that for some odd reason your application handles everything in the iso-8859-1 charset, but since your projects usually use UTF-8, your server has been set to assume UTF-8 as the default.

header.jsp:

	<%@page
	      contentType="text/html; charset=iso-8859-1"
	      pageEncoding="iso-8859-1"
	%>

page.jsp:

	<jsp:include page="header.jsp" />
	..some other stuff..

Our browser requests page.jsp as follows (simplified)

	GET /page.jsp HTTP/1.0
	Host: test.mydom.tld

You would expect to see the iso-8859-1 charset in the Content-Type header, since header.jsp was included into page.jsp? Sorry, it didn't work out for you. The response comes in UTF-8, as instructed by the default encoding, instead of the expected iso-8859-1.

HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Fri, 02 Oct 2009 15:06:22 GMT
  Server: Apache
  Cache-Control: private
  Set-Cookie: JSESSIONID=abcFUCf5XWaXFhikzYvqs; path=/; HttpOnly
  Content-Type: text/html; charset=UTF-8
  Content-Length: 12
  Keep-Alive: timeout=15, max=100
  Connection: Keep-Alive

The problem is that <jsp:include page=".." /> is internally a separate request, and so is its response. Headers set on that response are discarded and will never be seen by the browser.

However, it seems that <%@include file="header.jsp" %> does the trick. It didn't when I originally wrote this document. I wonder if Caucho [Resin] fixed some bug related to this? Anyway, check out how your application server -really- behaves. Self awareness never killed anyone, ok? :)

HTML Pages

Let's look at the code once again.

 1	<html>
 2	<head>
 3         <title>My Precious Form</title>
 4         <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
 5      </head>
 6	<body>
 7      ...
 8	</body>
 9	</html>

Analysis

  1. Look at the 4th line. This is necessary to tell the browser that we are sending HTML with UTF-8 encoding (keep in mind you have to write the document in UTF-8 encoding, too!). This is probably unnecessary as it is already done by the JSP, but I suggest you still put it there. The browser may fail to recognize the HTTP header.

JavaScript

So you wish to do some JavaScript output? Can do!

...and as always, caveats exist. You have to encode the string on the server before writing it into the page, and then decode it in the browser with the JavaScript function decodeURIComponent() before using it, except if you are going to use it in a URL string. Do not use the JavaScript function unescape(), since it does not understand UTF-8 sequences!

Example #1: in document.write()

  <script type="text/javascript">
   document.write( decodeURIComponent( "<%= JavaScriptUTF8Encoder.encode("some trashy scandinavian characters like öäå" ) %>" ) );
  </script>

Example #2: in the URL strings

  <script type="text/javascript">
   location.replace('http://www.google.com/search?q=' + "<%= java.net.URLEncoder.encode("testöäå", "UTF-8") %>");
  </script>

JavaScriptUTF8Encoder.java is really a slightly modified URLUTF8Encoder.java by the W3.org.
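The source of JavaScriptUTF8Encoder.java is not reproduced here, so the following is only a guess at the general idea, not the actual class: percent-encode every UTF-8 byte that is not a plain unreserved ASCII character, so that decodeURIComponent() in the browser can turn the result back into the original text. The class name PercentEncoder is made up for this sketch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in, NOT the JavaScriptUTF8Encoder mentioned above.
class PercentEncoder {

    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Returns text with every byte outside A-Z a-z 0-9 - _ . ~ written as %XX.
    static String encode(String text) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF; // treat the byte as unsigned
            boolean unreserved = (v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '_' || v == '.' || v == '~';
            if (unreserved) {
                out.append((char) v);
            } else {
                out.append('%').append(HEX[v >> 4]).append(HEX[v & 0x0F]);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // o-umlaut, a-umlaut, a-ring written as escapes to stay encoding-proof
        System.out.println(encode("\u00F6\u00E4\u00E5"));
    }
}
```

Encoding quotes, backslashes and spaces as %XX also keeps the value safe inside a JavaScript string literal. java.net.URLEncoder is not a drop-in replacement here, because it writes a space as '+', which decodeURIComponent() leaves untouched.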

Update: the <script> tag now has a 'charset' attribute.

Example #3: script tag

  <script type="text/javascript" src="myprecious.js" charset="UTF-8"></script>

TODO: I have not yet tested if the 'charset' -attribute affects the behaviour of JavaScript functions such as unescape()

Cookies

Just a few days ago (today: 24th Sep 2007) I started wondering whether you can put UTF-8 data into a cookie... and yes, it certainly appears so, although I have not tested this very thoroughly.

RFC 2965 says:

The VALUE is opaque to the user agent and may be anything the
      origin server chooses to send, possibly in a server-selected
      printable ASCII encoding.

However, make no mistake: The value of the optional comment attribute of a cookie must be in UTF-8 encoding.

I was fooling around with Tomcat 6.0.14 when I accidentally tried to set a cookie whose name had Scandinavian characters. Tomcat disagreed:

	java.lang.IllegalArgumentException: Cookie name "åäö" is a reserved token

Further investigation showed that not all characters are allowed in the cookie name.
Quoting RFC 2965

NAME=VALUE
      Required.  The name of the state information ("cookie") is NAME,
      and its value is VALUE.  NAMEs that begin with $ are reserved for
      other uses and must not be used by applications.

I tried to find a list of allowed characters for the name of the cookie, but couldn't find a clear answer to that. This isn't particularly important as developers usually choose simple names for cookies and thus, can avoid any problems whatsoever.
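As far as I can tell, RFC 2965 defines the cookie NAME as a "token" in the HTTP/1.1 (RFC 2616) sense: one or more US-ASCII characters, excluding control characters and the separators ( ) < > @ , ; : \ " / [ ] ? = { } plus space and tab. Under that assumption, a name can be checked before it ever reaches the application server. The class CookieNames is made up for this sketch, and a given server may apply stricter or looser rules.

```java
// Hypothetical helper, assuming cookie names must be RFC 2616 tokens.
class CookieNames {

    // Separator characters that RFC 2616 excludes from tokens.
    private static final String SEPARATORS = "()<>@,;:\\\"/[]?={} \t";

    // True if name is a non-empty run of visible US-ASCII non-separator characters.
    static boolean isToken(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            // rejects control characters, space, DEL and everything outside US-ASCII
            if (c <= 0x20 || c >= 0x7F || SEPARATORS.indexOf(c) >= 0) {
                return false;
            }
        }
        return true;
    }

    // Names starting with $ are reserved by RFC 2965, so refuse those as well.
    static boolean isUsableCookieName(String name) {
        return isToken(name) && !name.startsWith("$");
    }

    public static void main(String[] args) {
        System.out.println(isUsableCookieName("JSESSIONID"));
        System.out.println(isUsableCookieName("\u00E5\u00E4\u00F6"));
    }
}
```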

HTTP GET

If you are using HTTP GET from a form, you can follow the instructions for HTTP POST, but be warned: GET can carry only a finite amount of data within the URL. Some recommendations have suggested a limit of 4 kilobytes. That limit is passed surprisingly quickly when dealing with Asian languages. I recommend you use HTTP POST, and use HTTP GET only for links.

However, if you are to send data from a link, you have to encode the data yourself. Take a look at URLUTF8Encoder.java from W3.org.

Example:

	<a href="http://www.google.com/search?q=test%C3%A5%C3%B6%C3%A4">My Search</a>
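One way to produce such a link on the server side is java.net.URLEncoder, which the JavaScript section already uses. A minimal sketch that rebuilds the query string of the link above; the class SearchLinkBuilder is made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class, for illustration only.
class SearchLinkBuilder {

    // Returns a search URL whose q parameter is the UTF-8 percent-encoded query.
    static String searchLink(String query) {
        try {
            return "http://www.google.com/search?q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // every Java platform is required to support UTF-8
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }

    public static void main(String[] args) {
        // "test" followed by a-ring, o-umlaut, a-umlaut, written as escapes
        System.out.println(searchLink("test\u00E5\u00F6\u00E4"));
    }
}
```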

HTTP POST

Let's take a look at the form

 1	<html>
 2	<head><title>My Precious Form</title></head>
 3	<body>
 4		<form action="..." method="post" accept-charset="UTF-8"
 5				enctype="multipart/form-data">
 6		<input type="submit">	
 7		</form>
 8	</body>
 9	</html>

Analysis

  1. The 4th line tells the browser to send any input in the UTF-8 character set.
  2. The enctype in the 5th line not only enables file uploads, but is also said to handle Unicode better. Well, that's at least what people say on the net. I have not found this necessary. I'd like to know more about problems that have required this setting. Feedback anyone?

Database Server

First rule: Choose a database server that can handle UTF-8 or Unicode character encoding. Once I had to take part in a project which had to use a MySQL version that had only iso-8859-1 encoding available, and we had to support both iso-8859-1 and KOI8-R encoded data. It was.. interesting. If I ever meet a project like that again I'll shoot on sight.

I recommend using the PostgreSQL server. It's versatile, fast, free and has excellent Unicode support. However, you should know it lacks some of the collation features you might need in your multinational applications. Update: This is no longer the case. PostgreSQL 9.1 rocks!

	CREATE DATABASE myprecious WITH ENCODING = 'UNICODE';

JDBC Driver

In some database servers it's possible to have a different character encoding for each connection. I have not seen such JDBC drivers in a while, but they used to exist. Reports anyone?

Collation

So, what is collation? Collation is the assembly of written information into a standard order. In other words: sorting. What about that standard order? Well, that depends. Here is an example taken from the Unicode Collation Algorithm:

	Swedish: z < ö 
	German: ö < z 

As you can see, the ordering rules may differ between locales (you do have users both in Sweden and Germany, don't you?), and so must the order of the returned data. Your users may get frustrated when they don't find their data in the correct place in the user interface. How to cope with that? You should be able to tell the JDBC driver, on a per-connection basis, in which locale's order each query should return its rows. How do you tell this? Unfortunately, for some databases you cannot. For example, in older PostgreSQL versions (before 9.1) changing the collation order requires repeating initdb. In MySQL (4.1 and above) you can use COLLATE within the SQL statements. For example:

	SELECT name FROM customer ORDER BY name COLLATE utf8_german2_ci;

PostgreSQL 9.1:

	SELECT a < ('foo' COLLATE "fr_FR") FROM test1;
or
	SELECT * FROM tbl1 ORDER BY mycolumn COLLATE "en_US";

In Microsoft SQL Server (and probably Sybase, too?) you have to set the collation for each database field. I think this is not good behaviour, since it's the user who expects the data in his/her own locale's order, not in the order the developer has chosen. Can you set a connection-specific collation in SQL Server? Reports please.

	CREATE TABLE [user].[table] (
	         [field1] [varchar] (10) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ,
	         [field2] [varchar] (35) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
	) ON [PRIMARY]
	GO
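When the database cannot sort per locale, one fallback is to sort the rows in Java after fetching them. This is a different technique from the SQL COLLATE examples above: it uses java.text.Collator and depends on the locale data shipped with your JVM. A minimal sketch with a made-up class name, LocaleSortDemo:

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class, for illustration only.
class LocaleSortDemo {

    // Returns a copy of words sorted by the collation rules of the given locale.
    static String[] sortFor(Locale locale, String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        // "z", o-umlaut and "a"; the umlaut is written as an escape
        String[] names = {"z", "\u00F6", "a"};
        System.out.println(Arrays.toString(sortFor(Locale.forLanguageTag("sv-SE"), names)));
        System.out.println(Arrays.toString(sortFor(Locale.forLanguageTag("de-DE"), names)));
    }
}
```

Sorting in the application only works when the whole result set is fetched; with paging (LIMIT/OFFSET) the ordering still has to happen in the database.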

Determining language

How to determine language of the user? Luckily a viable solution exists: HTTP-header

	Accept-Language: en,fi;q=0.5

Read more in RFC 2616. In Java, however, you don't have to interpret the contents of the header yourself - just use javax.servlet.ServletRequest.getLocale()

There are a few caveats though...
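Outside a servlet container, or when you want to pick the best match from the languages your application actually supports, the header value can also be parsed with the standard library (Java 8 and later). This is a minimal sketch; the class AcceptLanguageDemo, the supported-language list and the fall-back to its first entry are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class AcceptLanguageDemo {

    // Parses an Accept-Language value into ranges ordered by descending weight.
    static List<Locale.LanguageRange> parse(String headerValue) {
        return Locale.LanguageRange.parse(headerValue);
    }

    // Picks the best supported locale, or the first supported one if nothing matches.
    static Locale pick(String headerValue, List<Locale> supported) {
        Locale match = Locale.lookup(parse(headerValue), supported);
        return match != null ? match : supported.get(0);
    }

    public static void main(String[] args) {
        List<Locale> supported = Arrays.asList(
                Locale.forLanguageTag("fi"), Locale.forLanguageTag("sv"));
        for (Locale.LanguageRange range : parse("en,fi;q=0.5")) {
            System.out.println(range.getRange() + " weight " + range.getWeight());
        }
        System.out.println("picked: " + pick("en,fi;q=0.5", supported));
    }
}
```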

Download

ToProperties is an ad-hoc tool for converting your properties files into the correct format.
Usage: java -jar ToProperties.jar sourcefile destfile.xml [encoding]
Where:
sourcefile has key=value pairs in any encoding of your choosing.
destfile.[properties|xml] is the destination file. The tool writes either the .properties or the .xml format, depending on the suffix of the destination file.
encoding is an optional parameter. The default encoding is UTF-8 (not the platform encoding!)

Example:
java -jar ToProperties.jar source.properties destination.xml windows-1252
This one reads source.properties in windows-1252 encoding and writes to XML Properties file destination.xml.
Download:
Source: ToProperties.java
Binary: ToProperties.jar
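For the curious, the core of such a conversion fits in a few lines of standard Java. The sketch below is NOT the source of ToProperties.java; it is an assumption of how a converter of this kind could work, using only java.util.Properties. The class name PropertiesConverterSketch is made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical sketch, not the actual ToProperties tool.
class PropertiesConverterSketch {

    // Reads key=value pairs in sourceEncoding and writes them either as a
    // classic .properties file (with backslash-u escapes) or as XML properties.
    static void convert(InputStream source, Charset sourceEncoding,
                        OutputStream dest, boolean asXml) throws IOException {
        Properties props = new Properties();
        props.load(new InputStreamReader(source, sourceEncoding));
        if (asXml) {
            props.storeToXML(dest, null, "UTF-8");
        } else {
            // store(OutputStream, ...) escapes the non-ASCII characters for us
            props.store(dest, null);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] input = "key=This is a sample key.\u00F6\u00E4\u00E5\n"
                .getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        convert(new ByteArrayInputStream(input), StandardCharsets.UTF_8, out, false);
        System.out.print(new String(out.toByteArray(), StandardCharsets.ISO_8859_1));
    }
}
```

Note that store() also writes a timestamp comment at the top of the file, so the output differs slightly from run to run.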

Tools

Currently I'm using Eclipse and jEdit. Make sure you change the default encoding for created files!

Success Stories

Nelson Antony reported success... and new problems with the Chinese Internet Explorer (the Debug app was inspired by him, actually)
Sven Schuhmacher reported success:
Hello Mr. Tomi Panula-Ontto,

I was porting an web-dictionary, written in php, to java and was
faced with some encoding problems. With your helpful information i
could solve these problems.

Thanks a lot!

Regards

Sven Schuhmacher.
Report a success story! Did this work help you? Let us hear it!

Troubleshooting aka what's wrong with my application?

In this section I'd like to tell real-life stories from the wonderful world of character encoding problems.

Just recently I was working on a project which had 4 different languages (Finnish, Swedish, English and Russian). Everything seemed to work correctly. The web application had an administrative section (believe it or not ;-) for the administrator of the site and another one for the users. The administrative section was running perfectly. Or so it seemed. The water was calm before the storm.

Then it hit. I ran into a situation where I had three different versions of the character 'ä', and by version I mean that one was correctly printed and two were the result of an encoding mess.. different kinds of messes, to be exact. I started checking everything one by one, and couldn't find a solution. I checked my own code. I checked all configurations. I tried different approaches to get the correct results. All approaches resulted only in a different kind of mess. When one of the problems was cleared, another one broke. I was getting frustrated. I rushed on to solve the problem. After three days I found out that my problem wasn't one wrong setting somewhere.. I had many little errors here and there. One problem compensated for another. Unfortunately this had resulted in seemingly correct behaviour. A lot of data had already been typed into the system, and now we realized that all the data was indeed in the database. It just wasn't Unicode at all.

What had happened? The browser sent data as UTF-8 strings to the application server, which thought that these were iso-8859-1 strings.. and converted them into Unicode and stored them in the database. This would have been easy to spot, if there had not been another problem...

When data was read from the database and sent to the browser, the app server wrote the data into the pages as if it were Unicode data written into an iso-8859-1 stream.. It did the reverse encoding, and the browser was told that this data was in UTF-8, while from the appserver's point of view it was still in iso-8859-1. This works perfectly.. until some of the localized data comes from the properties files and some is written into the JSP pages themselves.. and all hell breaks loose. I had violated many of the rules mentioned in this document:

  1. I didn't set the encoding for the server
  2. I didn't set the encoding and content-type in every JSP file (and I actually included most of the directives into every JSP files)
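The way the two mistakes cancel each other out can be reproduced in a few lines. This is a minimal sketch of the mechanism only, not code from the project in question; the class DoubleMistakeDemo and its method names are made up.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class DoubleMistakeDemo {

    // Mistake 1: the server decodes UTF-8 request bytes as iso-8859-1.
    static String misreadUtf8AsLatin1(String typedByUser) {
        byte[] wire = typedByUser.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // Mistake 2: the server writes iso-8859-1 bytes but the browser reads UTF-8.
    static String writeLatin1ReadAsUtf8(String stored) {
        byte[] wire = stored.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String typed = "\u00E4"; // a-umlaut, written as an escape
        String stored = misreadUtf8AsLatin1(typed);
        String shown = writeLatin1ReadAsUtf8(stored);
        // The database holds two characters instead of one...
        System.out.println("stored length: " + stored.length());
        // ...yet the browser shows the original, so nobody notices.
        System.out.println("looks fine in browser: " + typed.equals(shown));
    }
}
```

Text that enters through a correct path (a properties file, or a literal in a JSP page) skips the first mistake but still goes through the second one, and that is where the visible garbage comes from.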

This document is a result of my journey.

Please tell us your stories. Horror stories usually give a reason to start thinking twice.

Applications

Debug application is a simple application that tests a browser's capability to send UTF-8 characters. Please take your browsers there; it logs the results, which will help me collect data on browsers that do not support UTF-8 correctly. It's version 0.1, so treat it nice, okay?! :) Note: The browser must support the <iframe> tag
Sample application coming

Future Plans / TODO

Sample application coming
Localization
Java 5 XML Properties, Timezones, currency and text-orientation issues (req by Christopher Brown)
Do Cookies dream of an electric sh.. UTF-8? Done

Feedback

My email address is

Thanks!
Tomi Panula-Ontto