Introduction
This short guide tries to cover all the details required to write web applications that can handle the Unicode (UTF-8) character set at every step, back and forth. With Unicode, a single internationalization process can produce code that handles the requirements of all the world's markets at the same time. I won't go through the benefits thoroughly, since the Unicode Consortium has better answers than I could ever produce.
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
You should know this is not a theoretical document. This is an "On your knees, codemonkey. Hands into the dirt! NOW!" type of document. You may also be interested in the official Java Internationalization FAQ or maybe this Wikipedia article. This document started from a technical point of view on internationalization (aka ensuring one character set everywhere), but I think it has to say more about internationalization from the user's point of view - aka localization (collation, timezones, currencies, ...). Only time and feedback will tell.
Typical data flow in the web application:
Browser <=> Web Server <=> Application Server <=> Your Application <=> JDBC Driver <=> Database
If you look at the data flow, you'll soon realize that one glitch anywhere in the chain ruins the whole process. This document aims to help YOU write (web) applications that work flawlessly when it comes to character encoding issues.
I am using the Resin application server by Caucho Technology in the tips, but I hope to cover other application servers as well in the future. User feedback is more than welcome!
This material is copyrighted by the author and all the contributors. All credit to those who have contributed. Please do not copy this document; link to it instead. I'm aware that this document may have inaccuracies, and if you take a copy of it I will never be able to fix the copied document. I recommend you report all inaccuracies to me.
Editing Files
You are probably using some text editor to write your code, XML configuration files, etc. I suggest you find a decent text editor which supports UTF-8 and from then on write all files in that encoding. Accept no other encoding from anyone. Make sure your text editor also reads and writes UTF-8 files correctly. Check out the tools section.
Editing .properties Files
You shouldn't edit property files with a normal text editor. Why's that? Because in .properties files non-ASCII text is written as Unicode (\uXXXX) escapes. Here is an example:
key=This is a sample key.\u00F6\u00E4\u00E5
If your text editor can not write these files correctly, then I suggest you change your text editor. Check out the tools section.
PROPERTIES CONVERTER!
I got tired of trusting text editors. Even Eclipse doesn't always store properties files in the correct format. Maybe it's just a configuration error on my side or whatever, but I don't have time to care, so I wrote a tool which lets me write my properties file with the tools at hand and then convert it to a legitimate file. See the Download section.
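For the curious, the core of such a conversion is small. Here is a minimal sketch of the idea (not the actual ToProperties source, and the argument handling is simplified): read the source explicitly as UTF-8 and emit every non-ASCII character as a \uXXXX escape.

import java.io.*;

public class EscapeProperties {
    public static void main(String[] args) throws IOException {
        // read the source file explicitly as UTF-8, ignoring the platform default
        Reader in = new InputStreamReader(new FileInputStream(args[0]), "UTF-8");
        // Java reads .properties files as ISO-8859-1, so that's what we write
        Writer out = new OutputStreamWriter(new FileOutputStream(args[1]), "ISO-8859-1");
        int c;
        while ((c = in.read()) != -1) {
            if (c < 128) {
                out.write(c);                            // plain ASCII passes through untouched
            } else {
                out.write(String.format("\\u%04X", c));  // everything else becomes a \uXXXX escape
            }
        }
        in.close();
        out.close();
    }
}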
Default File Encoding
VERY IMPORTANT!
Remember that this also affects System.out and System.err, so if you are doing any output using them you might get incorrect results.
When a Java application opens a file it assumes the file is encoded in the default file encoding, which depends on the platform under the JVM. When you start editing files in UTF-8 you should also tell your application that all your files have UTF-8 encoding.
Here is how to do that:
java -Dfile.encoding=UTF-8 MyGreatApp
Since you are using an application server, I suggest you put that property setting somewhere in the startup script.
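Alternatively, you can make your own code immune to the default by always naming the encoding when you open a stream. A minimal sketch (the file names are just placeholders):

import java.io.*;

public class CopyUtf8File {
    public static void main(String[] args) throws IOException {
        // name the encoding explicitly, so -Dfile.encoding no longer matters for these streams
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream("input.txt"), "UTF-8"));
        PrintWriter writer = new PrintWriter(
                new OutputStreamWriter(new FileOutputStream("output.txt"), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            writer.println(line);
        }
        reader.close();
        writer.close();
    }
}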
Compiling Source Code
You need to specify the encoding of your source files, because remember: you are using UTF-8 and not the default encoding of the platform!
javac -encoding UTF-8 MyPreciousSource.java
Configuring Application Server
Let's take a look at this snippet from the resin.conf config file. The important sections are the ones discussed in the analysis below.
1  <caucho.com>
2    <java compiler='C:\j2sdk1.4.2_03\bin\javac.exe' args="-g -encoding UTF-8"/>
3    <http-server app-dir='c:\workspace\myproject\web' class-update-interval='2'
4                 character-encoding='UTF-8'>
5      <classpath encoding='UTF-8'/>
6      <jsp static-encoding='false' fast-jstl='false'/>
7      <http port='80'/>
8      <servlet-mapping url-pattern='/servlet/*' servlet-name='invoker'/>
9      <servlet-mapping url-pattern='*.jsp' servlet-name='com.caucho.jsp.JspServlet'/>
10   </http-server>
11 </caucho.com>
Analysis
- Second line: specifies the character encoding used by the source files. Okay, this may be redundant since it's mentioned in the classpath config too... but I'm getting paranoid about this.
- Take a look at the 4th line. It says that the application server should use UTF-8, for example when reading parameters from the HTTP request.
- The 5th line defines the encoding of the source code to compile in the classpath directory (auto compiling).
- The 6th line disables static encoding and fast-jstl. The fast-jstl implementation uses the iso-8859-1 character set, so naturally we wish to disable such behaviour. The disabling of static encoding may or may not be useful. I'll have to verify this.
Configuring the Web Server
Check that your web server, for example Apache, does not add or override any headers that might conflict with your application.
For example, the AddDefaultCharset directive in the Apache configuration is really an override, not a default. If you leave it on, Apache will replace the charset in the Content-Type header with iso-8859-1, which isn't really what you want.
Now, be a good codemonkey, edit your httpd.conf or site-specific config and make sure it has a line which says:
AddDefaultCharset Off
Servlets
You have to set the encoding for the request before reading request parameters or reading input using getReader(). Likewise you have to set the encoding for output before writing output using getWriter().
Here is an example:
import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class MyServlet extends HttpServlet {

    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, java.io.IOException {

        // set the encoding for the input parameters BEFORE reading any of them
        request.setCharacterEncoding("UTF-8");
        String test = request.getParameter("test");
        if( test==null )
            test = "input parameter 'test' was null";

        // set the content type AND encoding for the output
        ServletContext context = getServletContext();
        if( context.getMajorVersion()>2
            || (context.getMajorVersion()==2 && context.getMinorVersion()>=4) ) {
            // Servlet API 2.4 and newer have a separate setCharacterEncoding() on the response
            response.setContentType("text/html");
            response.setCharacterEncoding("UTF-8");
        } else {
            response.setContentType("text/html;charset=UTF-8");
        }
        PrintWriter out = response.getWriter();
        out.print( test );
    }
}
Thanks to Vesa Hiltunen for bringing this up; I forgot it in the initial version. Note that the API was slightly modified in Servlet API version 2.4.
JSP Files
VERY IMPORTANT!
Never use <%@ include file="..." %> or <jsp:include page=".."/> to include the same settings for every page, because they don't work the way you think they do. Always have these settings in every JSP file.
Updated 2009-10-02: It seems that <%@ include file="..." %> works OK now, but check the HTTP headers, just to make sure your appserver isn't playing tricks on you.
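A quick way to check those headers, if you don't want to dig out a packet sniffer, is a few lines of Java (the URL is just a placeholder):

import java.net.*;
import java.util.*;

public class ShowHeaders {
    public static void main(String[] args) throws Exception {
        // fetch the page and print every response header, including Content-Type and its charset
        URLConnection conn = new URL("http://localhost:8080/page.jsp").openConnection();
        for (Map.Entry<String, List<String>> header : conn.getHeaderFields().entrySet()) {
            System.out.println(header.getKey() + ": " + header.getValue());
        }
    }
}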
Let's take a look at the JSP file fragment.
1 <%@page
2    contentType="text/html; charset=UTF-8"
3    pageEncoding="UTF-8"
4 %>
Analysis
- The second line defines the content type and encoding of the output.
- The third line defines the encoding of the JSP file itself, which naturally should be UTF-8.
Updated 2009-10-02
Brian asked about my warning:
"Can you please explain why this doesn't work and also what the right way is to achieve this?"
Suppose you have two JSP files, page.jsp and header.jsp, and your plan is to keep the configuration in one file so that you don't need to modify multiple files if anything changes. Let's assume that for some odd reason this application handles everything in the iso-8859-1 charset, but since your projects usually use UTF-8, your server has been set to assume UTF-8 by default.
header.jsp:
1 <%@page
2    contentType="text/html; charset=iso-8859-1"
3    pageEncoding="iso-8859-1"
4 %>
page.jsp:
1 <jsp:include page="header.jsp" />
2 ..some other stuff..
Our browser requests page.jsp
as follows (simplified)
1 GET /page.jsp HTTP/1.0
2 Host: test.mydom.tld
You would expect to get the iso-8859-1 charset in the Content-Type header, since header.jsp was included into page.jsp? Sorry, it didn't work out for you. The response comes in UTF-8, as instructed by the server default, instead of the expected iso-8859-1.
HTTP request sent, awaiting response... HTTP/1.1 200 OK
Date: Fri, 02 Oct 2009 15:06:22 GMT
Server: Apache
Cache-Control: private
Set-Cookie: JSESSIONID=abcFUCf5XWaXFhikzYvqs; path=/; HttpOnly
Content-Type: text/html; charset=UTF-8
Content-Length: 12
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
The problem is that <jsp:include page=".."/> is internally a separate request, and so is its response. Headers set on that inner response are discarded and never seen by the browser.
However, it seems that <%@ include file="header.jsp" %> does the trick. It didn't when I originally wrote this document.
I wonder if Caucho[Resin] fixed some bug related to this? Anyway, check out how your application server -really- behaves.
Self awareness never killed anyone, ok? :)
HTML Pages
Let's look at the code once again.
1 <html>
2 <head>
3   <title>My Precious Form</title>
4   <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
5 </head>
6 <body>
7 ...
8 </body>
9 </html>
Analysis
- Look at the 4th line. This is needed to tell the browser that we are sending HTML in UTF-8 encoding (keep in mind you have to write the document in UTF-8, too!). It is probably unnecessary, since the JSP already sets it, but I suggest you still put it there; the browser may fail to recognize the HTTP header.
JavaScript
So you wish to do some JavaScript output? Can do!
...and as always, caveats exist. You have to encode the string on the server side before using it, and decode it with the JavaScript function decodeURIComponent() before use, except when you are going to use it in a URL string. Do not use the JavaScript function unescape(), since it does not decode UTF-8 escape sequences correctly!
Example #1: in document.write()
<script type="text/javascript"> document.write( decodeURIComponent( "<%= JavaScriptUTF8Encoder.encode("some trashy scandinavian characters like öäå" ) %>" ) ); </script>
Example #2: in the URL strings
<script type="text/javascript"> location.replace('http://www.google.com/search?q=' + "<%= java.net.URLEncoder.encode("testöäå", "UTF-8") %>"); </script>
JavaScriptUTF8Encoder.java is really a slightly modified URLUTF8Encoder.java by the W3.org.
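The linked encoder is the authoritative one; the idea behind it, roughly, is to percent-encode the UTF-8 bytes of the string so that decodeURIComponent() can put them back together. A minimal sketch of that idea (not the author's exact class):

import java.io.UnsupportedEncodingException;

public class JavaScriptUTF8Encoder {
    // Percent-encode every byte of the string's UTF-8 representation,
    // so that JavaScript's decodeURIComponent() can rebuild the original text.
    public static String encode(String s) {
        try {
            StringBuilder sb = new StringBuilder();
            for (byte b : s.getBytes("UTF-8")) {
                int v = b & 0xFF;
                if ((v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z') || (v >= '0' && v <= '9')) {
                    sb.append((char) v);                   // unreserved ASCII passes through as-is
                } else {
                    sb.append(String.format("%%%02X", v)); // everything else becomes %XX
                }
            }
            return sb.toString();
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e);                 // UTF-8 is always available
        }
    }
}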
Updated: the <script> tag now has a 'charset' attribute.
Example #3: script tag
<script type="text/javascript" src="myprecious.js" charset="UTF-8"></script>
TODO: I have not yet tested whether the 'charset' attribute affects the behaviour of JavaScript functions such as unescape().
Cookies
Just a few days ago (today: 24th Sep 2007) I started wondering whether you can put UTF-8 data into a cookie... and yes, it certainly appears so, although I have not tested this very thoroughly.
RFC 2965 says:
The VALUE is opaque to the user agent and may be anything the origin server chooses to send, possibly in a server-selected printable ASCII encoding.
However, make no mistake: the value of the optional comment attribute of a cookie must be in UTF-8 encoding.
I was fooling around with Tomcat 6.0.14 when I accidentally tried to set a cookie whose name had Scandinavian characters. Tomcat disagreed:
java.lang.IllegalArgumentException: Cookie name "åäö" is a reserved token
Further investigation showed that not all characters are allowed in a cookie name.
Quoting RFC 2965
NAME=VALUE Required. The name of the state information ("cookie") is NAME, and its value is VALUE. NAMEs that begin with $ are reserved for other uses and must not be used by applications.
I tried to find a list of allowed characters for cookie names, but couldn't find a clear answer. This isn't particularly important, as developers usually choose simple names for cookies and thus avoid the problem altogether.
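If you'd rather not depend on how each container treats non-ASCII cookie values, one workaround (my own suggestion, not something from the RFC) is to percent-encode the value so that only ASCII ever travels in the Set-Cookie header:

import java.io.IOException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletResponse;

public class CookieUtil {
    // Keep the cookie NAME plain ASCII; percent-encode the UTF-8 VALUE so only ASCII goes on the wire.
    public static void addUtf8Cookie(HttpServletResponse response, String name, String value)
            throws IOException {
        response.addCookie(new Cookie(name, URLEncoder.encode(value, "UTF-8")));
    }

    // Decode the value back with the same charset when reading the cookie.
    public static String readUtf8Cookie(Cookie cookie) throws IOException {
        return URLDecoder.decode(cookie.getValue(), "UTF-8");
    }
}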
HTTP GET
If you are using HTTP GET from a form, you can follow the instructions for HTTP POST, but be warned: GET can carry only a limited amount of data in the URL. Some recommendations suggest a limit of 4 kilobytes. That limit fills up surprisingly quickly when dealing with Asian languages. I recommend you use HTTP POST for forms and HTTP GET only for links.
However, if you want to send data in a link, you have to encode the data yourself. Take a look at URLUTF8Encoder.java from W3.org.
Example:
<a href="http://www.google.com/search?&q=test%C3%A5%C3%B6%C3%A4">My Search>/a>
HTTP POST
Let's take a look at the form
1 <html>
2 <head><title>My Precious Form</title></head>
3 <body>
4 <form method="post" accept-charset="UTF-8" action="..."
5       enctype="multipart/form-data">
6 <input type="submit">
7 </form>
8 </body>
9 </html>
Analysis
- The 4th line tells the browser to send any input in the UTF-8 character set.
- The enctype on the 5th line not only enables file uploads, but is also said to have better Unicode handling. Well, at least that's what people say on the net. I have not found it necessary myself. I'd like to know more about the problems that have required this setting. Feedback, anyone?
Database Server
First rule: Choose a database server that can handle UTF-8 or Unicode character encoding.
Once I had to take part in a project which had to use a MySQL version that had only iso-8859-1 encoding available, and we had to support data in both the iso-8859-1 and KOI-8R character encodings. It was... interesting. If I ever meet a project like that again I'll shoot on sight.
I recommend using the PostgreSQL server. It's versatile, fast, free and has excellent Unicode support.
However, you should know it lacks some of the collation features you might need in your multinational applications.
Update: This is no longer the case. PostgreSQL 9.1 rocks!
CREATE DATABASE myprecious WITH ENCODING = 'UNICODE';
JDBC Driver
In some database servers it's possible to have a different character encoding for each connection. I have not seen such JDBC drivers in a while, but they used to exist. Reports, anyone?
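MySQL's Connector/J is one driver where you can state the connection character set in the JDBC URL; the host, database and credentials below are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;

public class Utf8Connection {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");   // older drivers need explicit registration
        // ask the driver to use UTF-8 on this particular connection
        String url = "jdbc:mysql://localhost/myprecious"
                   + "?useUnicode=true&characterEncoding=UTF-8";
        Connection conn = DriverManager.getConnection(url, "user", "password");
        System.out.println("connected, closed=" + conn.isClosed());
        conn.close();
    }
}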
Collation
So, what is collation? Collation is the assembly of written information into a standard order - in other words, sorting. What about that standard order? Well, that depends. Here is an example taken from the default Unicode Collation Algorithm:
Swedish: z < ö
German:  ö < z
As you can see, the ordering rules may differ between locales (you do have users both in Sweden and Germany, don't you?), so the order of the returned data has to differ too. Your users may get frustrated when they can't find their data in the expected place in the user interface.
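You can demonstrate the same difference inside Java itself with java.text.Collator (this is my own illustration, not part of the original tips):

import java.text.Collator;
import java.util.Locale;

public class CollationDemo {
    public static void main(String[] args) {
        Collator swedish = Collator.getInstance(new Locale("sv", "SE"));
        Collator german  = Collator.getInstance(Locale.GERMAN);
        // Swedish sorts ö after z, German sorts ö before z
        System.out.println("Swedish z < ö : " + (swedish.compare("z", "ö") < 0));
        System.out.println("German  ö < z : " + (german.compare("ö", "z") < 0));
    }
}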
How to cope with that? You should be able to tell the JDBC driver on a connection basis in what locale order each query should return rows.
How do you tell this? Unfortunately, for some databases you can not. For example, in older PostgreSQL versions changing the collation order required repeating initdb (PostgreSQL 9.1 changed this, see below).
In MySQL (4.1 and above) you can use COLLATE within the SQL statements. For example:
SELECT name FROM customer ORDER BY name COLLATE utf8_german2_ci;
PostgreSQL 9.1:
SELECT a < ('foo' COLLATE "fr_FR") FROM test1;
or
SELECT * FROM tbl1 ORDER BY mycolumn COLLATE "en_US";
In Microsoft SQL Server (and probably Sybase, too?) you have to set the collation for each database field. I don't think this is good behaviour, since it's still the user who expects the data in his/her own locale order, not in the order the developer has chosen. Can you set a connection-specific collation in SQL Server? Reports, please.
CREATE TABLE [user].[table] (
    [field1] [varchar] (10) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ,
    [field2] [varchar] (35) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
) ON [PRIMARY]
GO
Determining language
How do you determine the language of the user? Luckily a viable solution exists: the HTTP header
Accept-Language: en,fi;q=0.5
Read more in RFC 2616. In Java, however, you don't have to interpret the contents of the header yourself - just use javax.servlet.ServletRequest.getLocale().
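In a servlet this boils down to a couple of lines; the bundle name and message key below are just examples:

import java.io.IOException;
import java.io.PrintWriter;
import java.util.Locale;
import java.util.ResourceBundle;
import javax.servlet.http.*;

public class GreetingServlet extends HttpServlet {
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        // getLocale() parses the Accept-Language header for you
        Locale locale = request.getLocale();
        ResourceBundle messages = ResourceBundle.getBundle("messages", locale);
        response.setContentType("text/html");
        response.setCharacterEncoding("UTF-8");
        PrintWriter out = response.getWriter();
        out.println(messages.getString("greeting"));
    }
}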
There are a few caveats though...
- Do users even know they have the possibility to change the default language in their browser? Mostly they just go with the default (usually English).
- It's acceptable to figure out the language only; deriving everything else (like date, time and number formatting) from it goes too far - despite the language setting, the user may very well be a tourist at some information kiosk, or someone other than the regular user of the computer.
- Ask the user, and preferably give them the possibility to choose a language and other settings. I repeat this: give them the possibility to change the settings at any time. I can't stress this enough. Just recently I used a site that remembered the setting and didn't give me the possibility to change it later. It wasn't particularly funny to be spending almost a hundred euros at that site without being able to switch back to my own language!
Download
ToProperties is an ad-hoc tool for converting your properties files into the correct format.
Usage: java -jar ToProperties.jar sourcefile destfile.xml [encoding]
Where:
sourcefile has key=value pairs in any encoding of your choosing.
destinationfile.[properties|xml] is the destination file. The tool writes either .properties or .xml format depending on the suffix of the destination file.
encoding is an optional parameter. The default encoding is UTF-8 (not the platform encoding!)
Example:
java -jar ToProperties.jar source.properties destination.xml windows-1252
This one reads source.properties in windows-1252 encoding and writes the XML properties file destination.xml.
Download:
Source: ToProperties.java
Binary: ToProperties.jar
Tools
Currently I'm using Eclipse and Jedit. Make sure you change the default encoding for created files!
Success Stories
- Nelson Antony reported success... and new problems with the Chinese Internet Explorer (the Debug app was actually inspired by him)
- Sven Schuhmacher reported success:
Hello Mr. Tomi Panula-Ontto, I was porting an web-dictionary, written in php, to java and was faced with some encoding problems. With your helpful information i could solve these problems. Thanks a lot! Regards Sven Schuhmacher.
Troubleshooting aka what's wrong with my application?
In this section I'd like to tell real-life stories from the wonderful world of character encoding problems.
Just recently I was working on a project which has 4 different languages (Finnish, Swedish, English and Russian). Everything seemed to work correctly. The web application had an administrative section (believe it or not ;-) for the administrator of the site and another one for the users. The administrative section was running perfectly. Or so it seemed. The water was calm before the storm.
Then it hit. I ran into a situation where I had three different versions of the character 'ä', and by versions I mean that one was printed correctly and two were the result of an encoding mess... different kinds of messes, to be exact. I started checking everything one by one and couldn't find a solution. I checked my own code. I checked all the configurations. I tried different approaches to get correct results. Every approach resulted only in a different kind of mess. When one problem was cleared, another one broke. I was getting frustrated. I rushed to solve the problem. After three days I found out that my problem wasn't one setting somewhere being wrong; I had many little errors here and there. One problem compensated for another. Unfortunately this had resulted in seemingly correct behaviour. A lot of data had already been typed into the system, and now we realized that all the data was indeed in the database - it just wasn't Unicode at all.
What had happened? The browser sent data as UTF-8 strings to the application server, which thought they were iso-8859-1 strings... and converted them to Unicode and stored them in the database. This would have been easy to spot, if there had not been another problem: when the data was read from the database and sent to the browser, the app server wrote the data into the pages as if it were Unicode data written to an iso-8859-1 stream. It did the reverse encoding, and the browser was told that this data is in UTF-8, but from the appserver's point of view it was still in iso-8859-1. This works perfectly... until some of the localized data comes from the properties files and some is written into the JSP pages themselves... and all hell breaks loose. I had violated many of the rules mentioned in this document:
- I didn't set the encoding for the server
- I didn't set the encoding and content-type in every JSP file (and I actually included most of the directives into every JSP file)
This document is a result of my journey.
Please tell us your stories. Horror stories usually give a reason to start thinking twice.
Applications
- The Debug application is a simple application that tests a browser's capability of sending out UTF-8 characters. Please take your browsers there; it logs the results, which will help me collect data on browsers that do not support UTF-8 correctly. It's version 0.1, so treat it nicely, okay?! :) Note: the browser must support the <iframe> tag.
- Sample application coming
Future Plans / TODO
- Sample application coming
- Localization
- Java 5 XML Properties, Timezones, currency and text-orientation issues (req by Christopher Brown)
- Do Cookies dream of an electric sh.. UTF-8? Done
Links
Feedback
My email address is
Thanks!
Tomi Panula-Ontto