Hello World and nearly all the other examples found in popular PHP tutorials and references assume a restricted form of English for their "natural language" communications. But PHP is capable of more. With the right techniques, PHP effectively handles not just the occasional accented character found in English names and loanwords but the characters of the world's most common languages: German, Russian, Chinese, Japanese, and many more.
Run this small PHP program:
Listing 1. Coding Russian output
$q = "&#1047;&#1076;&#1088;&#1072;&#1074;". "&#1089;&#1090;&#1074;&#1091;&#1081;". "&#1090;&#1077;";
print html_entity_decode($q, ENT_NOQUOTES, 'UTF-8')."\n";
With any luck, the output you see will be Здравствуйте, Russian for "Hello" or "Greetings." Too often, dealing in PHP with characters other than those of the standard English alphabet has been a matter of luck and even mystery. Even though a great deal has been written on such subjects as character encoding and internationalization, much of it has been wrong, or at least outdated, and most of the rest is tied to a particular configuration of PHP. The aim of this article is to present only the basics of Unicode handling in PHP, but to do so with enough care and completeness to provide a firm foundation for any "international programming" you need to do.
There's a lot going on behind the scenes
This apparently simple two-line program involves a great deal of context. First, I assume PHP V5. While it's possible to manage non-English characters with PHP V4, it generally involves nonstandard extensions, and is almost certainly a misplaced effort in 2007. PHP V6, on the other hand, is scheduled to solve so many character encoding problems as to supersede most of the techniques shown here. With PHP V6, it's hoped that Unicode strings will just work.
Even with a standard modern installation of PHP V5, there's no guarantee you'll see the same output I do. During my development, I've come across a few browsers that don't appear to access Cyrillic fonts and, thus, represent the output as the Latin-transliterated Zdravstvujte, rather than Здравствуйте.
Code format
The PHP source code in this article is designed to work for the great majority of developers. As much as possible, it applies to any standard PHP V5 installation. To maintain a focus on the essentials, the source code is presented without enclosing <?php and ?> boilerplate tags. Output is most often targeted for text/plain. If you prefer, think of Listing 1 as an abbreviation for Listing 2.
Listing 2. Coding Russian output, with more complete tagging
<?php
// The next two lines are necessary only for
// unusual configurations, but can only help.
mb_language('uni');
mb_internal_encoding('UTF-8');
$q = "&#1047;&#1076;&#1088;&#1072;&#1074;". "&#1089;&#1090;&#1074;&#1091;&#1081;". "&#1090;&#1077;";
print "<html>". html_entity_decode($q, ENT_NOQUOTES, 'UTF-8'). "</html>";
?>
Concentrating on PHP V5 and modern standard browser installations covers the great majority of commercial situations. Nearly all the techniques described here apply with any configuration of php.ini, locale, font collection, etc.
Suppose, then, that we have a consistent platform for our experiments. What do we do with it? The most basic cases include:
- Display of a message (prompt, ...) in a language other than English
- Reception of user input from TEXTAREAs and TEXT INPUTs
- Storage of character data in files and databases and its retrieval
- Simple string operations
Let's see what's involved.
Two challenges
There are a couple of immediate difficulties. To move past the limitations of the standard English alphabet, even to maintain the accented characters that occasionally turn up in well-formatted English ("Ramón," "Gödel," "apéritif"), the correct solution for our purposes is Unicode, encoded as UTF-8. Even if you've been introduced to Unicode (see Resources), it's a demanding subject, with complex specialized definitions, including "glyph," "code point," "abstract character," and many more.
Development with Unicode has the same "bootstrapping" challenge common in network programming, except worse: Instead of needing only a working server and client before results begin to look sensible, effective Unicode programming requires:
- An "input method," almost certainly one that goes well beyond the characters reachable on your day-to-day keyboard
- An application or computing language that properly handles Unicode data
- Correctly installed fonts and other facilities to display in human-readable form the characters you've computed
If you practice much international work, you might find yourself investing in special keyboards, editors, fonts, etc., just to be able to see what you're doing.
A second major difficulty of this sort of programming is that PHP is broken. More precisely, PHP was broken. It was not originally designed to handle data beyond ASCII. PHP V6 should fix these deficiencies and bring PHP to the level of such languages as Python, where strings transparently embed Unicode data. In the meantime, though, Unicode programming with PHP requires care and attention.
Many online forums and the few PHP books that mention Unicode give advice that's useful only with uncommon extensions or provide code that works only for some configurations. That's one of the reasons this article began with Listing 1: html_entity_decode is widely installed correctly, and rarely overloaded. While the trick of representing Unicode data as HTML numerically expressed entities makes for clumsy source code, it's reliable and easy to synthesize from standard Unicode tables. The same output can even more compactly be coded as:
$r = "Здравствуйте";
print "$r\n";
In this form, however, the source code itself is not seven- or even eight-bit "clean," and many editors, configuration management systems, and other development tools are likely to mangle it. One of the consequences is the mystery mentioned above: Programs that appear to work or fail capriciously. Another variation worth a moment's consideration is this:
$q = "&#x417;&#x434;&#x440;&#x430;&#x432;". "&#x441;&#x442;&#x432;&#x443;&#x439;". "&#x442;&#x435;";
print html_entity_decode($q, ENT_NOQUOTES, 'UTF-8')."\n";
This is a valuable alternative to Listing 1 for those occasions when one is working with a Unicode character table expressed in hexadecimal, rather than decimal integers.
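As a quick check that the decimal and hexadecimal spellings are interchangeable, the sketch below decodes both and compares the results. The helper name code_points_to_entities is hypothetical, introduced here only to illustrate how entities can be synthesized from a Unicode table; it is not part of PHP or of the listings above.

```php
<?php
// Hypothetical helper (an assumption for illustration): build decimal
// numeric entities from a list of Unicode code points.
function code_points_to_entities(array $code_points)
{
    $entities = "";
    foreach ($code_points as $code_point) {
        $entities .= "&#" . $code_point . ";";
    }
    return $entities;
}

// The same word written with decimal and with hexadecimal entities.
$decimal = "&#1047;&#1076;&#1088;&#1072;&#1074;&#1089;&#1090;&#1074;&#1091;&#1081;&#1090;&#1077;";
$hex = "&#x417;&#x434;&#x440;&#x430;&#x432;&#x441;&#x442;&#x432;&#x443;&#x439;&#x442;&#x435;";

$from_decimal = html_entity_decode($decimal, ENT_NOQUOTES, 'UTF-8');
$from_hex = html_entity_decode($hex, ENT_NOQUOTES, 'UTF-8');

// Both spellings decode to the same UTF-8 byte sequence.
var_dump($from_decimal === $from_hex);

// strlen counts bytes: each of these Cyrillic letters takes two bytes in UTF-8.
print strlen($from_decimal) . "\n";
```

Because the comparison is made on the decoded bytes, it does not depend on fonts or on how a terminal or browser renders the result.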
PHP capabilities
For anything beyond the most straightforward Unicode manipulations, I rely on a couple of convenience functions, illustrated in Listing 3 and output in Listing 4.
Listing 3. Converting between displayable UTF-8 and debuggable Unicode codes
function utf8_to_unicode_code($utf8_string)
{
    $expanded = iconv("UTF-8", "UTF-32", $utf8_string);
    return unpack("L*", $expanded);
}
function unicode_code_to_utf8($unicode_list)
{
    $result = "";
    foreach($unicode_list as $key => $value) {
        $one_character = pack("L", $value);
        $result .= iconv("UTF-32", "UTF-8", $one_character);
    }
    return $result;
}
$q = "&#1047;&#1076;&#1088;&#1072;&#1074;&#1089;". "&#1089;&#1090;&#1074;&#1091;&#1081;". "&#1090;&#1077;";
$r = html_entity_decode($q, ENT_NOQUOTES, 'UTF-8');
$s = utf8_to_unicode_code($r);
$t = unicode_code_to_utf8($s);
print "$r\n";
print_r($s);
print "$t\n";
Listing 4. Output from running Listing 3
Здравсствуйте
Array
(
    [1] => 65279
    [2] => 1047
    [3] => 1076
    [4] => 1088
    [5] => 1072
    [6] => 1074
    [7] => 1089
    [8] => 1089
    [9] => 1090
    [10] => 1074
    [11] => 1091
    [12] => 1081
    [13] => 1090
    [14] => 1077
)
Здравсствуйте
Notice that all the source code and everything printed apart from the Russian string is conventionally displayable and, in fact, seven-bit ASCII, so that it is easy to copy, e-mail, and otherwise process with typical development tools. Still another way to output the same Russian word is with:
$l = array(1047, 1076, 1088, 1072, 1074, 1089, 1089, 1090, 1074, 1091, 1081, 1090, 1077);
print unicode_code_to_utf8($l)."\n";
Notice that, as long as your data stay on one machine, it's legitimate to skip over the first integer value of 65279, the byte order marker (BOM). BOM is documented in Resources as an aspect of Unicode that's not specific to PHP and won't be mentioned further here.
These are elementary manipulations, obvious to any experienced PHP programmer. It's important to make them explicit, though, because so much of what's already written about PHP is cryptic and nonportable. All other treatments of Unicode for PHP I've found reasonably treat PHP as an engine for pushing characters from one place to another. The emphasis is on passing Unicode through from keyboard to database to screen, so there's no need to examine how the strings look within PHP itself. That certainly streamlines code, and the final forms of your production applications might never need HTML entities or UTF-32 conversions.
I've found these low-level techniques invaluable, though, for all the times that programming does not go smoothly: when the database and your XML editor, for example, can't agree on an encoding, and the only overt evidence you have is entries that print as "????????" In such cases, it's a great help to work with individual characters in their various human-readable renditions.
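When the data do need to move between machines, one way to sidestep the BOM and byte-order question is to name the byte order explicitly. The following is a minimal sketch under that assumption, using mb_convert_encoding with UTF-32BE in place of the iconv calls in Listing 3; the function names are hypothetical and the mbstring extension is assumed to be loaded.

```php
<?php
// Sketch only: these helper names are hypothetical, not from Listing 3.
// Naming the byte order (UTF-32BE) means no BOM is produced, and unpack's
// "N" format reads big-endian 32-bit values regardless of the host CPU.
function utf8_to_code_points($utf8_string)
{
    $expanded = mb_convert_encoding($utf8_string, 'UTF-32BE', 'UTF-8');
    // unpack() returns a 1-indexed array; array_values() renumbers from 0.
    return array_values(unpack('N*', $expanded));
}

function code_points_to_utf8(array $code_points)
{
    $packed = '';
    foreach ($code_points as $code_point) {
        $packed .= pack('N', $code_point);
    }
    return mb_convert_encoding($packed, 'UTF-8', 'UTF-32BE');
}

// Show the code points of a short string next to its raw UTF-8 bytes.
$word = html_entity_decode("&#1047;&#1076;", ENT_NOQUOTES, 'UTF-8');
print_r(utf8_to_code_points($word));
print bin2hex($word) . "\n";
```

Printing the bytes with bin2hex alongside the code points is often enough to tell whether a string that displays as question marks is valid UTF-8 that lacks a font, or data that was damaged earlier in the pipeline.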
Programming considerations
As mentioned, it's possible to make PHP work with Unicode in several ways, including extensions to PHP, different encodings, etc. Unless you're expert, though, I recommend against trying to decide between these many possibilities. You're almost certain to achieve the best results if you focus on this single, consistent target:
- Explicit use of UTF-8, marked with:
- "mb_language('uni'); mb_internal_encoding('UTF-8');" at the top of your scripts
- Content-type: text/html; charset=utf-8 in the HTTP header, by way of .htaccess, header() or Web server configuration
- <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> and <form accept-charset="utf-8"> in HTML markup
- CREATE DATABASE ... DEFAULT CHARACTER SET utf8 COLLATE utf8 ... ENGINE ... CHARSET=utf8 COLLATE=utf8_unicode_ci is a typical sequence for a MySQL instance, with comparable expressions for other databases
- SET NAMES 'utf8' COLLATE 'utf8_unicode_ci' is a valuable directive for PHP to send MySQL immediately after connecting
- In php.ini, assign default_charset = UTF-8
- Replacement of string functions, such as strlen and strtolower, with mb_strlen and mb_convert_case
- Replacement of mail and colleagues with mb_send_mail, etc.; while Unicode-aware e-mail is an advanced topic beyond the scope of this introduction, the use of mb_send_mail is a good starting point
- Use of multibyte regular expressions functions (see Resources)
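A few of these recommendations can be seen side by side in the short sketch below. It assumes the mbstring extension is loaded. For the regular-expression step it uses PCRE's /u modifier rather than the mb_ereg family, purely as a substitute that is available in most builds.

```php
<?php
// Sketch, assuming the mbstring extension is available.
mb_language('uni');
mb_internal_encoding('UTF-8');

$word = html_entity_decode(
    "&#1047;&#1076;&#1088;&#1072;&#1074;&#1089;&#1090;&#1074;&#1091;&#1081;&#1090;&#1077;",
    ENT_NOQUOTES,
    'UTF-8'
);

// strlen counts bytes; mb_strlen counts characters.
$byte_count = strlen($word);
$character_count = mb_strlen($word, 'UTF-8');
print "$byte_count bytes, $character_count characters\n";

// Case conversion that understands Cyrillic, unlike plain strtolower.
$lower = mb_convert_case($word, MB_CASE_LOWER, 'UTF-8');
$upper = mb_convert_case($word, MB_CASE_UPPER, 'UTF-8');
print "$lower\n$upper\n";

// Swapped-in technique: PCRE with the /u modifier treats both pattern
// and subject as UTF-8, so \p{Cyrillic} matches whole characters.
$is_cyrillic = preg_match('/^\p{Cyrillic}+$/u', $word) === 1;
var_dump($is_cyrillic);
```

Passing the encoding explicitly to each mb_ function, as done here, keeps the result independent of whatever mb_internal_encoding happens to be set elsewhere.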
An ellipsis function I often use provides a small example of how to work with multibyte string functions. The original version of this function was close to:
Listing 5. Conventional truncation
function ell_truncate($string, $permitted_length)
{
    if (strlen($string) <= $permitted_length)
        return $string;
    $ellipsis = "...";
    return substr_replace($string, $ellipsis, $permitted_length - strlen($ellipsis));
}
Applied to "a long explanation" with a permitted length of 10, the result is "a long ...", while increasing the length to 30 returns the original string. This is handy for quick abbreviation of titles, for example. Following is an illustration of a more Unicode-savvy solution.
Listing 6. Better ellipses
function mb_ell_truncate($string, $permitted_length)
{
    // Count characters, not bytes, when deciding whether to truncate.
    if (mb_strlen($string) <= $permitted_length)
        return $string;
    $ellipsis = html_entity_decode("&hellip;", ENT_NOQUOTES, 'UTF-8');
    return mb_substr($string, 0, $permitted_length - mb_strlen($ellipsis)). $ellipsis;
}
$q = "&#1047;&#1076;&#1088;&#1072;&#1074;". "&#1089;&#1090;&#1074;&#1091;&#1081;". "&#1090;&#1077;";
$q = html_entity_decode($q, ENT_NOQUOTES, 'UTF-8');
print mb_ell_truncate($q, 8)."\n";
This uses a standard typography for the ellipsis and correctly counts the characters of the string to abbreviate in all combinations of PHP configuration.
All these items constitute only a starting point for Unicode programming. Plenty of larger challenges remain, including:
- Not all languages make the upper-case/lower-case distinction
- In many, "alphabetization" isn't meaningful, so sorting has a different interpretation from in English
- The same two characters might sort in different orders depending on the language in which they're written
- Security multiplies in complexity; what you see as "abc" might be completely different values from the usual English letters, which happen to be printed the same way
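The last point can be made concrete with a short sketch. It compares the Latin string "abc" with a look-alike whose first letter is the Cyrillic character U+0430. The helper is_plain_ascii is a hypothetical name introduced for illustration, and a byte-range check like this is only one simple way to notice the difference, not a complete defense.

```php
<?php
// Hypothetical helper: true when the string contains only 7-bit ASCII bytes.
function is_plain_ascii($string)
{
    return preg_match('/^[\x00-\x7F]*$/', $string) === 1;
}

// Latin "a" is U+0061; Cyrillic "а" is U+0430. Most fonts draw them alike.
$latin = "abc";
$mixed = html_entity_decode("&#1072;bc", ENT_NOQUOTES, 'UTF-8');

// The two strings look the same on screen but are not equal.
var_dump($latin === $mixed);

// The raw bytes make the difference visible.
print bin2hex($latin) . "\n";
print bin2hex($mixed) . "\n";

var_dump(is_plain_ascii($latin));
var_dump(is_plain_ascii($mixed));
```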
These issues are shared by most Unicode-capable computing languages. The point of this article is to ensure that you understand the fundamentals in sufficient depth to have confidence to attack more advanced topics. Remember: If you're having to work hard or do tricky coding in handling Unicode, you're probably doing something wrong. PHP V5 and the tips above are designed to make your Unicode programming simple.
http://www.ibm.com/developerworks/library/os-php-unicode/index.html