Here is a small auto-detect check for encodings that decides whether utf8_encode() is necessary. It can easily be adapted to ISO-8859-1 scripts as well, to decide whether utf8_decode() is needed.
preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $s);
Because preg_match() can be unreliable with larger strings $s, here is the fixed function, called autoencode: http://mobile-website.mobi/php-utf8-vs-iso-8859-1-59
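The decision described in this note can also be sketched without a regular expression. The helper name `to_utf8_if_needed` is made up for this example, it needs the mbstring extension, and it assumes that anything which is not valid UTF-8 is ISO-8859-1, which will not hold for every data source.

```php
<?php
// Hypothetical helper: return $s unchanged if it is already valid UTF-8,
// otherwise treat it as ISO-8859-1 (an assumption) and convert it.
function to_utf8_if_needed(string $s): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

echo bin2hex(to_utf8_if_needed("caf\xE9")), "\n";     // ISO-8859-1 input
echo bin2hex(to_utf8_if_needed("caf\xC3\xA9")), "\n"; // already UTF-8
```

Both calls should yield the same UTF-8 bytes, so applying the helper twice does not double-encode.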
utf8_encode
(PHP 4, PHP 5)
utf8_encode — Encodes an ISO-8859-1 string to UTF-8
Description
string utf8_encode ( string $data )
This function encodes the string data to UTF-8, and returns the encoded version. UTF-8 is a standard mechanism used by Unicode for encoding wide character values into a byte stream. UTF-8 is transparent to plain ASCII characters, is self-synchronizing (meaning it is possible for a program to figure out where in the byte stream characters start) and can be used with normal string comparison functions for sorting and such. PHP encodes UTF-8 characters in up to four bytes, like this:
| bytes | bits | representation |
|---|---|---|
| 1 | 7 | 0bbbbbbb |
| 2 | 11 | 110bbbbb 10bbbbbb |
| 3 | 16 | 1110bbbb 10bbbbbb 10bbbbbb |
| 4 | 21 | 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb |
Each b represents a bit that can be used to store character data.
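As a worked illustration of the table above, the following sketch packs a code point into those byte layouts by hand. The function name `codepoint_to_utf8` is hypothetical and not part of PHP; it is only meant to show where the data bits go.

```php
<?php
// Hypothetical helper: encode one Unicode code point following the table above.
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a valid Unicode scalar value');
    }
    if ($cp <= 0x7F) {             // 1 byte:  0bbbbbbb
        return chr($cp);
    }
    if ($cp <= 0x7FF) {            // 2 bytes: 110bbbbb 10bbbbbb
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {           // 3 bytes: 1110bbbb 10bbbbbb 10bbbbbb
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(codepoint_to_utf8(0xE9)), "\n";   // c3a9   (U+00E9, 2 bytes)
echo bin2hex(codepoint_to_utf8(0x20AC)), "\n"; // e282ac (U+20AC, 3 bytes)
```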
Parameters
- data: An ISO-8859-1 string.
Return Values
Returns the UTF-8 translation of data.
utf8_encode
rabby
28-Apr-2009 08:29
bassam at saprinna dot com
28-Apr-2009 03:17
This function converts a windows-1256 string (or a serialized array of such strings) to UTF-8, for example before saving it to MySQL:
<?php
function convert_charset($item)
{
if ($unserialize = unserialize($item))
{
foreach ($unserialize as $key => $value)
{
$unserialize[$key] = @iconv('windows-1256', 'UTF-8', $value);
}
$serialize = serialize($unserialize);
return $serialize;
}
else
{
return @iconv('windows-1256', 'UTF-8', $item);
}
}
?>
mrezair at azarbod dot com
23-Mar-2009 11:49
I found this little function very useful for fixing strings that are not in UTF-8 but need to be converted:
<?php
// Converts the string to UTF-8 unless it is already valid UTF-8
function fixEncoding($in_str)
{
$cur_encoding = mb_detect_encoding($in_str) ;
if($cur_encoding == "UTF-8" && mb_check_encoding($in_str,"UTF-8"))
return $in_str;
else
return utf8_encode($in_str);
} // fixEncoding
?>
dan at birminghampr dot co dot uk
19-Mar-2009 06:31
I use a function like this, rather than utf8_encode() alone, for fixing the encoding of unknown data, for example the contents of get_meta_tags():
<?php
function FixEncoding($x){
if(mb_detect_encoding($x)=='UTF-8'){
return $x;
}else{
return utf8_encode($x);
}
}
?>
rogeriogirodo at gmail dot com
19-Mar-2009 05:54
This function may be useful to encode array keys and values (it first checks whether the data is already UTF-8):
<?php
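function to_utf8($in)
{
if (is_array($in)) {
$out = array(); // start empty so an empty input array still returns an array
foreach ($in as $key => $value) {
$out[to_utf8($key)] = to_utf8($value);
}
return $out;
} elseif(is_string($in)) {
// strings already detected as UTF-8 are returned unchanged
if(mb_detect_encoding($in) != "UTF-8")
return utf8_encode($in);
else
return $in;
} else {
// non-string values (including integer keys) pass through untouched
return $in;
}
}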
?>
Hope this may help.
[NOTE BY danbrown AT php DOT net: Original function written by (cmyk777 AT gmail DOT com) on 28-JAN-09.]
Julio Cesar
20-Jan-2009 06:38
With this script you can convert a lot of files in
subfolders to UTF-8 without problems.
I thought of it when I was converting an Eclipse
project to UTF-8 and lost all the accented characters.
With this script you will not! ;-)
I based it on Aidan Kehoe's script and on the one by webmaster at
asylum-et dot com from http://www.php.net/scandir:
<?php
ini_set("implicit_flush", "on");
ini_set("max_execution_time", 0);
ini_set("register_argc_argv", "on");
ini_set("html_errors", "Off");
function cp1252_to_utf8($str) {
$cp1252_map = array ("\xc2\x80" => "\xe2\x82\xac",
"\xc2\x82" => "\xe2\x80\x9a",
"\xc2\x83" => "\xc6\x92",
"\xc2\x84" => "\xe2\x80\x9e",
"\xc2\x85" => "\xe2\x80\xa6",
"\xc2\x86" => "\xe2\x80\xa0",
"\xc2\x87" => "\xe2\x80\xa1",
"\xc2\x88" => "\xcb\x86",
"\xc2\x89" => "\xe2\x80\xb0",
"\xc2\x8a" => "\xc5\xa0",
"\xc2\x8b" => "\xe2\x80\xb9",
"\xc2\x8c" => "\xc5\x92",
"\xc2\x8e" => "\xc5\xbd",
"\xc2\x91" => "\xe2\x80\x98",
"\xc2\x92" => "\xe2\x80\x99",
"\xc2\x93" => "\xe2\x80\x9c",
"\xc2\x94" => "\xe2\x80\x9d",
"\xc2\x95" => "\xe2\x80\xa2",
"\xc2\x96" => "\xe2\x80\x93",
"\xc2\x97" => "\xe2\x80\x94",
"\xc2\x98" => "\xcb\x9c",
"\xc2\x99" => "\xe2\x84\xa2",
"\xc2\x9a" => "\xc5\xa1",
"\xc2\x9b" => "\xe2\x80\xba",
"\xc2\x9c" => "\xc5\x93",
"\xc2\x9e" => "\xc5\xbe",
"\xc2\x9f" => "\xc5\xb8"
);
return strtr ( utf8_encode ( $str ), $cp1252_map );
}
function rscandir($base="", &$data=array()) {
$array = array_diff(scandir($base), array(".", ".."));
foreach($array as $value) :
if (is_dir($base.$value)) :
//$data[] = $base.$value."/";
$data = rscandir($base.$value."/", $data);
elseif (is_file($base.$value) &&
!preg_match('/\.(jpg|gif|png|ttf|dataModel|wsdlDataModel|project|jsdtscope|prefs|name|container|exe|bat|cmd|src|dll|ini|swf|fla|bmp)$/i',
$value)) : /* list the unwanted extensions here */
echo "Converting to UTF8 " . $base.$value . "\r\n";
file_put_contents(
$base.$value,
cp1252_to_utf8(
file_get_contents($base.$value)));
endif;
endforeach;
return $data;
}
echo "Type a Folder (With a Slash in end): ";
$folder = trim(fgets(STDIN));
rscandir($folder);
?>
You can put this script in the Windows directory together with a batch file like this:
@echo off
php -n C:\windows\ConvertUTF8.php
pause
Then you can convert your files from anywhere; just run a
command like: ConvertFilesToUTF8
I think this will help everyone! Enjoy ;-)
P.S.: I removed the comments because of the word wrap.
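As a possible shortcut for the hand-written map above, the mbstring extension can convert Windows-1252 directly. This is only a sketch: it assumes mbstring is available, and bytes that Windows-1252 leaves undefined (such as 0x81) may be handled differently than in the script above.

```php
<?php
// Bytes 0x80, 0x93 and 0x94 are the euro sign and curly quotes in Windows-1252.
$cp1252 = "\x80 100 \x93quoted\x94";
$utf8 = mb_convert_encoding($cp1252, 'UTF-8', 'Windows-1252');

echo bin2hex($utf8), "\n";
```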
bitseeker
22-Sep-2008 02:37
...or just use this simple piece of code to check whether a string is valid UTF-8:
<?php
/**
* Returns true if $string is valid UTF-8 and false otherwise.
*
* @since 1.14
* @param [mixed] $string string to be tested
* @subpackage
*/
function is_utf8($string) {
// From http://w3.org/International/questions/qa-forms-utf-8.html
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string);
}
?>
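A shorter check along the same lines relies on the /u modifier: preg_match() returns false when the subject is not valid UTF-8. The name `is_valid_utf8` is made up for this sketch. Note that it is not an exact equivalent of the pattern above, which also rejects most ASCII control characters.

```php
<?php
// Hypothetical helper: let PCRE's UTF-8 mode do the validation.
function is_valid_utf8(string $s): bool
{
    // The empty pattern matches any valid subject (result 1);
    // an invalid UTF-8 subject makes preg_match() return false.
    return preg_match('//u', $s) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // bool(true)
var_dump(is_valid_utf8("caf\xE9"));     // bool(false)
```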
hmdker at gmail dot com
24-Aug-2008 10:19
Here's my is_utf8 function, to detect valid UTF-8 text.
<?php
function is_utf8($str) {
$c=0; $b=0;
$bits=0;
$len=strlen($str);
for($i=0; $i<$len; $i++){
$c=ord($str[$i]);
if($c >= 128){
if(($c >= 254)) return false;
elseif($c >= 252) $bits=6;
elseif($c >= 248) $bits=5;
elseif($c >= 240) $bits=4;
elseif($c >= 224) $bits=3;
elseif($c >= 192) $bits=2;
else return false;
if(($i+$bits) > $len) return false;
while($bits > 1){
$i++;
$b=ord($str[$i]);
if($b < 128 || $b > 191) return false;
$bits--;
}
}
}
return true;
}
?>
akam
30-Jun-2008 07:44
<?php
// Author akam at akameng dot com
// Supports sequences of up to 6 bytes
function UTF_to_Unicode($input, $array=False) {
$bit1 = pow(64, 0);
$bit2 = pow(64, 1);
$bit3 = pow(64, 2);
$bit4 = pow(64, 3);
$bit5 = pow(64, 4);
$bit6 = pow(64, 5);
$value = '';
$val = array();
for($i=0; $i< strlen( $input ); $i++){
$ints = ord ( $input[$i] );
$z = ord ( $input[$i] );
$y = ord ( $input[$i+1] ) - 128;
$x = ord ( $input[$i+2] ) - 128;
$w = ord ( $input[$i+3] ) - 128;
$v = ord ( $input[$i+4] ) - 128;
$u = ord ( $input[$i+5] ) - 128;
if( $ints >= 0 && $ints <= 127 ){
// 1 bit
$value .= '&#'.($z * $bit1).';';
$val[] = $value;
}
if( $ints >= 192 && $ints <= 223 ){
// 2 bit
$value .= '&#'.(($z-192) * $bit2 + $y * $bit1).';';
$val[] = $value;
}
if( $ints >= 224 && $ints <= 239 ){
// 3 bit
$value .= '&#'.(($z-224) * $bit3 + $y * $bit2 + $x * $bit1).';';
$val[] = $value;
}
if( $ints >= 240 && $ints <= 247 ){
// 4 bit
$value .= '&#'.(($z-240) * $bit4 + $y * $bit3 +
$x * $bit2 + $w * $bit1).';';
$val[] = $value;
}
if( $ints >= 248 && $ints <= 251 ){
// 5 bit
$value .= '&#'.(($z-248) * $bit5 + $y * $bit4
+ $x * $bit3 + $w * $bit2 + $v * $bit1).';';
$val[] = $value;
}
if( $ints >= 252 && $ints <= 253 ){
// 6 bit
$value .= '&#'.(($z-252) * $bit6 + $y * $bit5
+ $x * $bit4 + $w * $bit3 + $v * $bit2 + $u * $bit1).';';
$val[] = $value;
}
if( $ints == 254 || $ints == 255 ){
echo 'Wrong Result!<br>';
}
}
if( $array === False ){
return $unicode = $value;
}
if($array === True ){
$val = str_replace('&#', '', $value);
$val = explode(';', $val);
$len = count($val);
unset($val[$len-1]);
return $unicode = $val;
}
}
function Unicode_to_UTF( $input, $array=TRUE){
$utf = '';
if(!is_array($input)){
$input = str_replace('&#', '', $input);
$input = explode(';', $input);
$len = count($input);
unset($input[$len-1]);
}
for($i=0; $i < count($input); $i++){
if ( $input[$i] <128 ){
$byte1 = $input[$i];
$utf .= chr($byte1);
}
if ( $input[$i] >=128 && $input[$i] <=2047 ){
$byte1 = 192 + (int)($input[$i] / 64);
$byte2 = 128 + ($input[$i] % 64);
$utf .= chr($byte1).chr($byte2);
}
if ( $input[$i] >=2048 && $input[$i] <=65535){
$byte1 = 224 + (int)($input[$i] / 4096);
$byte2 = 128 + ((int)($input[$i] / 64) % 64);
$byte3 = 128 + ($input[$i] % 64);
$utf .= chr($byte1).chr($byte2).chr($byte3);
}
if ( $input[$i] >=65536 && $input[$i] <=2097151){
$byte1 = 240 + (int)($input[$i] / 262144);
$byte2 = 128 + ((int)($input[$i] / 4096) % 64);
$byte3 = 128 + ((int)($input[$i] / 64) % 64);
$byte4 = 128 + ($input[$i] % 64);
$utf .= chr($byte1).chr($byte2).chr($byte3).
chr($byte4);
}
if ( $input[$i] >=2097152 && $input[$i] <=67108863){
$byte1 = 248 + (int)($input[$i] / 16777216);
$byte2 = 128 + ((int)($input[$i] / 262144) % 64);
$byte3 = 128 + ((int)($input[$i] / 4096) % 64);
$byte4 = 128 + ((int)($input[$i] / 64) % 64);
$byte5 = 128 + ($input[$i] % 64);
$utf .= chr($byte1).chr($byte2).chr($byte3).
chr($byte4).chr($byte5);
}
if ( $input[$i] >=67108864 && $input[$i] <=2147483647){
$byte1 = 252 + (int)($input[$i] / 1073741824);
$byte2 = 128 + ((int)($input[$i] / 16777216) % 64);
$byte3 = 128 + ((int)($input[$i] / 262144) % 64);
$byte4 = 128 + ((int)($input[$i] / 4096) % 64);
$byte5 = 128 + ((int)($input[$i] / 64) % 64);
$byte6 = 128 + ($input[$i] % 64);
$utf .= chr($byte1).chr($byte2).chr($byte3).
chr($byte4).chr($byte5).chr($byte6);
}
}
return $utf;
}
?>
www.tricinty.com
11-Jun-2008 04:13
<?php
/**
* Encodes an ISO-8859-1 mixed variable to UTF-8 (PHP 4, PHP 5 compat)
* @param mixed $input An array, associative or simple
* @param boolean $encode_keys optional
* @return mixed ( utf-8 encoded $input)
*/
function utf8_encode_mix($input, $encode_keys=false)
{
if(is_array($input))
{
$result = array();
foreach($input as $k => $v)
{
$key = ($encode_keys)? utf8_encode($k) : $k;
$result[$key] = utf8_encode_mix( $v, $encode_keys);
}
}
else
{
$result = utf8_encode($input);
}
return $result;
}
?>
klein at buchung-24 dot de
04-Jun-2008 04:52
If you don't use the function from ethan dot nelson at ltd dot org inside a class, you'll get an error (it calls $this->utf_prepare()), so please try this instead:
function utf_prepare(&$array)
{
foreach($array AS $key => &$value)
{
if (is_array($value))
{
utf_prepare($value);
} else
{
$value = utf8_encode($value);
}
}
}
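The same recursive walk can be sketched with array_walk_recursive(), which visits every non-array leaf. The function name `latin1_array_to_utf8` is hypothetical; the sketch assumes every string value is ISO-8859-1, needs the mbstring extension, and leaves keys untouched.

```php
<?php
// Hypothetical helper: convert every string leaf of a nested array to UTF-8.
function latin1_array_to_utf8(array $data): array
{
    array_walk_recursive($data, function (&$value) {
        if (is_string($value)) {
            $value = mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
        }
    });
    return $data; // $data is a local copy, so the caller's array is not modified
}

$in = array('a' => "\xE9", 'b' => array('c' => "\xFC", 'd' => 5));
var_dump(latin1_array_to_utf8($in));
```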
www.qaiser.net
17-Apr-2008 08:26
that isUTF8 function is a killer...
Wouldn't something like
if ( preg_match( "~(\x00[\x80-\xff]|[\x00-\x07][\x00-\xff])~", $string ) ) { /* is utf */ };
be a lot more efficient? It doesn't take all the ranges into account, but it has to be a better method and a simple start, since it quits on the first successful match. Think of encoding and decoding a 1 MB string: not good. I'm having to work with XML files of 20 MB and more.
renardo13 at free dot fr
01-Apr-2008 06:26
another nice way to implement an isUTF8 function ...
<?php
function isUTF8($string)
{
return (utf8_encode(utf8_decode($string)) == $string);
}
?>
tacchete at gmail dot com
13-Dec-2007 06:05
A known problem: the byte order mark (BOM) interferes with header() in the pages of a site.
It shows up, for example, when sending headers or when producing dynamic output in an encoding other than UTF-8 by means of XSLT (<xsl:output encoding="windows-1251"/>).
To strip every BOM from the page output:
1. remove the BOM from the main file;
2. register an output-buffer callback:
<?php
header('Content-Type: text/html; charset=utf-8');
ob_start('ob');
function ob($buffer)
{
return str_replace("\xef\xbb\xbf", '', $buffer);
}
?>
which removes the BOMs contributed by the included files;
3. stop worrying about BOMs in included files;
4. enjoy.
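When only a single string is involved (for instance the contents of one file), a narrower variant removes the BOM at the very start and leaves the rest of the data alone. The helper name `strip_leading_bom` is made up for this sketch.

```php
<?php
// Hypothetical helper: drop a UTF-8 BOM only if it is the first three bytes.
function strip_leading_bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    return strncmp($text, $bom, 3) === 0 ? substr($text, 3) : $text;
}

echo strip_leading_bom("\xEF\xBB\xBF<html>"), "\n"; // <html>
```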
ethan dot nelson at ltd dot org
07-Nov-2007 07:11
This does the same thing as some of the posts below (minus the keys), but I thought I'd share it anyway because it is slightly more elegant. It is also a good example of using references, so that this could be used as a callback function.
function utf_prepare(&$array) {
foreach($array AS $key => &$value) {
if (is_array($value)) {
$this->utf_prepare($value);
} else {
$value = utf8_encode($value);
}
}
}
luka8088 at gmail dot com
22-Jun-2007 07:49
Simple conversion of numeric HTML entities to UTF-8:
function html_to_utf8 ($data)
{
return preg_replace_callback('/&#([0-9]{3,10});/', function ($m) { return _html_to_utf8($m[1]); }, $data);
}
function _html_to_utf8 ($data)
{
if ($data > 127)
{
$i = 5;
while (($i--) > 0)
{
if ($data != ($a = $data % ($p = pow(64, $i))))
{
$ret = chr(base_convert(str_pad(str_repeat(1, $i + 1), 8, "0"), 2, 10) + (($data - $a) / $p));
for ($i; $i > 0; $i--)
$ret .= chr(128 + ((($data % pow(64, $i)) - ($data % ($p = pow(64, $i - 1)))) / $p));
break;
}
}
}
else
$ret = "&#$data;";
return $ret;
}
Example:
echo html_to_utf8("a b &#269; &#263; &#382; &#12371; &#12395; &#12385; &#12431; ()[]{}!#$?* < >");
Output:
a b č ć ž こ に ち わ ()[]{}!#$?* < >
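For comparison, PHP's built-in html_entity_decode() can produce UTF-8 from numeric entities as well. This is a sketch rather than a drop-in replacement: unlike the function above it also decodes named entities such as &amp;lt;, and the sample string is invented for the illustration.

```php
<?php
// Decimal entities for three Latin letters with diacritics and one hiragana character.
$html = 'a b &#269; &#263; &#382; &#12371;';
$utf8 = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo bin2hex($utf8), "\n";
```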
hillar dot petersen at gmail dot com
30-May-2007 11:29
In addition to my previous post. If your values are already in utf-8 maybe you want to utf8_encode array keys only. This will do it:
<?php
/**
* (Recursively) utf8_encode all array keys.
*
* @param array $array
* @return array with utf8_encoded keys
*/
function utf8_encode_array_keys($array)
{
$array_type = array_type($array);
if ($array_type == "map")
{
$result_array = array();
foreach($array as $key => $value)
{
if (is_array($value))
{
// recursion
$result_array[utf8_encode($key)] = utf8_encode_array_keys($value);
}
else
{
// value is not an array, no recursion
$result_array[utf8_encode($key)] = $value;
}
}
return $result_array;
}
else if ($array_type == "vector")
{
// do not encode anything, just follow the value if it is an array
$result_array = array();
foreach ($array as $key => $value)
{
if (is_array($value))
{
// recursion
$result_array[$key] = utf8_encode_array_keys($value);
}
else
{
// value is not an array, no recursion
$result_array[$key] = $value;
}
}
return $result_array;
}
return false; // argument is not an array, return false
}
?>
Also note that both this operation (with keys only) and the operation with both keys and values can be reversed by replacing "encode" by "decode".
hillar dot petersen at gmail dot com
29-May-2007 07:36
If you are interested in recursively converting ISO-8859-1-encoded arrays into UTF-8, then this is one way to do it. Could use a small refactor though. (I used it to prepare some ISO-8859-1 arrays for json_encode. Note that for this to work your values and for associative arrays also your keys must be ISO-8859-1-encoded.)
<?php
/**
* (Recursively) utf8_encode each value in an array.
*
* @param array $array
* @return array utf8_encoded
*/
function utf8_encode_array($array)
{
if (is_array($array))
{
$result_array = array();
foreach($array as $key => $value)
{
if (array_type($array) == "map")
{
// encode both key and value
if (is_array($value))
{
// recursion
$result_array[utf8_encode($key)] = utf8_encode_array($value);
}
else
{
// no recursion
if (is_string($value))
{
$result_array[utf8_encode($key)] = utf8_encode($value);
}
else
{
// do not re-encode non-strings, just copy data
$result_array[utf8_encode($key)] = $value;
}
}
}
else if (array_type($array) == "vector")
{
// encode value only
if (is_array($value))
{
// recursion
$result_array[$key] = utf8_encode_array($value);
}
else
{
// no recursion
if (is_string($value))
{
$result_array[$key] = utf8_encode($value);
}
else
{
// do not re-encode non-strings, just copy data
$result_array[$key] = $value;
}
}
}
}
return $result_array;
}
return false; // argument is not an array, return false
}
/**
* Determines array type ("vector" or "map"). Returns false if not an array at all.
* (I hope a native function will be introduced in some future release of PHP, because
* this check is inefficient and quite costly in worst case scenario.)
*
* @param array $array The array to analyze
* @return string array type ("vector" or "map") or false if not an array
*/
function array_type($array)
{
if (is_array($array))
{
$next = 0;
$return_value = "vector"; // we have a vector until proved otherwise
foreach ($array as $key => $value)
{
if ($key !== $next)
{
$return_value = "map"; // we have a map
break;
}
$next++;
}
return $return_value;
}
return false; // not array
}
?>
nikooo adog bk adot ru - Nickolaz
03-May-2007 07:32
You can use this simple code to convert win-1251 into Unicode.
function rus2uni($str,$isTo = true)
{
$arr = array('ё'=>'&#x451;','Ё'=>'&#x401;');
for($i=192;$i<256;$i++)
$arr[chr($i)] = '&#x4'.dechex($i-176).';';
$str =preg_replace(array('@([а-я]) @i','@ ([а-я])@i'),array('$1 ',' $1'),$str);
return strtr($str,$isTo?$arr:array_flip($arr));
}
That is useful for xml_parser (to parse windows-1251 files like utf-8).
18-Apr-2007 09:36
I just reread what I wrote; sorry for the typos, it was a long day.
Here is the corrected code:
xml_tpl.php
<?php
header("Content-Type: text/html;charset=ISO-8859-1");
print "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n";
$names=array('jack','bob','vanessa','catherine','valerie');
?>
<parent>
<?php foreach($names as $name) {?>
<child name="<?php print $name?>" />
<?php } ?>
</parent>
<?php
function create_xml(){
ob_start();
include "xml_tpl.php";
$trapped_content=ob_get_contents();
ob_end_clean();
$file_path= "./somefile.xml";
$file_handle=fopen($file_path,'w');
fwrite($file_handle,utf8_encode($trapped_content));
}
?>
penda ekoka
17-Apr-2007 11:45
Creating UTF-8 XML files:
This is something that has wasted a lot of my time; I hope this will spare you the headaches.
My method consists of creating an XML template that looks like this (the template is probably optional, I'm sure you can use good ol' print or echo statements):
xml_tpl.php
<?php
header("Content-Type: text/html;charset=ISO-8859-1");
print "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n";
$names=array('jack','bob','vanessa','catherine','valerie');
?>
<parent>
<?php foreach($names as $name) {?>
<child name="<?php print $name?>" />
<?php } ?>
</parent>
From a function or a method I include the previous template and trap the output in an output buffer. The buffered content is then written to a file:
<?php
function create_xml(){
ob_start();
include "xml_tpl.php";
$trapped_content=ob_get_contents();
ob_end_clean();
$file_path= "./somefile.xml";
$file_handle=fopen($file_path,'w');
fwrite($file_handle,utf8_encode($trapped_content));
}
?>
Some side notes:
- Note that the utf8_encode() call goes inside the fwrite() call.
- When troubleshooting, make sure to transfer text files (XML included) and scripts in ASCII mode when using FTP. For some unknown reason my FTP client did not have XML set as an ASCII transfer candidate and was automatically transferring those files in binary. That little "feature" ended up costing me hours of frustration, as the encoding information would just "vanish" during the transfer and I kept scratching my head as to why manually created UTF-8 files were not behaving as they should.
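An alternative to the template-plus-buffer approach is to build the document with the DOM extension, which writes the XML declaration and escapes attribute values itself. This is a sketch under the assumption that the dom extension is loaded and that the input strings are already UTF-8; the element names are taken from the template above.

```php
<?php
$names = array('jack', 'bob', 'vanessa', 'catherine', 'valerie');

$doc = new DOMDocument('1.0', 'UTF-8');
$doc->formatOutput = true;
$parent = $doc->appendChild($doc->createElement('parent'));
foreach ($names as $name) {
    $child = $parent->appendChild($doc->createElement('child'));
    $child->setAttribute('name', $name); // DOM expects UTF-8 strings here
}

$xml = $doc->saveXML();
echo $xml;
```

The resulting string could then be written out with file_put_contents() instead of being echoed.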
29-Mar-2007 03:37
<?php
function unicon($str, $to_uni = true) {
$cp = Array (
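"А" => "&#x410;", "а" => "&#x430;",
"Б" => "&#x411;", "б" => "&#x431;",
"В" => "&#x412;", "в" => "&#x432;",
"Г" => "&#x413;", "г" => "&#x433;",
"Д" => "&#x414;", "д" => "&#x434;",
"Е" => "&#x415;", "е" => "&#x435;",
"Ё" => "&#x401;", "ё" => "&#x451;",
"Ж" => "&#x416;", "ж" => "&#x436;",
"З" => "&#x417;", "з" => "&#x437;",
"И" => "&#x418;", "и" => "&#x438;",
"Й" => "&#x419;", "й" => "&#x439;",
"К" => "&#x41A;", "к" => "&#x43A;",
"Л" => "&#x41B;", "л" => "&#x43B;",
"М" => "&#x41C;", "м" => "&#x43C;",
"Н" => "&#x41D;", "н" => "&#x43D;",
"О" => "&#x41E;", "о" => "&#x43E;",
"П" => "&#x41F;", "п" => "&#x43F;",
"Р" => "&#x420;", "р" => "&#x440;",
"С" => "&#x421;", "с" => "&#x441;",
"Т" => "&#x422;", "т" => "&#x442;",
"У" => "&#x423;", "у" => "&#x443;",
"Ф" => "&#x424;", "ф" => "&#x444;",
"Х" => "&#x425;", "х" => "&#x445;",
"Ц" => "&#x426;", "ц" => "&#x446;",
"Ч" => "&#x427;", "ч" => "&#x447;",
"Ш" => "&#x428;", "ш" => "&#x448;",
"Щ" => "&#x429;", "щ" => "&#x449;",
"Ъ" => "&#x42A;", "ъ" => "&#x44A;",
"Ы" => "&#x42B;", "ы" => "&#x44B;",
"Ь" => "&#x42C;", "ь" => "&#x44C;",
"Э" => "&#x42D;", "э" => "&#x44D;",
"Ю" => "&#x42E;", "ю" => "&#x44E;",
"Я" => "&#x42F;", "я" => "&#x44F;"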
);
if ($to_uni) {
$str = strtr($str, $cp);
} else {
foreach ($cp as $c) {
$cpp[$c] = array_search($c, $cp);
}
$str = strtr($str, $cpp);
}
return $str;
}
?>
emze at donazga dot net
17-Dec-2006 11:12
/*
Every function seen so far is incomplete or resource-hungry. Here are two: integer to UTF-8 sequence (i3u) and UTF-8 sequence to integer (u3i). Below them is a code snippet that checks their behavior at the range boundaries.
Someday they might be built into PHP...
*/
function i3u($i) { // returns the UTF-8 sequence for an integer code point (UCS-2 or UCS-4 range)
$i=(int)$i; // integer?
if ($i<0) return false; // positive?
if ($i<=0x7f) return chr($i); // range 0
if (($i & 0x7fffffff) <> $i) return '?'; // 31 bit?
if ($i<=0x7ff) return chr(0xc0 | ($i >> 6)) . chr(0x80 | ($i & 0x3f));
if ($i<=0xffff) return chr(0xe0 | ($i >> 12)) . chr(0x80 | ($i >> 6) & 0x3f)
. chr(0x80 | $i & 0x3f);
if ($i<=0x1fffff) return chr(0xf0 | ($i >> 18)) . chr(0x80 | ($i >> 12) & 0x3f)
. chr(0x80 | ($i >> 6) & 0x3f) . chr(0x80 | $i & 0x3f);
if ($i<=0x3ffffff) return chr(0xf8 | ($i >> 24)) . chr(0x80 | ($i >> 18) & 0x3f)
. chr(0x80 | ($i >> 12) & 0x3f) . chr(0x80 | ($i >> 6) & 0x3f) . chr(0x80 | $i & 0x3f);
return chr(0xfc | ($i >> 30)) . chr(0x80 | ($i >> 24) & 0x3f) . chr(0x80 | ($i >> 18) & 0x3f)
. chr(0x80 | ($i >> 12) & 0x3f) . chr(0x80 | ($i >> 6) & 0x3f) . chr(0x80 | $i & 0x3f);
}
function u3i($s,$strict=1) { // returns integer on valid UTF-8 seq, NULL on empty, else FALSE
// NOT strict: takes only DATA bits, present or not; strict: length and bits checking
if ($s=='') return NULL;
$l=strlen($s); $o=ord($s[0]);
if ($o <= 0x7f && $l==1) return $o;
if ($l>6 && $strict) return false;
if ($strict) for ($i=1;$i<$l;$i++) if (ord($s[$i]) > 0xbf || ord($s[$i])< 0x80) return false;
if ($o < 0xc2) return false; // no-go even if strict=0
// in strict mode the string length must match the length announced by the lead byte
if ($o <= 0xdf) return ($strict && $l!=2) ? false : (($o & 0x1f) << 6 | (ord($s[1]) & 0x3f));
if ($o <= 0xef) return ($strict && $l!=3) ? false : (($o & 0x0f) << 12 | (ord($s[1]) & 0x3f) << 6
| (ord($s[2]) & 0x3f));
if ($o <= 0xf7) return ($strict && $l!=4) ? false : (($o & 0x07) << 18 | (ord($s[1]) & 0x3f) << 12
| (ord($s[2]) & 0x3f) << 6 | (ord($s[3]) & 0x3f));
if ($o <= 0xfb) return ($strict && $l!=5) ? false : (($o & 0x03) << 24 | (ord($s[1]) & 0x3f) << 18
| (ord($s[2]) & 0x3f) << 12 | (ord($s[3]) & 0x3f) << 6 | (ord($s[4]) & 0x3f));
if ($o <= 0xfd) return ($strict && $l!=6) ? false : (($o & 0x01) << 30 | (ord($s[1]) & 0x3f) << 24
| (ord($s[2]) & 0x3f) << 18 | (ord($s[3]) & 0x3f) << 12
| (ord($s[4]) & 0x3f) << 6 | (ord($s[5]) & 0x3f));
return false;
}
// boundary behavior checking
$do=array(0x7f,0x7ff,0xffff,0x1fffff,0x3ffffff,0x7fffffff);
foreach ($do as $ii) for ($i=$ii;$i<=$ii+1; $i++) {
$o=i3u($i);
for ($j=0;$j<strlen($o);$j++) print "O[$j]=" . sprintf('%08b',ord($o[$j])) . ", ";
print "c=$i, o=[$o].\n";
print "Back: [$o] => [" . u3i($o) . "]\n";
}
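On PHP 7.2 and later, the mbstring functions mb_chr() and mb_ord() cover the same ground for valid Unicode code points (they do not handle the obsolete 5- and 6-byte forms). A minimal sketch, assuming mbstring is available:

```php
<?php
$cp = 0x20AC;                     // U+20AC EURO SIGN
$char = mb_chr($cp, 'UTF-8');     // integer -> UTF-8 sequence

echo bin2hex($char), "\n";        // e282ac
echo mb_ord($char, 'UTF-8'), "\n"; // 8364
```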
sadikkeskin at hotmail dot com
21-Nov-2006 04:19
I wrote a function to convert UTF-8 to ISO-8859-9. It is very useful when working with Ajax.
You can apply the same approach to other languages.
<?
function str_encode ($string,$to="iso-8859-9",$from="utf8") {
if($to=="iso-8859-9" && $from=="utf8"){
$str_array = array(
chr(196).chr(177) => chr(253),
chr(196).chr(176) => chr(221),
chr(195).chr(182) => chr(246),
chr(195).chr(150) => chr