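PHP Multibyte String Functions

References

Multibyte character encoding schemes and their related issues are fairly complicated and are beyond the scope of this documentation. Please refer to the following URLs and other resources for further information on these topics.

Unicode materials

» http://www.unicode.org/

Japanese/Korean/Chinese character information

» http://examples.oreilly.com/cjkvinfo/doc/cjk.inf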

mb_check_encoding — Check if a string is valid for the specified encoding
mb_convert_case — Perform case conversion on a string
mb_convert_encoding — Convert character encoding
mb_convert_kana — Convert between "kana" forms ("zen-kaku", "han-kaku" and more)
mb_convert_variables — Convert the character encoding of one or more variables
mb_decode_mimeheader — Decode a string in a MIME header field
mb_decode_numericentity — Decode HTML numeric string references to characters
mb_detect_encoding — Detect character encoding
mb_detect_order — Set/Get character encoding detection order
mb_encode_mimeheader — Encode a string for a MIME header
mb_encode_numericentity — Encode characters to HTML numeric string references
mb_encoding_aliases — Get aliases of a known encoding type
mb_ereg_match — Regular expression match for multibyte string
mb_ereg_replace_callback — Perform a regular expression search and replace with multibyte support using a callback
mb_ereg_replace — Replace regular expression with multibyte support
mb_ereg_search_getpos — Returns start point for next regular expression match
mb_ereg_search_getregs — Retrieve the result from the last multibyte regular expression match
mb_ereg_search_init — Setup string and regular expression for a multibyte regular expression match
mb_ereg_search_pos — Returns position and length of a matched part of the multibyte regular expression for a predefined multibyte string
mb_ereg_search_regs — Returns the matched part of a multibyte regular expression
mb_ereg_search_setpos — Set start point of next regular expression match
mb_ereg_search — Multibyte regular expression match for predefined multibyte string
mb_ereg — Regular expression match with multibyte support
mb_eregi_replace — Replace regular expression with multibyte support ignoring case
mb_eregi — Regular expression match ignoring case with multibyte support
mb_get_info — Get internal settings of mbstring
mb_http_input — Detect HTTP input character encoding
mb_http_output — Set/Get HTTP output character encoding
mb_internal_encoding — Set/Get internal character encoding
mb_language — Set/Get current language
mb_list_encodings — Returns an array of all supported encodings
mb_output_handler — Callback function that converts character encoding in the output buffer
mb_parse_str — Parse GET/POST/COOKIE data and set global variables
mb_preferred_mime_name — Get MIME charset string
mb_regex_encoding — Set/Get character encoding for multibyte regex
mb_regex_set_options — Set/Get the default options for mbregex functions
mb_send_mail — Send encoded mail
mb_split — Split a multibyte string using a regular expression
mb_strcut — Get part of a string
mb_strimwidth — Get a string truncated to the specified width
mb_stripos — Find the position of the first occurrence of a string within another, case insensitive
mb_stristr — Find the first occurrence of a string within another, case insensitive
mb_strlen — Get string length
mb_strpos — Find the position of the first occurrence of a string within another
mb_strrchr — Find the last occurrence of a character in a string within another
mb_strrichr — Find the last occurrence of a character in a string within another, case insensitive
mb_strripos — Find the position of the last occurrence of a string within another, case insensitive
mb_strrpos — Find the position of the last occurrence of a string within another
mb_strstr — Find the first occurrence of a string within another
mb_strtolower — Make a string lowercase
mb_strtoupper — Make a string uppercase
mb_strwidth — Return the width of a string
mb_substitute_character — Set/Get substitution character
mb_substr_count — Count the number of substring occurrences
mb_substr — Get part of a string

Last updated: Fri, 12 Jul 2013

User Contributed Notes: Multibyte String Functions (29 notes)

marc at ermshaus dot org

4 years ago


A small correction to patrick at hexane dot org's mb_str_replace 
function. The original function does not work as intended in case 
$replacement contains $needle.



<?php
function mb_str_replace($needle, $replacement, $haystack)
{
    $needle_len = mb_strlen($needle);
    $replacement_len = mb_strlen($replacement);
    $pos = mb_strpos($haystack, $needle);
    while ($pos !== false)
    {
        $haystack = mb_substr($haystack, 0, $pos) . $replacement
                . mb_substr($haystack, $pos + $needle_len);
        // Resume the search after the inserted replacement so that a
        // replacement containing $needle is not matched again.
        $pos = mb_strpos($haystack, $needle, $pos + $replacement_len);
    }
    return $haystack;
}
?>


efesar

2 years ago


This small mb_trim function works for me. 





<?php
function mb_trim( $string )
{
    // \s needs its backslash; the u modifier treats pattern and subject as UTF-8.
    $string = preg_replace( "/(^\s+)|(\s+$)/us", "", $string );

    return $string;
}
?>


johannesponader at dontspamme dot googlemail dot co

2 years ago


Please note that when migrating code to handle UTF-8 encoding, the
functions mentioned here are not the only ones that matter: calls to
htmlentities() also have to be changed to htmlentities($var, ENT_COMPAT,
"UTF-8") or similar. I didn't scan the manual for it, but there could be
some more functions that need adjustments like this.
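
A minimal sketch of that adjustment, assuming the input is already valid UTF-8. The wrapper name escape_utf8 is hypothetical and used only for illustration; the point is that the charset is passed explicitly instead of relying on a default.

```php
<?php
// Hypothetical helper: always pass the charset explicitly so the result
// does not depend on the configured default.
function escape_utf8($var)
{
    return htmlentities($var, ENT_COMPAT, 'UTF-8');
}

echo escape_utf8('café <b>'), "\n";
```

Wrapping the call in one place keeps the charset argument from being forgotten at individual call sites.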


chris at maedata dot com

6 years ago


The opposite of what Eugene Murai wrote in a previous comment is true 
when importing/uploading a file. For instance, if you export an Excel 
spreadsheet using the Save As Unicode Text option, you can use the 
following to convert it to UTF-8 after uploading:



//Convert file to UTF-8 in case Windows mucked it up
//(trim after converting: trim() on raw UTF-16 bytes could strip half of a character)
$file = explode( "\n", trim( mb_convert_encoding( file_get_contents( $_FILES['file']['tmp_name'] ), 'UTF-8', 'UTF-16' ) ) );


mdoocy at u dot washington dot edu

6 years ago


Note that some of the multi-byte functions run in O(n) time, rather than
 constant time as is the case for their single-byte equivalents. This 
includes any functionality requiring access at a specific index, since 
random access is not possible in a string whose number of bytes will not
 necessarily match the number of characters. Affected functions include:
 mb_substr(), mb_strstr(), mb_strcut(), mb_strpos(), etc.
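
The byte/character mismatch behind this can be seen in a short sketch. It assumes the script file is saved as UTF-8. Splitting the string into characters once, rather than calling mb_substr() for every index, is one possible way to avoid repeated scans; it is an illustration, not a benchmarked recommendation.

```php
<?php
$text = '漢字はユニコード';   // 8 characters, 3 bytes each in UTF-8

// Byte count versus character count.
$bytes = strlen($text);
$chars = mb_strlen($text, 'UTF-8');

// Split once into an array of characters; array access by index is then cheap.
$letters = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

echo $bytes, ' bytes, ', $chars, ' characters, third character: ', $letters[2], "\n";
```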


deceze at gmail dot com

10 months ago


Please note that all the discussion about mb_str_replace in the comments
 is pretty pointless. str_replace works just fine with multibyte 
strings:



<?php

$string  = '漢字はユニコード';
$needle  = 'は';
$replace = 'Foo';

echo str_replace($needle, $replace, $string);
// outputs: 漢字Fooユニコード

?>



The usual problem is that the string is evaluated as a binary string,
meaning PHP is not aware of encodings at all. Problems arise if you are
getting a value "from outside" (database, POST request) and the encoding
of the needle and the haystack is not the same. That typically means the
source code is not saved in the same encoding as the data you are
receiving "from outside". The binary representations then don't match
and nothing happens.
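
A small sketch of that failure mode, assuming the script file is saved as UTF-8. The UTF-16LE value below is only a stand-in for data arriving "from outside"; in real code the source encoding has to be known or detected.

```php
<?php
$needle   = 'は';  // UTF-8, because the source file is assumed to be UTF-8
$external = mb_convert_encoding('漢字はユニコード', 'UTF-16LE', 'UTF-8'); // stand-in for outside data

// The byte sequences differ, so str_replace() finds nothing to replace.
$unchanged = str_replace($needle, 'Foo', $external);

// Normalise the outside value to the needle's encoding first.
$utf8   = mb_convert_encoding($external, 'UTF-8', 'UTF-16LE');
$result = str_replace($needle, 'Foo', $utf8);

echo $result, "\n";
```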


phpnet at rcpt dot at

2 years ago


<?php
/**
* Multibyte safe version of trim()
* Always strips whitespace characters (those equal to \s)
*
* @author Peter Johnson
* @email phpnet@rcpt.at
* @param $string The string to trim
* @param $chars Optional list of chars to remove from the string ( as per trim() )
* @param $chars_array Optional array of preg_quote'd chars to be removed
* @return string
*/
function mb_trim( $string, $chars = "", $chars_array = array() )
{
    for( $x=0; $x<iconv_strlen( $chars ); $x++ ) $chars_array[] = preg_quote( iconv_substr( $chars, $x, 1 ) );
    $encoded_char_list = implode( "|", array_merge( array( "\s","\t","\n","\r", "\0", "\x0B" ), $chars_array ) );

    $string = mb_ereg_replace( "^($encoded_char_list)*", "", $string );
    $string = mb_ereg_replace( "($encoded_char_list)*$", "", $string );
    return $string;
}
?>


mt at mediamedics dot nl

3 years ago


A multibyte one-to-one alternative for the str_split function (http://php.net/manual/en/function.str-split.php):



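<?php
// PHP 7.4 and later ship a native mb_str_split(); only define this fallback when it is missing.
if (!function_exists('mb_str_split')) {
    function mb_str_split($string, $split_length = 1){
        mb_internal_encoding('UTF-8');
        mb_regex_encoding('UTF-8');

        $split_length = ($split_length <= 0) ? 1 : $split_length;
        $mb_strlen = mb_strlen($string, 'utf-8');
        $array = array();

        // Advance by $split_length on each pass; "$i + $split_length" never changed $i.
        for($i = 0; $i < $mb_strlen; $i += $split_length){
            $array[] = mb_substr($string, $i, $split_length);
        }

        return $array;
    }
}
?>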


rawsrc at gmail dot com

1 year ago


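Hi,

For those who are looking for mb_str_replace, here's a simple function:

<?php
function mb_str_replace($needle, $replacement, $haystack) {
   // Note: mb_split() treats $needle as a regular expression pattern,
   // so regex metacharacters in $needle are not matched literally.
   return implode($replacement, mb_split($needle, $haystack));
}
?>

I haven't found a simpler way to proceed :-)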


peter AT(no spam) dezzignz dot com

3 years ago


The function trim() has not failed me so far in my multibyte 
applications, but in case one needs a truly multibyte function, here it 
is. The nice thing is that the character to remove can be whitespace or 
any other specified character, even a multibyte character.



<?php

// multibyte string split
function mbStringToArray ($str) {
    if (empty($str)) return false;
    $len = mb_strlen($str);
    $array = array();
    for ($i = 0; $i < $len; $i++) {
        $array[] = mb_substr($str, $i, 1);
    }
    return $array;
}

// removes $rem at both ends
function mb_trim ($str, $rem = ' ') {
    if (empty($str)) return false;
    // convert to array
    $arr = mbStringToArray($str);
    $len = count($arr);
    // left side
    for ($i = 0; $i < $len; $i++) {
        if ($arr[$i] === $rem) $arr[$i] = '';
        else break;
    }
    // right side
    for ($i = $len-1; $i >= 0; $i--) {
        if ($arr[$i] === $rem) $arr[$i] = '';
        else break;
    }
    // convert to string
    return implode ('', $arr);
}

?>


roydukkey at roydukkey dot com

3 years ago


This would be one way to create a multibyte substr_replace function





<?php
function mb_substr_replace($output, $replace, $posOpen, $posClose) {
    return mb_substr($output, 0, $posOpen).$replace.mb_substr($output, $posClose+1);
}
?>


sakai at d4k dot net

4 years ago


I hope this mb_str_replace will work for arrays.  Please use 
mb_internal_encoding() beforehand, if you need to change the encoding.



Thanks to marc at ermshaus dot org for the original.



<?php

if(!function_exists('mb_str_replace')) {

    function mb_str_replace($search, $replace, $subject) {

        if(is_array($subject)) {
            $ret = array();
            foreach($subject as $key => $val) {
                $ret[$key] = mb_str_replace($search, $replace, $val);
            }
            return $ret;
        }

        foreach((array) $search as $key => $s) {
            if($s == '') {
                continue;
            }
            $r = !is_array($replace) ? $replace : (array_key_exists($key, $replace) ? $replace[$key] : '');
            $pos = mb_strpos($subject, $s);
            while($pos !== false) {
                $subject = mb_substr($subject, 0, $pos) . $r . mb_substr($subject, $pos + mb_strlen($s));
                $pos = mb_strpos($subject, $s, $pos + mb_strlen($r));
            }
        }

        return $subject;

    }

}

?>


mitgath at gmail dot com

4 years ago


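According to http://bugs.php.net/bug.php?id=21317, here's the missing function:

<?php
function mb_str_pad ($input, $pad_length, $pad_string, $pad_style, $encoding="UTF-8") {
   // str_pad() counts bytes, so widen the target length by the difference
   // between the byte length and the character length of $input.
   return str_pad($input,
       strlen($input)-mb_strlen($input,$encoding)+$pad_length, $pad_string, $pad_style);
}
?>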


Ben XO

4 years ago


PHP5 has no mb_trim(), so here's one I made. It works just like trim(), but
with the added bonus of PCRE character classes (including, of course,
all the useful Unicode ones such as \pZ).





Unlike other approaches that I've seen to this problem, I wanted to 
emulate the full functionality of trim() - in particular, the ability to
 customise the character list.





<?php
    /**
     * Trim characters from either (or both) ends of a string in a way that is
     * multibyte-friendly.
     *
     * Mostly, this behaves exactly like trim() would: for example supplying 'abc' as
     * the charlist will trim all 'a', 'b' and 'c' chars from the string, with, of
     * course, the added bonus that you can put unicode characters in the charlist.
     *
     * We are using a PCRE character-class to do the trimming in a unicode-aware
     * way, so we must escape ^, \, - and ] which have special meanings here.
     * As you would expect, a single \ in the charlist is interpreted as
     * "trim backslashes" (and duly escaped into a double-\ ). Under most circumstances
     * you can ignore this detail.
     *
     * As a bonus, however, we also allow PCRE special character-classes (such as '\s')
     * because they can be extremely useful when dealing with UCS. '\pZ', for example,
     * matches every 'separator' character defined in Unicode, including non-breaking
     * and zero-width spaces.
     *
     * It doesn't make sense to have two or more of the same character in a character
     * class, therefore we interpret a double \ in the character list to mean a
     * single \ in the regex, allowing you to safely mix normal characters with PCRE
     * special classes.
     *
     * *Be careful* when using this bonus feature, as PHP also interprets backslashes
     * as escape characters before they are even seen by the regex. Therefore, to
     * specify '\\s' in the regex (which will be converted to the special character
     * class '\s' for trimming), you will usually have to put *4* backslashes in the
     * PHP code - as you can see from the default value of $charlist.
     *
     * @param string
     * @param charlist list of characters to remove from the ends of this string.
     * @param boolean trim the left?
     * @param boolean trim the right?
     * @return String
     */
    function mb_trim($string, $charlist='\\\\s', $ltrim=true, $rtrim=true)
    {
        // Nothing to do if neither end should be trimmed.
        if(!$ltrim && !$rtrim)
        {
            return $string;
        }

        $both_ends = $ltrim && $rtrim;

        // First escape ^, -, ] and \ for use inside a character class,
        // then collapse each run of four backslashes back into one.
        $char_class_inner = preg_replace(
            array( '/[\^\-\]\\\]/S', '/\\\{4}/S' ),
            array( '\\\\\\0', '\\' ),
            $charlist
        );

        $work_horse = '[' . $char_class_inner . ']+';
        $ltrim && $left_pattern = '^' . $work_horse;
        $rtrim && $right_pattern = $work_horse . '$';

        if($both_ends)
        {
            $pattern_middle = $left_pattern . '|' . $right_pattern;
        }
        elseif($ltrim)
        {
            $pattern_middle = $left_pattern;
        }
        else
        {
            $pattern_middle = $right_pattern;
        }

        return preg_replace("/$pattern_middle/usSD", '', $string);
    }
?>


patrick at hexane dot org

5 years ago


I wonder why there isn't a mb_str_replace().  Here's one for now:



function mb_str_replace( $needle, $replacement, $haystack ) {
  $needle_len = mb_strlen($needle);
  $pos = mb_strpos( $haystack, $needle);
  while (!($pos ===false)) {
    $front = mb_substr( $haystack, 0, $pos );
    $back  = mb_substr( $haystack, $pos + $needle_len);
    $haystack = $front.$replacement.$back;
    // Caveat: the search restarts from the beginning of the string, so this
    // loops forever when $replacement itself contains $needle.
    $pos = mb_strpos( $haystack, $needle);
  }
  return $haystack;
}


motin at demomusic dot nu

6 years ago


As peter dot albertsson at spray dot se already pointed out, overloading
 strlen may break code that handles binary data and relies upon strlen 
for bytelengths. 



The problem occurs when a file is filled with a string using fwrite in the following manner:



$len = strlen($data);

fwrite($fp, $data, $len);



fwrite takes the number of bytes as its third parameter, but with
overloading enabled strlen behaves like mb_strlen and returns the number
of characters in the string. Since multibyte characters may each occupy
more than one byte, the last characters of $data never get written to
the file.
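
One possible way to get a byte count that does not depend on overloading is to ask mb_strlen() for the length in the '8bit' encoding. The sketch below assumes the script file is saved as UTF-8; the helper name byte_length is hypothetical, and php://memory stands in for a real file.

```php
<?php
// Hypothetical helper: byte length of $data, independent of mbstring.func_overload.
function byte_length($data)
{
    return mb_strlen($data, '8bit');
}

$data = "tuöéä";                       // 5 characters, 8 bytes in UTF-8
$fp = fopen('php://memory', 'w+');     // in-memory stream used instead of a file
fwrite($fp, $data, byte_length($data));
rewind($fp);
$written = stream_get_contents($fp);
fclose($fp);
```

Omitting the third argument to fwrite() altogether is another option when the whole string should be written.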



After hours of investigating why PEAR::Cache_Lite didn't work - the above is what I found. 



I made an attempt at using single byte functions, but it doesn't work. Posting here anyway in case it helps someone else:



/**
* PHP single byte functions simulation (not successful)
*
* Usage: sb_string(functionname, arg1, arg2, etc);
* Example: sb_string("strlen", "tuöéä"); returns 8 (should...)
*/
function sb_string() {
  $arguments = func_get_args();
  $func_overloading = ini_get("mbstring.func_overload");
  // Probably why this fails: mbstring.func_overload does not appear to be
  // changeable at runtime, so these ini_set() calls have no effect.
  ini_set("mbstring.func_overload", 0);
  $ret = call_user_func_array(array_shift($arguments), $arguments);
  ini_set("mbstring.func_overload", $func_overloading);
  return $ret;
}


pdezwart .at. snocap

6 years ago


If you are trying to emulate the UnicodeEncoding.Unicode.GetBytes() function in .NET, the encoding you want to use is: UCS-2LE
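
A minimal sketch of that conversion, assuming the input is UTF-8 and the script file is saved as UTF-8. For characters outside the Basic Multilingual Plane, 'UTF-16LE' may be the closer match to the .NET behaviour, since UCS-2 has no surrogate pairs.

```php
<?php
$utf8  = 'Aé';
// Two bytes per character, low byte first.
$bytes = mb_convert_encoding($utf8, 'UCS-2LE', 'UTF-8');

echo bin2hex($bytes), "\n";
```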


hayk at mail dot ru

6 years ago


Since PHP 5.1.0 and PHP 4.4.2 there is an Armenian ArmSCII-8 (ArmSCII-8, ArmSCII8, ARMSCII-8, ARMSCII8) encoding available.


daniel at softel dot jp

6 years ago


Note that although "multi-byte" hints at total internationalization, the
 mb_ API was designed by a Japanese person to support the Japanese 
language.



Some of the functions, for example mb_convert_kana(), make absolutely no sense outside of a Japanese language environment.



It should perhaps be considered "lucky" if the functions work with non-Japanese multi-byte languages.



I don't mean any disrespect to the mb_ API because I'm using it everyday
 and I appreciate its usefulness, but maybe a better name would be the 
jp_ API.


Aardvark

7 years ago


Since not all hosted services currently support the multi-byte function
set, it may still be necessary to process Unicode strings using standard
single byte functions. The function at the following link - http://www.kanolife.com/escape/2006/03/php-unicode-processing.html
- shows by example how to do this. While this only covers UTF-8, the
standard PHP function "iconv" allows conversion into and out of UTF-8 if
strings need to be input or output in other encodings.


peter kehl

7 years ago


UTF-16LE solution for CSV for Excel by Eugene Murai works well:

$unicode_str_for_Excel = chr(255).chr(254).mb_convert_encoding( $utf8_str, 'UTF-16LE', 'UTF-8');



However, Excel on Mac OS X then doesn't identify columns properly and
puts each whole row in its own cell. To fix that, use the TAB "\t"
character as the CSV delimiter rather than a comma or colon.



You may also want to use HTTP encoding header, such as

header( "Content-type: application/vnd.ms-excel; charset=UTF-16LE" );
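
Put together, the approach might look like the sketch below. The helper name rows_to_excel_unicode is hypothetical, and the rows are assumed to contain UTF-8 strings with no tabs or line breaks inside the cells; the script file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: tab-delimited rows as UTF-16LE with a leading BOM.
function rows_to_excel_unicode(array $rows)
{
    $lines = array();
    foreach ($rows as $row) {
        $lines[] = implode("\t", $row);   // TAB as the column delimiter
    }
    $utf8 = implode("\r\n", $lines) . "\r\n";

    // BOM (FF FE) followed by the little-endian UTF-16 data.
    return chr(255) . chr(254) . mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');
}

$out = rows_to_excel_unicode(array(array('名前', 'price'), array('寿司', '12')));
```

Cells that may contain tabs or newlines would need quoting, which this sketch does not attempt.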


Anonymous

7 years ago


Get the string octet-size when mbstring.func_overload is set to 2:

<?php
function str_sizeof($string) {
    // Without the u modifier "." matches a single byte; the s modifier
    // makes it match newline bytes as well, so they are counted too.
    return count(preg_split("`.`s", $string)) - 1 ;
}
?>

In answer to peter albertsson: once you have the octet-size of your data, you can access each octet with
$string[0] ... $string[$size-1], since the [] operator is not multibyte-aware.


peter dot albertsson at spray dot se

8 years ago


Setting mbstring.func_overload = 2 may break your applications that deal with binary data.



After having set mbstring.func_overload = 2 and  
mbstring.internal_encoding = UTF-8 I can't even read a binary file and 
print/echo it to output without corrupting it.


nzkiwi at NOSPAMmte dot biglobe dot ne dot jp

8 years ago


A friend has pointed out that the entry 

"mbstring.http_input PHP_INI_ALL" in Table 1 on the mbstring page 
appears to be wrong: above Example 4 it says that "There is no way to 
control HTTP input character conversion from PHP script. To disable HTTP
 input character conversion, it has to be done in php.ini". 

Also the table shows the old-PHP-version defaults: 

;; Disable HTTP Input conversion 

mbstring.http_input = pass  *BUT* (for PHP 4.3.0 or higher) 

;; Disable HTTP Input conversion 

mbstring.encoding_translation = Off


Eugene Murai

8 years ago


PHP can input and output Unicode, but in a slightly different way from
what Microsoft means: when Microsoft says "Unicode", it implicitly means
little-endian UTF-16 with BOM (FF FE = chr(255).chr(254)), whereas PHP's
"UTF-16" means big-endian with BOM. For this reason, PHP does not seem
to be able to output a Unicode CSV file for Microsoft Excel. Solving this
problem is quite simple: just put the BOM in front of the UTF-16LE string.



Example:



$unicode_str_for_Excel = chr(255).chr(254).mb_convert_encoding( $utf8_str, 'UTF-16LE', 'UTF-8');


Lee Byron

1 year ago


Looks like mb_str_replace is the most requested missing function from the multibyte string library.





I wanted a version of mb_str_replace with as similar a code signature 
and behavior to str_replace as possible while conforming to the code 
signature patterns of the mb library and avoiding performance pitfalls 
like unnecessary concatenations and regular expressions.





<?php
/**
 * Multibyte safe version of str_replace.
 * See http://php.net/manual/en/function.str-replace.php
 */
function mb_str_replace(
  $search,
  $replace,
  $subject,
  $encoding = null,
  &$count = null) {

  if (is_array($subject)) {
    $result = array();
    foreach ($subject as $key => $item) {
      // Keep the subject's keys in the result.
      $result[$key] = mb_str_replace($search, $replace, $item, $encoding, $count);
    }
    return $result;
  }

  if (!is_array($search)) {
    return _mb_str_replace($search, $replace, $subject, $encoding, $count);
  }

  $replace_is_array = is_array($replace);
  foreach ($search as $key => $value) {
    $subject = _mb_str_replace(
      $value,
      // A search entry without a matching replacement is replaced by ''.
      $replace_is_array ? (isset($replace[$key]) ? $replace[$key] : '') : $replace,
      $subject,
      $encoding,
      $count
    );
  }
  return $subject;
}

/**
 * Implementation of mb_str_replace. Do not call directly. Casts its parameters to strings.
 */
function _mb_str_replace(
  $search,
  $replace,
  $subject,
  $encoding = null,
  &$count = null) {

  // Scalar type hints were removed from the signature so that this also runs on PHP 5.
  $search = (string) $search;
  $replace = (string) $replace;
  $subject = (string) $subject;
  if ($encoding === null) {
    $encoding = mb_internal_encoding();
  }
  if ($search === '') {
    // An empty search string could never advance the offset.
    return $subject;
  }

  $search_length = mb_strlen($search, $encoding);
  $subject_length = mb_strlen($subject, $encoding);
  $offset = 0;
  $result = '';

  while ($offset < $subject_length) {
    $match = mb_strpos($subject, $search, $offset, $encoding);
    if ($match === false) {
      if ($offset === 0) {
        // No match was ever found, just return the subject.
        return $subject;
      }
      // Append the final portion of the subject to the replaced.
      $result .=
        mb_substr($subject, $offset, $subject_length - $offset, $encoding);
      break;
    }
    if ($count !== null) {
      $count++;
    }
    $result .= mb_substr($subject, $offset, $match - $offset, $encoding);
    $result .= $replace;
    $offset = $match + $search_length;
  }

  return $result;
}
?>


Smelly

6 years ago


Below is some code to output a UTF-8 encoded CSV in a way understandable by Excel. It requires iconv instead of mbstring.



header("Content-type: application/octet-stream");

header("Content-Transfer-Encoding: binary");

header("Content-Disposition: attachment; filename=report.xls");

    

// assume $tmpString contains UTF-8 encoded CSV:

$tmpString =  iconv ( 'UTF-8', 'UTF-16LE//IGNORE', $tmpString );



print chr(255).chr(254).$tmpString;


motin at demomusic dot nu

6 years ago


Follow up on last note from 2007-jan-20: http://se2.php.net/manual/en/function.mb-strlen.php#72979



There is the correct way of simulating singlebyte strlen as well as some
 pitfalls to watch out for when developing in a mb-func_overload:ed 
environment.


Geoffrey

8 years ago


For Windows users, php_mbstring can be added as follows:

If you have downloaded the "short" version of PHP
(php-4.3.10-installer.exe), download the full version
(php-4.3.10-Win32.zip).

Unzip it, find php_mbstring.dll in
f:\php-4.3.10-Win32\extensions, and copy it across to your
php\extensions directory.

Use Notepad to open your PHP.INI.

Change the extension_dir line to read
extension_dir = "e:\php\extensions"  (or whatever your
directory is called)

Remove the semi-colon on the line
 ; extension=php_mbstring.dll

Save PHP.INI and restart PHP.
