Difference between [0-9], [[:digit:]] and \d

Yes: [[:digit:]] ~ [0-9] ~ \d (where ~ means approximately equal).
In most programming languages (where it is supported), \d ≡ [[:digit:]] (identical).
\d is less common than [[:digit:]]: it is not in POSIX, but it is in GNU grep -P.

There are many digits in Unicode, for example:

0123456789 # Hindu-Arabic numerals (ASCII)
٠١٢٣٤٥٦٧٨٩ # ARABIC-INDIC
۰۱۲۳۴۵۶۷۸۹ # EXTENDED ARABIC-INDIC/PERSIAN
߀߁߂߃߄߅߆߇߈߉ # NKO DIGIT
०१२३४५६७८९ # DEVANAGARI

All of which may be included in [[:digit:]] or \d.

Instead, [0-9] is generally only the ASCII digits 0123456789.
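What the digits listed above have in common is their Unicode general category, Nd (decimal digit). As a small sketch of how to check that, here is a Python snippet using the standard unicodedata module. The names FIVES and describe are made up for this illustration, and the digit five of each script is written as an escape so the snippet does not depend on the editor's font support.

```python
import unicodedata

# Code points of the digit five in each script listed above:
# ASCII, Arabic-Indic, Extended Arabic-Indic, NKo, Devanagari.
FIVES = ["\u0035", "\u0665", "\u06F5", "\u07C5", "\u096B"]

def describe(ch):
    """Return (Unicode name, general category, decimal value) for one character."""
    return unicodedata.name(ch), unicodedata.category(ch), unicodedata.decimal(ch)

for ch in FIVES:
    name, category, value = describe(ch)
    # Every line should show category "Nd" and the value 5.
    print(f"U+{ord(ch):04X}  {category}  {value}  {name}")
```

Only the first of these five characters falls inside the ASCII range that [0-9] is usually taken to mean.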


In many languages (Perl, Java, Python, C), [[:digit:]] (and \d) can take on this extended meaning. For example, this Perl code matches all the digits from above:

$ a='0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९'

$ echo "$a" | perl -C -pe 's/[^\d]//g;' ; echo
0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९

Which is equivalent to selecting all characters that have the Unicode property Nd (Number, decimal digit):

$ echo "$a" | perl -C -pe 's/[^\p{Nd}]//g;' ; echo
0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९

Which grep -P can reproduce (the specific version of PCRE may have a different internal list of numeric code points than Perl):

$ echo "$a" | grep -oP '\p{Nd}+'
0123456789
٠١٢٣٤٥٦٧٨٩
۰۱۲۳۴۵۶۷۸۹
߀߁߂߃߄߅߆߇߈߉
०१२३४५६७८९

Change it to [0-9] to see:

$ echo "$a" | grep -o '[0-9]\+'
0123456789
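The same contrast can be sketched in Python, one of the languages mentioned above. This is only an illustration under the assumption of Python 3's standard re module, where \d on str patterns matches any Unicode decimal digit unless the re.ASCII flag is given. ZEROS and keep_digits are hypothetical names for this sketch, and the test string is rebuilt from code points rather than pasted.

```python
import re

# Code point of the digit zero in each script used in this post
# (ASCII, Arabic-Indic, Extended Arabic-Indic, NKo, Devanagari).
# In each of these blocks the digits 0-9 are contiguous.
ZEROS = [0x0030, 0x0660, 0x06F0, 0x07C0, 0x0966]

# Same shape as the shell variable $a: five runs of ten digits, space separated.
a = " ".join("".join(chr(zero + i) for i in range(10)) for zero in ZEROS)

def keep_digits(text, flags=0):
    """Delete everything that \\d does not match, like s/[^\\d]//g in Perl."""
    return re.sub(r"\D", "", text, flags=flags)

print(keep_digits(a))                    # Unicode-aware \d: all fifty digits remain
print(keep_digits(a, flags=re.ASCII))    # re.ASCII limits \d to the ASCII digits
print("".join(re.findall(r"[0-9]+", a))) # explicit range: ASCII digits only
```

With re.ASCII, \d behaves like [0-9], which is the usual way to opt out of the extended meaning in Python.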

POSIX

For the specific POSIX BRE or ERE:
\d is not supported (it is not in POSIX, but it is in GNU grep -P). [[:digit:]] is required by POSIX to correspond to the digit character class, which in turn is required by ISO C to be the characters 0 through 9 and nothing else. So only in the C locale do [0-9], [0123456789], \d and [[:digit:]] all mean exactly the same thing. [0123456789] has no possible misinterpretations; [[:digit:]] is available in more utilities and commonly means only [0123456789]; \d is supported by few utilities.

As for [0-9], the meaning of range expressions is only defined by POSIX in the C locale; in other locales it might be different (might be codepoint order or collation order or something else).

Original post: https://www.cnblogs.com/kakaisgood/p/9645277.html