Ruby/String/encoding
From Wiki.crossplatform.ru
Revision as of 17:10, 26 May 2010
Check the string encoding
# -*- coding: utf-8 -*-
s = "2×2=4"    # Note the multibyte multiplication character
s.encoding     # => #<Encoding:UTF-8>
t = "2+2=4"    # All characters are in the ASCII subset of UTF-8
t.encoding     # => #<Encoding:UTF-8> (a literal takes the source encoding, even if ASCII-only)
t.ascii_only?  # => true
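Besides `encoding`, Ruby 1.9 and later also provide `ascii_only?` and `valid_encoding?`, which can help when checking what a string actually contains. The following is a minimal sketch; the helper name `encoding_report` is hypothetical and not part of Ruby.

```ruby
# Hypothetical helper: summarize how Ruby sees a string's encoding.
def encoding_report(str)
  {
    name: str.encoding.name,      # e.g. "UTF-8"
    ascii_only: str.ascii_only?,  # true if every character is 7-bit ASCII
    valid: str.valid_encoding?    # false if the bytes are malformed for the encoding
  }
end

p encoding_report("2\u00D72=4")   # "2×2=4", written with a \u escape
p encoding_report("2+2=4")
```

The `\u` escape forces the first literal to UTF-8 regardless of the source encoding, which is why it is used here instead of a raw "×".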
Encoding and bytesize
euro1 = "\u20AC"                     # Start with the Unicode Euro character
puts euro1                           # Prints "€"
euro1.encoding                       # => #<Encoding:UTF-8>
euro1.bytesize                       # => 3
euro2 = euro1.encode("iso-8859-15")  # Transcode to ISO-8859-15 (Latin-9)
puts euro2.inspect                   # Prints "\xA4"
euro2.encoding                       # => #<Encoding:ISO-8859-15>
euro2.bytesize                       # => 1
euro3 = euro2.encode("utf-8")        # Transcode back to UTF-8
euro1 == euro3                       # => true
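Encoding constants
Encoding::ASCII_8BIT   # Raw bytes; also available as ::BINARY
Encoding::UTF_8        # UTF-8-encoded Unicode characters
Encoding::EUC_JP       # EUC-encoded Japanese
Encoding::SHIFT_JIS    # Shift_JIS-encoded Japanese
Encoding::WINDOWS_31J  # Microsoft's Shift_JIS variant, a separate encoding; also ::CP932
# ::SJIS names one of the two encodings above, depending on the Ruby version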
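To see which names a given Ruby build registers for an encoding, `Encoding#names` can be queried directly. This is only a sketch: the helper name `encoding_names` is hypothetical, and the exact alias lists vary between Ruby versions, so no output is shown.

```ruby
# Hypothetical helper: list every name Ruby registers for an encoding.
def encoding_names(enc)
  enc.names
end

puts encoding_names(Encoding::UTF_8).inspect
puts encoding_names(Encoding::SHIFT_JIS).inspect

# ::ASCII_8BIT and ::BINARY refer to the same Encoding object.
puts Encoding::ASCII_8BIT.equal?(Encoding::BINARY)
```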
Get each byte of a Unicode string
# Ruby 1.8 only: jcode was removed in Ruby 1.9, where $KCODE no longer has any effect
$KCODE = "u"          # Specify Unicode UTF-8, or start Ruby with the -Ku option
require "jcode"       # Load multibyte character support
mb = "2\303\2272=4"   # This is "2×2=4" with a Unicode multiplication sign
mb.each_byte do |c|   # Iterate through the bytes of the string
  print c, " "        # c is a Fixnum
end                   # Outputs "50 195 151 50 61 52 "
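The snippet above relies on `$KCODE` and `jcode`, which belong to Ruby 1.8. On Ruby 1.9 or later the same bytes can be read without either. A minimal sketch, with `byte_values` as a hypothetical helper name:

```ruby
# Assumes Ruby 1.9 or later; no $KCODE or jcode required.
def byte_values(str)   # hypothetical helper name
  str.bytes.to_a       # Integer value of each byte, in order
end

mb = "2\u00D72=4"                 # "2×2=4" with a Unicode multiplication sign
puts byte_values(mb).join(" ")    # prints "50 195 151 50 61 52"
```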
Get each character of a Unicode string
# Ruby 1.8 only: jcode was removed in Ruby 1.9, where $KCODE no longer has any effect
$KCODE = "u"          # Specify Unicode UTF-8, or start Ruby with the -Ku option
require "jcode"       # Load multibyte character support
mb = "2\303\2272=4"   # This is "2×2=4" with a Unicode multiplication sign
mb.each_char do |c|   # Iterate through the characters of the string
  print c, " "        # c is a String with jlength 1 and a variable byte length
end                   # Outputs "2 × 2 = 4 "
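On Ruby 1.9 or later, `String#each_char` is built in and respects the string's own encoding, so the character iteration above needs no setup. A minimal sketch, with `characters` as a hypothetical helper name:

```ruby
# Assumes Ruby 1.9 or later, where strings carry their own encoding.
def characters(str)     # hypothetical helper name
  str.each_char.to_a    # one String per character, whatever its byte length
end

mb = "2\u00D72=4"                # "2×2=4"
puts characters(mb).join(" ")    # prints "2 × 2 = 4"
```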
Get the position of the first multibyte char
# Ruby 1.8 only: jcode was removed in Ruby 1.9, where $KCODE no longer has any effect
$KCODE = "u"          # Specify Unicode UTF-8, or start Ruby with the -Ku option
require "jcode"       # Load multibyte character support
mb = "2\303\2272=4"   # This is "2×2=4" with a Unicode multiplication sign
mb.mbchar?            # => 1: position of the first multibyte character, or nil if there is none
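`mbchar?` comes from `jcode` and is not available on Ruby 1.9 or later. One possible replacement, sketched below, scans the characters and reports the first one that occupies more than one byte. The helper name `first_multibyte_index` is hypothetical, and it returns a character index rather than a byte offset.

```ruby
# Hypothetical replacement for jcode's mbchar? on Ruby 1.9 or later.
# Returns the character index of the first multibyte character, or nil.
def first_multibyte_index(str)
  str.each_char.find_index { |c| c.bytesize > 1 }
end

p first_multibyte_index("2\u00D72=4")   # => 1
p first_multibyte_index("2+2=4")        # => nil
```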
Get the UTF-8 encoding object
encoding = Encoding.find("utf-8")   # => #<Encoding:UTF-8>
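`Encoding.find` matches names case-insensitively and raises `ArgumentError` for a name it does not know. A minimal sketch that wraps the lookup so unknown names yield `nil` instead (`lookup_encoding` is a hypothetical helper name):

```ruby
# Hypothetical helper: look up an encoding by name, returning nil if unknown.
def lookup_encoding(name)
  Encoding.find(name)
rescue ArgumentError
  nil
end

p lookup_encoding("utf-8")              # the UTF-8 Encoding object
p lookup_encoding("no-such-encoding")   # => nil
```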
Interpret a byte as an ISO-8859-15 codepoint and transcode to UTF-8
byte = "\xA4"                                # A single raw byte
char = byte.encode("utf-8", "iso-8859-15")   # => "€"
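The two-argument form of `encode` can also be written as two explicit steps: relabel the bytes with `force_encoding` (which changes no bytes), then transcode. A minimal sketch, with `latin9_to_utf8` as a hypothetical helper name:

```ruby
# Hypothetical helper: treat raw bytes as ISO-8859-15 and transcode to UTF-8.
def latin9_to_utf8(bytes)
  labeled = bytes.dup.force_encoding(Encoding::ISO_8859_15)  # relabel only, bytes unchanged
  labeled.encode(Encoding::UTF_8)                            # actual transcoding happens here
end

puts latin9_to_utf8("\xA4")   # prints "€"
```

The `dup` keeps the caller's string untouched, since `force_encoding` modifies its receiver in place.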
Compare the byte length and character length of a multibyte string
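# Ruby 1.8 only: jcode was removed in Ruby 1.9, where $KCODE no longer has any effect
$KCODE = "u"          # Specify Unicode UTF-8, or start Ruby with the -Ku option
require "jcode"       # Load multibyte character support
mb = "2\303\2272=4"   # This is "2×2=4" with a Unicode multiplication sign
puts mb.length        # Prints 6: the string has 6 bytes
puts mb.jlength       # Prints 5: but only 5 characters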
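On Ruby 1.9 or later the roles are different: `length` counts characters and `bytesize` counts bytes, so `jlength` is no longer needed. A minimal sketch, with `length_report` as a hypothetical helper name:

```ruby
# Assumes Ruby 1.9 or later.
def length_report(str)   # hypothetical helper name
  { characters: str.length, bytes: str.bytesize }
end

mb = "2\u00D72=4"        # "2×2=4"
report = length_report(mb)
puts report[:characters]   # prints 5
puts report[:bytes]        # prints 6
```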
The ISO-8859-1 encoding doesn't have a Euro sign, so this raises an exception
"\u20AC".encode("iso-8859-1")   # Raises Encoding::UndefinedConversionError
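If raising is not desired, the exception can be rescued, or `encode` can be asked to substitute a replacement for characters the target encoding lacks. A minimal sketch; the helper name `to_latin1` and the choice of "?" as the replacement are assumptions made for illustration:

```ruby
# Hypothetical helper: transcode to ISO-8859-1, substituting "?" for
# characters that have no ISO-8859-1 equivalent.
def to_latin1(str)
  str.encode("iso-8859-1")
rescue Encoding::UndefinedConversionError
  str.encode("iso-8859-1", undef: :replace, replace: "?")
end

p to_latin1("\u20AC").bytes.to_a   # => [63], the byte for "?"
p to_latin1("\u00D7").bytes.to_a   # => [215], since "×" exists in ISO-8859-1
```

Passing `undef: :replace` directly, without the `rescue`, would give the same result; the two-step form is used here only to show where the exception arises.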