Ruby/String/encoding

From Wiki.crossplatform.ru


Revision as of 17:10, 26 May 2010


Check the string encoding

# -*- coding: utf-8 -*-
s = "2×2=4"     # Note multibyte multiplication character
s.encoding      # => #<Encoding:UTF-8>
t = "2+2=4"     # All characters are in the ASCII subset of UTF-8
t.encoding      # => #<Encoding:UTF-8>: a literal takes the source encoding
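
Because both literals report the source encoding, `ascii_only?` is what tells a pure-ASCII string apart from one with multibyte characters. A minimal sketch, assuming Ruby 2.0 or later (or a UTF-8 magic comment); `encoding_info` is a hypothetical helper name, not part of Ruby:

```ruby
# -*- coding: utf-8 -*-
# encoding_info is a hypothetical helper: it returns the encoding name
# and whether every character of the string is in the ASCII range.
def encoding_info(str)
  [str.encoding.name, str.ascii_only?]
end

p encoding_info("2\u00D72=4")   # => ["UTF-8", false]
p encoding_info("2+2=4")        # => ["UTF-8", true]
```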



Encoding and bytesize

euro1 = "\u20AC"                     # Start with the Unicode Euro character
puts euro1                           # Prints "€"
euro1.encoding                       # => #<Encoding:UTF-8>
euro1.bytesize                       # => 3
euro2 = euro1.encode("iso-8859-15")  # Transcode to Latin-9
puts euro2.inspect                   # Prints "\xA4"
euro2.encoding                       # => #<Encoding:ISO-8859-15>
euro2.bytesize                       # => 1
euro3 = euro2.encode("utf-8")        # Transcode back to UTF-8
euro1 == euro3                       # => true
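
The comparison above succeeds only because both strings are UTF-8 again. Two strings holding the same character in different encodings do not compare equal until they are brought to a common encoding. A small sketch of that idea, assuming Ruby 1.9 or later; `same_text?` is a hypothetical helper:

```ruby
# same_text? is a hypothetical helper: it transcodes both strings to UTF-8
# before comparing them, so the encoding itself no longer matters.
def same_text?(a, b)
  a.encode(Encoding::UTF_8) == b.encode(Encoding::UTF_8)
end

utf8   = "\u20AC"                      # Euro sign, 3 bytes in UTF-8
latin9 = utf8.encode("iso-8859-15")    # Euro sign, 1 byte in Latin-9

p utf8 == latin9             # => false: different bytes and encodings
p same_text?(utf8, latin9)   # => true
```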



Encoding constants

Encoding::ASCII_8BIT     # Also ::BINARY
Encoding::UTF_8          # UTF-8-encoded Unicode characters
Encoding::EUC_JP         # EUC-encoded Japanese
Encoding::SHIFT_JIS      # Japanese Shift_JIS
Encoding::WINDOWS_31J    # Microsoft variant: also ::CP932 (and ::SJIS in recent Rubies)
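
These constants are ordinary `Encoding` objects, so they can be passed wherever an encoding name is accepted. A brief sketch, assuming Ruby 1.9 or later; `to_latin9` is a hypothetical helper:

```ruby
# to_latin9 is a hypothetical helper: it transcodes using the constant
# Encoding::ISO_8859_15 instead of the name "iso-8859-15".
def to_latin9(str)
  str.encode(Encoding::ISO_8859_15)
end

euro = to_latin9("\u20AC")
p euro.encoding.equal?(Encoding::ISO_8859_15)     # => true
p Encoding::BINARY.equal?(Encoding::ASCII_8BIT)   # => true: two names, one object
```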



Get the bytes of a Unicode string

$KCODE = "u"        # Ruby 1.8 only: specify UTF-8, or start Ruby with -Ku
require "jcode"     # Ruby 1.8 only: load multibyte character support
mb = "2\303\2272=4" # This is "2×2=4" with a Unicode multiplication sign
mb.each_byte do |c| # Iterate through the bytes of the string
  print c, " "      # c is a Fixnum
end                 # Outputs "50 195 151 50 61 52 "
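
On Ruby 1.9 and later, `$KCODE` has no effect and `jcode` no longer exists, but `each_byte` works the same way on any string. A minimal sketch under that assumption; `byte_values` is a hypothetical helper:

```ruby
# byte_values is a hypothetical helper: it collects the bytes of a string
# as integers, regardless of the string's encoding.
def byte_values(str)
  str.each_byte.to_a
end

p byte_values("2\u00D72=4")   # => [50, 195, 151, 50, 61, 52]
```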



Get each character of a Unicode string

$KCODE = "u"        # Ruby 1.8 only: specify UTF-8, or start Ruby with -Ku
require "jcode"     # Ruby 1.8 only: load multibyte character support
mb = "2\303\2272=4" # This is "2×2=4" with a Unicode multiplication sign

mb.each_char do |c| # Iterate through the characters of the string
  print c, " "      # c is a String with jlength 1 and variable byte length
end                 # Outputs "2 × 2 = 4 "
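
On Ruby 1.9 and later, `each_char` is built in and follows the string's own encoding. The sketch below pairs each character with its byte size to show the variable width; `char_sizes` is a hypothetical helper:

```ruby
# char_sizes is a hypothetical helper: it returns one [character, bytesize]
# pair per character of the string.
def char_sizes(str)
  str.each_char.map { |c| [c, c.bytesize] }
end

char_sizes("2\u00D72=4").each do |char, size|
  puts "#{char}: #{size} byte(s)"   # the multiplication sign takes 2 bytes
end
```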



Get the position of the first multibyte char

$KCODE = "u"        # Ruby 1.8 only: specify UTF-8, or start Ruby with -Ku
require "jcode"     # Ruby 1.8 only: load multibyte character support
mb = "2\303\2272=4" # This is "2×2=4" with a Unicode multiplication sign
mb.mbchar?          # => 1: position of the first multibyte char, or nil
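
`mbchar?` comes from `jcode` and is unavailable on Ruby 1.9 and later. One possible substitute, sketched here as an assumption rather than an official replacement, scans for the first character wider than one byte. Note that `first_multibyte_index` (a hypothetical helper) returns a character index, not a byte offset:

```ruby
# first_multibyte_index is a hypothetical helper: it returns the character
# index of the first multibyte character, or nil if there is none.
def first_multibyte_index(str)
  str.each_char.find_index { |c| c.bytesize > 1 }
end

p first_multibyte_index("2\u00D72=4")   # => 1
p first_multibyte_index("2+2=4")        # => nil
```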



Get the UTF-8 encoding object

encoding = Encoding.find("utf-8")   # => #<Encoding:UTF-8>
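
`Encoding.find` ignores case and raises `ArgumentError` for a name it does not know. A small sketch of a lookup that returns `nil` instead, assuming Ruby 1.9 or later; `find_encoding` is a hypothetical helper:

```ruby
# find_encoding is a hypothetical helper: it wraps Encoding.find and
# returns nil for unknown names instead of raising ArgumentError.
def find_encoding(name)
  Encoding.find(name)
rescue ArgumentError
  nil
end

p find_encoding("UTF-8").equal?(find_encoding("utf-8"))   # => true
p find_encoding("no-such-encoding")                       # => nil
```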



Interpret a byte as an ISO-8859-15 code point and transcode it to UTF-8

byte = "\xA4"                               # The Euro sign in ISO-8859-15
char = byte.encode("utf-8", "iso-8859-15")  # => "€": second argument is the source encoding
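
The two-argument `encode` call can also be written as two explicit steps: tag the raw bytes with their real encoding, then transcode. A sketch under the assumption of Ruby 1.9 or later; `decode_bytes` is a hypothetical helper:

```ruby
# decode_bytes is a hypothetical helper: force_encoding relabels the bytes
# without changing them, then encode converts them to UTF-8.
def decode_bytes(bytes, source_name)
  bytes.dup.force_encoding(source_name).encode("utf-8")
end

char = decode_bytes("\xA4", "iso-8859-15")
p char == "\u20AC"        # => true
p char.encoding.name      # => "UTF-8"
```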



Compare the byte length and the character length

$KCODE = "u"        # Ruby 1.8 only: specify UTF-8, or start Ruby with -Ku
require "jcode"     # Ruby 1.8 only: load multibyte character support
mb = "2\303\2272=4" # This is "2×2=4" with a Unicode multiplication sign
puts mb.length      # Prints 6: the string has 6 bytes
puts mb.jlength     # Prints 5: but only 5 characters
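
On Ruby 1.9 and later the roles are different: `length` counts characters and `bytesize` counts bytes, with no `jcode` needed. A minimal sketch under that assumption; `byte_and_char_count` is a hypothetical helper:

```ruby
# byte_and_char_count is a hypothetical helper: it returns the number of
# bytes and the number of characters in a string.
def byte_and_char_count(str)
  [str.bytesize, str.length]
end

p byte_and_char_count("2\u00D72=4")   # => [6, 5]
```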



ISO-8859-1 has no Euro sign, so this raises an exception

"\u20AC".encode("iso-8859-1")   # Raises Encoding::UndefinedConversionError
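
One way to cope with such characters is to rescue the error and retry with a replacement character. The sketch below assumes Ruby 1.9 or later and uses the `:undef` and `:replace` options of `String#encode`; `to_latin1` is a hypothetical helper, and "?" is an arbitrary choice of substitute:

```ruby
# to_latin1 is a hypothetical helper: it transcodes to ISO-8859-1 and, if a
# character has no Latin-1 equivalent, substitutes "?" for it.
def to_latin1(str)
  str.encode("iso-8859-1")
rescue Encoding::UndefinedConversionError
  str.encode("iso-8859-1", :undef => :replace, :replace => "?")
end

p to_latin1("caf\u00E9").bytes.to_a   # => [99, 97, 102, 233]
p to_latin1("\u20AC")                 # => "?"
```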