Being a brief code fragment that makes me happy.
There’s this little 10-byte file called 4c, like so:
~/dev/rx/ 627> hexdump 4c
0000000 26 d0 96 e4 b8 ad f0 90 8d 86
These bytes are the UTF-8 encoding of a particular four-character string as described in Characters vs. Bytes.
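To see how ten bytes turn into four characters, here is a minimal sketch that rebuilds the string from the hexdump values and lists the bytes behind each character. The variable names (bytes, str, breakdown) are mine, for illustration only; it assumes a Ruby recent enough to have String#each_char and String#bytes.

```ruby
# The ten bytes from the hexdump above, written out as integers.
bytes = [0x26, 0xd0, 0x96, 0xe4, 0xb8, 0xad, 0xf0, 0x90, 0x8d, 0x86]

# pack('C*') yields a binary string; force_encoding relabels it as
# UTF-8 without changing any of the bytes.
str = bytes.pack('C*').force_encoding('UTF-8')

# Pair each character with the hex form of the bytes that encode it.
breakdown = str.each_char.map do |ch|
  [ch, ch.bytes.map { |b| format('%02x', b) }.join(' ')]
end

breakdown.each { |ch, hex| puts "#{ch}  #{hex}" }
```

The characters take one, two, three, and four bytes respectively, which is how 10 bytes add up to a four-character string.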
I’m running Ruby 1.9 as checked out from svn earlier today:
~/dev/rx/ 628> ruby -v
ruby 1.9.0 (2008-09-19 revision 19423) [i386-darwin9.4.0]
There’s a new method, String#each_codepoint:
~/dev/rx/ 629> ri String#each_codepoint
-------------------------------------------------- String#each_codepoint
str.each_codepoint {|integer| block } => str
------------------------------------------------------------------------
Passes the +Integer+ ordinal of each character in _str_, also known
as a _codepoint_ when applied to Unicode strings to the given
block.
"hello\u0639".each_codepoint {|c| print c, ' ' }
_produces:_
104 101 108 108 111 1593
And it works! (Disclaimer: I probably am not using the best and simplest idiom.)
~/dev/rx/ 630> irb
irb(main):001:0> u = File.read('4c').force_encoding('UTF-8')
=> "&Ж中𐍆"
irb(main):002:0> u.each_codepoint {|c| printf("U+%04X\n", c) }
U+0026
U+0416
U+4E2D
U+10346
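A slightly different idiom for the same session, as a sketch rather than a recommendation: String#codepoints collects the code points, and encoding to UTF-16BE shows, for contrast, the 16-bit units a UTF-16 system would store. The string is written with escapes here so the example does not depend on the 4c file; the names u, labels, and units are just for illustration.

```ruby
# The same four characters as in the file, written with escapes.
u = "&\u0416\u4E2D\u{10346}"

# String#codepoints gives the code points as integers.
labels = u.codepoints.map { |c| format('U+%04X', c) }
puts labels

# For contrast: the same string as big-endian UTF-16 code units.
# unpack('n*') reads the encoded bytes as 16-bit unsigned integers.
units = u.encode('UTF-16BE').unpack('n*')
puts units.map { |x| format('%04X', x) }.join(' ')
```

Four code points come out of the first half, but five 16-bit units out of the second: the last character, U+10346, needs the surrogate pair D800 DF46 in UTF-16.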
Further background and explanation may be found here. I felt like writing back saying “And can we have ponies, too?”
Comment feed for ongoing:
From: Lars Marius Garshol (Sep 19 2008, at 04:28)
In other words: each_codepoint really does what it advertises, and does not treat UTF-16 surrogates as code points, but shows the last character as a single code point, instead of as the two units used to encode it in UTF-16.
That really is good, and is certainly more than Java can do: http://java.sun.com/javase/6/docs/api/java/lang/String.html#length()
From: g (Sep 20 2008, at 16:16)
What a pity that the Linear A syllabary hasn't yet made it into the Unicode standard: it would have been amusing to have U+10646 instead of U+10346.
From: Jay Carlson (Sep 20 2008, at 18:25)
Sometimes I think Ruby's text processing is an elaborate parody of the English-centric Unix world's attitudes decades ago.
From: Julian Reschke (Sep 21 2008, at 03:25)
Lars,
"That really is good, and is certainly more than Java can do" -- see: http://java.sun.com/javase/6/docs/api/java/lang/CharSequence.html
From: Gaute Strokkenes (Sep 23 2008, at 22:59)
Julian: Lars' point is that Java strings can only cope with non-BMP characters by means of surrogate pairs, i.e. encoding them with pairs of chars, rather than a single char. This is a common misfeature of all UTF-16 systems, and I see nothing in the Javadoc you quoted to refute this.