In preparation for my presentation next weekend at RubyConf, I’ve been poking around at Ruby’s string-handling. One thing that text-wranglers such as me like to do is walk through a string a character at a time, and Ruby doesn’t make this particularly easy. I ended up implementing String#each_char_utf8 three times along the way. [Update: Lots of interesting feedback, and a worth-reading ruby-talk thread.]
I poked around in ActiveSupport::Multibyte (I can’t get it to work on my computer, but I can read the source). It appears that the only way to look into a Ruby string and see Unicode characters is with unpack('U*'), or using a regexp with $KCODE set to 'u'.
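For instance, a minimal sketch of both approaches; the string is my own example, and this assumes Ruby 1.8, where $KCODE governs regexp behavior:

s = "na\xc3\xafve"      # "naïve" as raw UTF-8 bytes

# unpack('U*') decodes the whole string into Unicode codepoints
p s.unpack('U*')        # => [110, 97, 239, 118, 101]

# with $KCODE set to 'u', a regexp matches characters, not bytes
$KCODE = 'u'
p s.scan(/./).size      # => 5, though the string holds 6 bytes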
So, I want to go walking through a string, looking at the Unicode characters. Maybe I’m parsing a big XML file using mmap(2) or some such. What I want is an efficient String#next_char.
This will be hard in Ruby in the general case because Strings don’t know what encoding they’re in; there’s $KCODE, but that’s only defined to work with regular expressions. So let’s look at the special case of UTF-8.
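To see the problem concretely, here’s what a Ruby 1.8 string looks like from the inside (my example; $KCODE has no effect on any of this):

$KCODE = 'u'
s = "na\xc3\xafve"   # five characters in six bytes

p s.length           # => 6, counting bytes, not characters
p s[2]               # => 195, the first byte of "ï", not the character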
Of course, if String#unpack took a block like String#gsub, that would give you the tools you need. I looked at pack.c and it would be real work, but it doesn’t look architecturally impossible. Failing that, let’s use unpack anyhow:
def each_utf8_unpack(s, &block)
  s.unpack('U*').each(&block)   # decode the whole string up front, then iterate
end
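Called like so (the string is just my example):

each_utf8_unpack("na\xc3\xafve") { |u| print u, ' ' }
# prints: 110 97 239 118 101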
The above sucks for big strings because you create a monster array of integers. Regular expressions are maybe a little more efficient (this depends on the $KCODE setting, obviously):
def each_utf8_regex(s)
  s.gsub(/./m) do |c|
    yield c.unpack('U').shift   # the matched character's codepoint
    ''                          # replace with nothing; we only want the walk
  end
end
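Usage is the same, except that $KCODE has to be 'u' so that /./ sees characters rather than bytes (again, my example string):

$KCODE = 'u'
each_utf8_regex("na\xc3\xafve") { |u| print u, ' ' }
# prints: 110 97 239 118 101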
The unpack voodoo is because I want the integer value of the Unicode character. Here is a more ambitious version, extending String and picking the UTF-8 apart a byte at a time:
class String

  # Walk the string, yielding each character's Unicode codepoint.
  def each_char_utf8
    @utf8_index = 0
    while @utf8_index < length   # length counts bytes in Ruby 1.8
      yield next_utf8_char
    end
  end

  def next_byte
    b = self[@utf8_index]        # in Ruby 1.8, self[i] is the byte as a Fixnum
    @utf8_index += 1
    return b
  end

  # Look at the lead byte; return the sequence length and the lead
  # byte's payload bits.
  def first_utf8
    b = next_byte
    if    b & 0x80 == 0    then return 1, b
    elsif b & 0xe0 == 0xc0 then return 2, b & 0x1f
    elsif b & 0xf0 == 0xe0 then return 3, b & 0x0f
    else                        return 4, b & 0x07
    end
  end

  # Each continuation byte contributes six bits.
  def next_6bits
    next_byte & 0x3f
  end

  def next_utf8_char
    len, c = first_utf8
    case len
    when 2
      c = (c << 6) | next_6bits
    when 3
      c = (c << 12) | (next_6bits << 6) | next_6bits
    when 4
      c = (c << 18) | (next_6bits << 12) | (next_6bits << 6) | next_6bits
    end
    return c
  end
end
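A quick sanity check, not an exhaustive test, that the byte-at-a-time walk agrees with unpack:

s = "na\xc3\xafve"
chars = []
s.each_char_utf8 { |u| chars << u }
p chars == s.unpack('U*')   # => true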
I’m pretty sure the above has more or less the right semantics, but it’s a candidate for implementation in C (or Java, for JRuby).
I tested the performance by running all three versions over 2,000,000 bytes of ongoing text, containing a few thousand non-ASCII characters, and doing a character frequency count. Both the regex version and the byte-at-a-time version took around 18 seconds on my PowerBook; the unpack version took less than 5.
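For the record, the harness looked more or less like this; it’s a sketch, and 'ongoing.txt' is a hypothetical stand-in for my two megabytes of ongoing text:

$KCODE = 'u'
text = File.open('ongoing.txt', 'rb') { |f| f.read }

freq = Hash.new(0)                         # codepoint => count
start = Time.now
text.each_char_utf8 { |u| freq[u] += 1 }   # swap in either of the other versions
printf("%.1f seconds, %d distinct characters\n", Time.now - start, freq.size)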
I’ve asked the ruby-talk mailing list why unpack doesn’t take a block; let’s see what they say.
Comments:
From: Jerome Lacoste (Oct 14 2006, at 11:20)
Got curious, found the link to the ruby-talk thread...
http://thread.gmane.org/gmane.comp.lang.ruby.general/177147/focus=177147
From: lars (Oct 14 2006, at 12:25)
There are some more options you may want to test:
Adding UTF8 methods to class String in Ruby
http://www.bigbold.com/snippets/posts/show/2786
Parsing UTF-8 encoded strings in Ruby
http://www.bigbold.com/snippets/posts/show/1659
To process each character in a UTF-8 string you could use something like:
utf8string.scan(/./u) { |c| puts c }
Cheers,
lars
From: Nick Munto (Oct 14 2006, at 14:31)
How about this approach:
utf8string.scan(/./u) { |c|
  puts c
  puts c.inspect
  puts c[0]
}
From: Ceri Storey (Oct 15 2006, at 05:07)
I'd heartily recommend the use of ICU4R (http://rubyforge.org/projects/icu4r/), a wrapper for IBM's International Components for Unicode (http://icu.sourceforge.net/), which essentially gives you a UString type which is convertible to and from the native String type. Duck typing aside, this distinction can help to ensure that you don't end up with doubly encoded UTF-8, and other nonsense.
Otherwise, it gives you useful things like localised collation, and the ability to break down a string into glyphs (so you end up with a base character, and following combining characters).
From: Dominic Mitchell (Oct 15 2006, at 10:46)
Ceri: icu4r works great, but I had heard rumours that it's unmaintained. I haven't checked though.
Tim: You might want to also investigate the char-encodings project (http://rubyforge.org/projects/char-encodings/), which provides a C implementation of UTF-8 for Ruby in a manner which is expected to be compatible with Ruby 2.0.
From: Keith Fahlgren (Oct 15 2006, at 21:04)
> "I’ve asked the ruby-talk mailing list why unpack doesn’t take a block; let’s see what they say."
Well, it's nice when the response includes the language designer saying, essentially: "that's interesting, I just implemented what you asked for and checked it into HEAD"... Ruby is nice.
From: Venkat (Oct 15 2006, at 21:11)
Dear Tim:
I do enjoy reading your blog. Have you heard anything about the problems with ActiveSupport::Multibyte since your comment here? I am planning on using it at a later stage for one of the projects I am working on.
Thanks,
Venkat.
From: Ken Pollig (Oct 16 2006, at 07:34)
A way to process each char is:
require 'encoding/character/utf-8'
u("utf8string").each_char { |c| puts c }
This seems to work pretty well although gem check -a warns:
character-encodings-0.2.0 has 1 problems
/usr/lib/ruby/gems/1.8/cache/character-encodings-0.2.0.gem:
Unmanaged files in gem: ["lib/encoding/character/utf-8/utf8.bundle", "lib/encoding/character/utf-8/unicode.h", "ext/encoding/character/utf-8/mkmf.log", "ext/encoding/character/utf-8/Makefile", "ext/encoding/character/utf-8/gem_make.out"]
Any ideas what went wrong?