In preparation for my presentation next weekend at RubyConf, I’ve been poking around at Ruby’s string-handling. One thing that text-wranglers such as me like to do is walk through a string a character at a time, and Ruby doesn’t make this particularly easy. I ended up implementing String#each_char_utf8 three times along the way. [Update: Lots of interesting feedback, and a worth-reading ruby-talk thread.]
I poked around in ActiveSupport::Multibyte (I can’t get it to work on my computer, but I can read the source). It appears that the only way to look into a Ruby string and see Unicode characters is with unpack('U*'), or using a regexp with $KCODE set to 'u'.
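For instance, a minimal sketch of both approaches; the string is my own example, and this assumes Ruby 1.8, where $KCODE governs regexp behavior:

s = "na\xc3\xafve"      # "naïve" as raw UTF-8 bytes

# unpack('U*') decodes the whole string into Unicode codepoints
p s.unpack('U*')        # => [110, 97, 239, 118, 101]

# with $KCODE set to 'u', a regexp matches characters, not bytes
$KCODE = 'u'
p s.scan(/./).size      # => 5, though the string holds 6 bytes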
So, I want to go walking through a string, looking at the Unicode characters. Maybe I’m parsing a big XML file using mmap(2) or some such. What I want is an efficient String#next_char.
This will be hard in Ruby in the general case because Strings don’t know what encoding they’re in; there’s $KCODE, but that’s only defined to work with regular expressions. So let’s look at the special case of UTF-8.
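To see the problem concretely, here’s what a Ruby 1.8 string looks like from the inside (my example; $KCODE has no effect on any of this):

$KCODE = 'u'
s = "na\xc3\xafve"   # five characters in six bytes

p s.length           # => 6, counting bytes, not characters
p s[2]               # => 195, the first byte of "ï", not the character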
Of course, if String#unpack took a block like String#gsub, that would give you the tools you need. I looked at pack.c and it would be real work, but it doesn’t look architecturally impossible. Failing that, let’s use unpack anyhow:
def each_utf8_unpack(s, &block)
  s.unpack('U*').each(&block)   # decode the whole string up front, then iterate
end
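Called like so (the string is just my example):

each_utf8_unpack("na\xc3\xafve") { |u| print u, ' ' }
# prints: 110 97 239 118 101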
The above sucks for big strings because you create a monster array of integers. Regular expressions are maybe a little more efficient (this depends on the $KCODE setting, obviously):
def each_utf8_regex(s)
  s.gsub(/./m) do |c|
    yield c.unpack('U').shift   # the matched character's codepoint
    ''                          # replace with nothing; we only want the walk
  end
end
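Usage is the same, except that $KCODE has to be 'u' so that /./ sees characters rather than bytes (again, my example string):

$KCODE = 'u'
each_utf8_regex("na\xc3\xafve") { |u| print u, ' ' }
# prints: 110 97 239 118 101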
The unpack voodoo is because I want the integer value of the Unicode character. Here is a more ambitious version, extending String and picking the UTF-8 apart a byte at a time:
class String

  # Walk the string, yielding each character's Unicode codepoint.
  def each_char_utf8
    @utf8_index = 0
    while @utf8_index < length   # length counts bytes in Ruby 1.8
      yield next_utf8_char
    end
  end

  def next_byte
    b = self[@utf8_index]        # in Ruby 1.8, self[i] is the byte as a Fixnum
    @utf8_index += 1
    return b
  end

  # Look at the lead byte; return the sequence length and the lead
  # byte's payload bits.
  def first_utf8
    b = next_byte
    if    b & 0x80 == 0    then return 1, b
    elsif b & 0xe0 == 0xc0 then return 2, b & 0x1f
    elsif b & 0xf0 == 0xe0 then return 3, b & 0x0f
    else                        return 4, b & 0x07
    end
  end

  # Each continuation byte contributes six bits.
  def next_6bits
    next_byte & 0x3f
  end

  def next_utf8_char
    len, c = first_utf8
    case len
    when 2
      c = (c << 6) | next_6bits
    when 3
      c = (c << 12) | (next_6bits << 6) | next_6bits
    when 4
      c = (c << 18) | (next_6bits << 12) | (next_6bits << 6) | next_6bits
    end
    return c
  end
end
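A quick sanity check, not an exhaustive test, that the byte-at-a-time walk agrees with unpack:

s = "na\xc3\xafve"
chars = []
s.each_char_utf8 { |u| chars << u }
p chars == s.unpack('U*')   # => true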
I’m pretty sure the above has more or less the right semantics, but it’s a candidate for implementation in C (or Java, for JRuby).
I tested the performance by running all three versions over 2,000,000 bytes of ongoing text, containing a few thousand non-ASCII characters, and doing a character frequency count. Both the regex version and the byte-at-a-time version took around 18 seconds on my PowerBook; the unpack version took less than 5.
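For the record, the harness looked more or less like this; it’s a sketch, and 'ongoing.txt' is a hypothetical stand-in for my two megabytes of ongoing text:

$KCODE = 'u'
text = File.open('ongoing.txt', 'rb') { |f| f.read }

freq = Hash.new(0)                         # codepoint => count
start = Time.now
text.each_char_utf8 { |u| freq[u] += 1 }   # swap in either of the other versions
printf("%.1f seconds, %d distinct characters\n", Time.now - start, freq.size)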
I’ve asked the ruby-talk mailing list why unpack doesn’t take a block; let’s see what they say.
Comments:
From: Jerome Lacoste (Oct 14 2006, at 11:20)
Got curious, found the link to the ruby-talk thread...
http://thread.gmane.org/gmane.comp.lang.ruby.general/177147/focus=177147
From: lars (Oct 14 2006, at 12:25)
There are some more options you may want to test:
Adding UTF8 methods to class String in Ruby
http://www.bigbold.com/snippets/posts/show/2786
Parsing UTF-8 encoded strings in Ruby
http://www.bigbold.com/snippets/posts/show/1659
To process each character in a UTF-8 string you could use something like:
utf8string.scan(/./u) { |c| puts c }
Cheers,
lars
From: Nick Munto (Oct 14 2006, at 14:31)
How about this approach:
utf8string.scan(/./u) { |c|
  puts c
  puts c.inspect
  puts c[0]
}
From: Ceri Storey (Oct 15 2006, at 05:07)
I'd heartily recommend the use of ICU4R (http://rubyforge.org/projects/icu4r/), a wrapper for IBM's International Components for Unicode (http://icu.sourceforge.net/), which essentially gives you a UString type which is convertible to and from the native String type. Duck typing aside, this distinction can help to ensure that you don't end up with doubly encoded UTF-8, and other nonsense.
Otherwise, it gives you useful things like localised collation, and the ability to break down a string into glyphs (so you end up with a base character, and following combining characters).
From: Dominic Mitchell (Oct 15 2006, at 10:46)
Ceri: icu4r works great, but I had heard rumours that it's unmaintained. I haven't checked though.
Tim: You might want to also investigate the char-encodings project (http://rubyforge.org/projects/char-encodings/), which provides a C implementation of UTF-8 for Ruby in a manner which is expected to be compatible with Ruby 2.0.
From: Keith Fahlgren (Oct 15 2006, at 21:04)
> "I’ve asked the ruby-talk mailing list why unpack doesn’t take a block; let’s see what they say."
Well, it's nice when the response includes the language designer saying, essentially: "that's interesting, I just implemented what you asked for and checked it into HEAD"... Ruby is nice.
From: Venkat (Oct 15 2006, at 21:11)
Dear Tim:
I do enjoy reading your blog. Have you heard anything about the problems with ActiveSupport::Multibyte since your comment here? I am planning on using it at a later stage for one of the projects I am working on.
Thanks,
Venkat.
From: Ken Pollig (Oct 16 2006, at 07:34)
A way to process each char is:
require 'encoding/character/utf-8'
u("utf8string").each_char { |c| puts c }
This seems to work pretty well although gem check -a warns:
character-encodings-0.2.0 has 1 problems
/usr/lib/ruby/gems/1.8/cache/character-encodings-0.2.0.gem:
Unmanaged files in gem: ["lib/encoding/character/utf-8/utf8.bundle", "lib/encoding/character/utf-8/unicode.h", "ext/encoding/character/utf-8/mkmf.log", "ext/encoding/character/utf-8/Makefile", "ext/encoding/character/utf-8/gem_make.out"]
Any ideas what went wrong?