Keeping it Small and Simple

2008.01.02

Extending Ruby’s String#downcase to handle more letters

Filed under: Ruby Programming — Tags: — Lorenzo E. Danielsson @ 00:41

If you need to work with strings that contain letters like å, ä and ö (which are used in Swedish) or ɛ, ɔ and ŋ (which are used in Ga) then you already know that Ruby’s case-conversion methods (String#capitalize, String#downcase and String#upcase) don’t work the way you want.

Here is a small program that tries to convert two strings to lower case letters. The first is the name of a world-famous (yes, it is, where have you been?) song from Dalarna, Sweden. The second is the name of October in Ga.


#! /usr/bin/ruby

puts "JA' VILL HA KÖRV!".downcase
puts "ANTƆŊ".downcase

When I try to run this program I get the following output:

% ruby not-what-i-want.rb
ja' vill ha kÖrv!
antƆŊ

Hm... notice how some of the letters didn’t get downcased. Thank #{DEITY || "goodness"} we happen to be using Ruby, which was created specifically to make life pleasant! If the defaults don’t suit us, we alter them. In this case we reopen the String class and redefine downcase to behave according to our definition of good behavior.


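#! /usr/bin/ruby

class String
  # Map each upper-case letter to its lower-case counterpart, including
  # the Swedish (Å, Ä, Ö) and Ga (Ɛ, Ɔ, Ŋ) letters.
  def downcase
    self.tr 'A-ZÅÄÖƐƆŊ', 'a-zåäöɛɔŋ'
  end
end

puts "JA' VILL HA KÖRV!".downcase
puts "ANTƆŊ".downcase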

Let’s see how it works:


% ruby what-i-want.rb
ja' vill ha körv!
antɔŋ

Simple, eh? You can use the same principle on String#upcase and String#capitalize as well.
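The same idea applied to the other two methods might look like the sketch below. It assumes Ruby 1.9 or later (so that self[0] returns a one-character string) and a UTF-8 source file. The constant names EXTENDED_UPPER and EXTENDED_LOWER are made up for this example; only the character sets come from the code above.

```ruby
class String
  # Hypothetical constant names; the letters are the ones used in downcase above.
  EXTENDED_UPPER = 'A-ZÅÄÖƐƆŊ'
  EXTENDED_LOWER = 'a-zåäöɛɔŋ'

  # Swap the two tr arguments to go from lower case to upper case.
  def upcase
    tr(EXTENDED_LOWER, EXTENDED_UPPER)
  end

  # Upper-case the first character and lower-case everything after it.
  def capitalize
    return dup if empty?
    self[0].tr(EXTENDED_LOWER, EXTENDED_UPPER) +
      self[1..-1].tr(EXTENDED_UPPER, EXTENDED_LOWER)
  end
end

puts "antɔŋ".upcase          # ANTƆŊ
puts "ÖSTERSUND".capitalize  # Östersund
```

Like the downcase above, these only know about the letters listed in the two constants. Anything else passes through unchanged.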

4 Comments »

  1. Well understood. But permit me to ask a question.
    It’s about the self thing in the code. What is it referring to? Is it the string, the current method, or what? A little light on this issue will do some of us some good.

    Comment by bbaka — 2008.01.04 @ 21:07

  2. In Ruby, self refers to the current instance of the object, just like “this” does in Java and C++. So, for instance, in the code above, self refers to the current String object.

    Of course, in my sample, self is not needed. I could have left the “self” part out and the code would have worked equally well. Occasionally I add a redundant self when I think it leads to greater clarity (in the example, I wanted to point out that tr is a method of the String class itself).

    Comment by Lorenzo E. Danielsson — 2008.01.05 @ 00:23

  3. Are you sure tr works with Unicode chars? I’m sure it didn’t work for me the last time I tried!

    Comment by David — 2008.02.17 @ 21:43

  4. The results you see are directly yanked-and-put from my terminal output. I’m very bad at character sets, but the six I use are all part of ISO 8859, or? I know å, ä and ö are at least.

    I think you’ll get into trouble once you go beyond the first 256 characters in the Unicode set. So, for example 漢字 will give you problems.

    But don’t take my word for any of this. To me character sets are mostly black magic. The above code works for me. The particular problem I needed to solve was limited to the Ga alphabet which extends the English one with ɛ, ɔ and ŋ.

    Comment by Lorenzo E. Danielsson — 2008.02.17 @ 22:15
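On the question of whether tr copes with characters beyond the first 256 code points, a quick check is possible with the letters from the post itself. This is a small sketch, assuming Ruby 1.9 or later and a UTF-8 source file; the variable names are illustrative only. It prints the code points of the three Ga letters, all of which lie above 255, and then runs them through tr.

```ruby
# The three extra lower-case letters of the Ga alphabet.
ga_lower = %w[ɛ ɔ ŋ]

# String#ord returns the code point of the first character.
code_points = ga_lower.map(&:ord)
puts code_points.inspect              # [603, 596, 331]

# tr maps these characters like any others.
result = "ANTƆŊ".tr('A-ZƐƆŊ', 'a-zɛɔŋ')
puts result                           # antɔŋ
```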

