Tuesday, August 26, 2008

Moving away from blogspot

This will be my last post on this blog. For several reasons I like the idea of keeping more control over my blog and the environment surrounding it. I also have some things I'd like to publish that aren't well suited for the blog format, and moving to another location means that I can keep all my content in the same place. Longer term, I'm planning on migrating information about my open source projects there too.

But what you need to know is this: This blog ends. A new blog is born. All my old entries have been migrated. The important addresses for the new blog are:
And that's it. The new content will obviously be available at http://olabini.com, but right now this site just redirects to the blog.

The blog is dead, long live the blog.

Monday, August 25, 2008

ThoughtWorks Sweden is available

I would like to announce that ThoughtWorks Sweden is now in motion. We have business cards and an office. Everyone is returning from their long lovely Swedish summer vacations.

This means that ThoughtWorks Sweden is ready, and available for work. If you or your business have a project you need help with, don't hesitate to contact me (at obini@thoughtworks.com) or Marcus Ahnve (at mahnve@thoughtworks.com).

We are located in Stockholm, but we are open for work anywhere in the Nordic region.

So what kind of work are we most suited for? Our sweet spot is delivery and technical advisory work around Java, Ruby and JRuby. And if you're interested in understanding how our Agile approach can change your company, we can do organizational transformation projects as well as coaching and advisory.

Don't hesitate to get in touch!

Sunday, August 17, 2008

JtestR 0.3.1 Released

JtestR allows you to test your Java code with Ruby frameworks.

Homepage: http://jtestr.codehaus.org
Download: http://dist.codehaus.org/jtestr

JtestR 0.3.1 is the current release of the JtestR testing tool. JtestR integrates JRuby with several Ruby frameworks to allow painless testing of Java code, using RSpec, Test/Unit, Expectations, dust and Mocha.

Features:
- Integrates with Ant, Maven and JUnit
- Includes JRuby 1.1, Test/Unit, RSpec, Expectations, dust, Mocha and ActiveSupport
- Customizes Mocha so that mocking of any Java class is possible
- Background testing server for quick startup of tests
- Automatically runs your JUnit and TestNG codebase as part of the build

Getting started: http://jtestr.codehaus.org/Getting+Started

New in the 0.3.1 release are an upgrade of JRuby to revision r7479 (which includes several new Java integration features), an upgrade of ActiveSupport to 2.1.0, a fix for a severe memory leak in the background server, and some minor usability improvements.

New and fixed in this release:
JTESTR-50 Difference in functionality when stubbing a method on a Java class vs a Ruby class using mocha
JTESTR-51 Mocking of classes lacking default constructors results in a NameError
JTESTR-53 Push the JtestR JRuby builds to maven repos
JTESTR-56 Upgrade ActiveSupport
JTESTR-57 Make it possible to use local versions of libraries.
JTESTR-59 No output when no tests found.
JTESTR-60 OutOfMemoryError
JTESTR-61 Documentation improvements - ant test-server
JTESTR-62 Having the jtestr.jar in the base directory doesn't work
JTESTR-63 Update JRuby version

Thursday, August 14, 2008

Where is the Net::SSH bug?

Yesterday I spent several hours trying to find the problem with our implementation of OpenSSL Cipher that caused the Net::SSH gem to fail miserably during negotiation and password verification. After various false leads I finally found the reason for the strange behavior. But I really can't decide if it's a bug, and if it is a bug, where the bug is. Is it in Ruby's interface to OpenSSL, or is it in Net::SSH?

No matter what cipher suite you use for SSH, you generally end up using a block cipher, mostly something like CBC. That means an IV (initialization vector) is needed, together with a key. The relevant parts of OpenSSL used are the EVP_CipherInit, EVP_CipherUpdate and EVP_CipherFinal family of functions. Nothing really strange there. The Ruby interface matches these functions quite closely; every time you set a key, or an IV, or some other parameter, CipherInit is called with the relevant data. When CipherUpdate is called, the actual enciphering or deciphering starts happening, and CipherFinal takes care of the final block.
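To make the flow concrete, here is a minimal sketch of what this looks like from Ruby (the cipher name, key, IV and data are just placeholder examples, not anything Net::SSH actually uses):

require 'openssl'

# Minimal sketch of the Ruby cipher flow. Each of the key=/iv= assignments
# ends up calling EVP_CipherInit under the covers; update and final map to
# EVP_CipherUpdate and EVP_CipherFinal.
cipher = OpenSSL::Cipher.new("aes-128-cbc")
cipher.encrypt                    # set direction
cipher.key = "0123456789abcdef"   # 16-byte key
cipher.iv  = "fedcba9876543210"   # 16-byte IV
ciphertext = cipher.update("some plaintext data") + cipher.final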

At the point EVP_CipherFinal is called, nothing more should be done using the specific Cipher context. Specifically, no more Update operations should be used. The man page has this to say about the Final-methods:
After this function is called the encryption operation is finished and no further calls to EVP_EncryptUpdate() should be made.

Now, what I found was that this warning is not part of the Ruby documentation. And Net::SSH actually reuses the same Cipher object after final has been called on it. Specifically, it continues the conversation, calling update a few times and then final. The general flow for a specific Cipher object in Net::SSH is basically init->update->update->final->update->update->final.

So what is so bad about this then? Well, the question is really this: what IV will the operations after the first final call be using? The assumption I made is that obviously it will use the original IV set on the object. Anything else would seem absurd. But in fact, the IV used is actually the last IV-length bytes of encrypted data returned. Is this an obvious or intended effect at some level? Probably not, since the OpenSSL documentation says you shouldn't do it. The reason it works that way is that the temporary buffer used in the Cipher context isn't cleared out at the end of the call to final.

In contrast, the Java Cipher object will call reset() as part of the call to doFinal(), where reset() actually resets the internal buffers to use the original IV. So the solution is simple for encryption: just save away the last 8 or 16 bytes of generated crypto text and set that manually as the IV after the call to doFinal. And what about decryption? Well, here the IV needs to be the last crypto text sent in for deciphering, not the result of the last operation.
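To illustrate the workaround, here is a hedged Ruby sketch of the idea only - not JRuby-OpenSSL's actual implementation, and the class name is made up. For encryption it boils down to manually chaining the IV between operations:

require 'openssl'

# Sketch of the IV-chaining workaround described above. After each final,
# the last ciphertext block is installed as the IV for the next operation,
# which is the behavior Net::SSH implicitly relies on. For decryption the
# chained IV would instead be the last block of ciphertext fed in.
class ChainedEncryptor
  BLOCK_SIZE = 16

  def initialize(key, iv)
    @cipher = OpenSSL::Cipher.new("aes-128-cbc")
    @cipher.encrypt
    @cipher.key = key
    @cipher.iv  = iv
  end

  def process(data)
    out = @cipher.update(data) + @cipher.final
    @cipher.iv = out[-BLOCK_SIZE, BLOCK_SIZE]  # chain the IV manually (re-inits the context)
    out
  end
end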

So Net::SSH seems to work fine with JRuby now. I'm about to release a new version of JRuby-OpenSSL including these and many other things.

But the question remains. Is it a bug? If it is, is it in the Ruby OpenSSL integration, or in the Net::SSH usage of Ciphers? If it's in the Net::SSH code, why does it actually work correctly when communicating with an SSH server? Or is this behavior of using the last crypto text as the IV something documented in the SSH spec?

Enlightenment would be welcome.

Sunday, August 10, 2008

Security vs Convenience

I really like Cryptogram and read every issue. It's interesting stuff that talks a lot about how our minds work in conjunction with risk and reward. Today I had a typical example of how security versus convenience is a part of day to day life.

I had just checked out from my hotel, and wanted to store all my luggage (including my laptop bag) in the hotel until my ride out of town arrived. I asked about this, and it was fine, they had a room for it. The person at the reception desk pointed me to an open room and said it was open and that I could put my stuff there. Feeling uneasy, I asked how secure it was, and she answered that the door was usually locked. OK, I said, but can someone take any bag from inside of there? Yes, was the answer. I decided I couldn't store my stuff there. Even if the risk was small, losing my work laptop would be way too bad to risk. But I also decided I couldn't drag my two heavy bags and laptop bag around.

I ended up putting the large bags in the room, and just taking my laptop bag with me. I didn't have as much to lose with the large bags, and the price of the inconvenience in taking them along was just too high. These considerations go into everything we do in programming and systems engineering. A totally secure system is generally quite inconvenient to use, while an insecure system can be very pleasant to use. The trick is to get the balance right, I guess.

JtestR doesn't start up.

Justin Smestad uncovered an issue with JtestR that can cause some quite unintuitive output, and be hard to debug. Some info can be found here: http://www.evalcode.com/2008/08/jtestr-woes/ and here: http://jira.codehaus.org/browse/JTESTR-62. The issue has been fixed on trunk, but hasn't been released yet. The issue is very simple - just make sure you don't have the jtestr.jar file in the base directory where your project lives (this is usually the same place as the build.xml file). There are two ways to achieve this, either move the file into a directory or rename the file to something else.

Friday, July 04, 2008

Java and mocking

I've just spent my first three days on a project in Leeds. It's a pretty common Java project, RESTful services and some MVC screens. We have been using Mockito for testing, which is a first for me. My immediate impression is quite good. It's a nice tool and it allows some very clean testing of stuff that generally becomes quite messy. One of the things I like is how it uses generics and the static typing of Java to make it really easy to create mocks that are actually type checked; like this, for example:
Iterator iter = mock(Iterator.class);
stub(iter.hasNext()).toReturn(false);

// Call stuff that starts interaction

verify(iter).hasNext();
These are generally the only things you need to stub stuff out and verify that it was called. The things you don't care about you don't verify. This is pretty good for being Java, but there are some problems with it too. One of the first things I noticed I don't like is that interactions that aren't verified can't be disallowed in an easy way. Optimally this would happen at the creation of the mock, instead of having to call verifyNoMoreInteractions() afterwards. It's way too easy to forget. Another problem that quite often comes up is that you want to mock out or stub some methods but retain the original behavior of others. This doesn't seem possible, and the alternative is to manually create a new subclass for this. Annoying.

Contrast this to testing the same interaction with Mocha, using JtestR. The difference isn't that big, but some of the cruft is gone:
iter = mock(Iterator)
iter.expects(:hasNext).returns(false)

# Call stuff that starts interaction
Ruby makes the checking of interactions happen automatically afterwards, and since you don't have any types, you don't need to care about most of the stuff you do in Java. This also shows a few of the inconsistencies in Mockito that are necessary because of the type system. For example, with the verify method you send the mock as an argument, and the return value of the verify method is what you call the actual method on, to verify that it's actually called. Verify is a generic method that returns the same type as the argument you give to it. But this doesn't work for the stub method. Since it needs to return a value that you can call toReturn on, it can't actually return the type of the mock, which in turn means that you need to call the method to stub before the actual stub call happens. This dichotomy gets me every time, since it's a core inconsistency in the way the library works.

Contrast that to how a Mockito like library might look for the same interaction:
iter = mock(Iterator)
stub(iter).hasNext.toReturn(false)

# Do stuff

verify(iter).hasNext
The lack of typing makes it possible to create a cleaner, more readable API. Of course, these interactions are all based on how the Java code looked. You could quite easily imagine a more free form DSL for mocking that is easier to read and write.

Conclusion? Mockito is nice, but Ruby mocking is definitely nicer. I'm wondering why the current Ruby mocking approaches don't use the method-call way of defining expectations and stubs, though, since it is much easier to work with in Ruby.

Also, it was kinda annoying to upgrade from Mockito 1.3 to 1.4 and see half our tests starting to fail for unknown reasons. Upgrade cancelled.

Friday, June 27, 2008

JtestR, RubyGems, and external code

One question I've gotten a few times now that people are starting to use JtestR is how to make it work with external libraries. This is actually two different questions masquerading as one. The first one regards the libraries that are already included with JtestR, such as JRuby, RSpec or ActiveSupport. There is an open bug in JIRA for this, called JTESTR-57, but the reason I've been a bit hesitant to add this functionality until now is that JtestR actually does some pretty hairy things in places. Especially the JRuby integration does ClassLoader magic that can potentially be quite version dependent. The RSpec and Mocha integration is the same. I don't actually modify these libraries, but the code using them is a bit brittle at the moment. I've worked on fixing this by providing patches to the framework maintainers to include the hook functionality I need. This has worked with great success for both Expectations and RSpec.

That said, I will provide something that allows you to use local versions of these libraries, at your own risk. It will probably be part of 0.4, and if you're interested JTESTR-57 is the one to follow.

The second problem is a bit more complicated. You will have seen this problem if you try to do "require 'rubygems'". JtestR does not include RubyGems. There are both technical and non-technical reasons for this. Simply put, the technical problem is that RubyGems is coded in such a way that it doesn't interact well with loading things from JAR-packaged files. That means I couldn't distribute the full JtestR in one JAR file if I wanted RubyGems, and that's just unacceptable. I need to be able to bundle everything in a way that makes it easy to use.

The non-technical reason is a bit more subtle. If RubyGems can be used in your tests, it encourages locally installed gems. It's a bit less pain to do it that way initially, but remember that as soon as you check the tests in to version control (you are using version control, right?) they will break in unexpected ways if other people using the code don't have the same gems installed, with the same versions.

Luckily, it's quite simple to provide this functionality to JtestR, even though no gems are used. The first step is to create a directory that contains all the third party code. I will call it test_lib and place it in the root of the project. After you have done that, you unpack your gems into it:
mkdir test_lib
cd test_lib
jruby -S gem unpack activerecord
When you have the gems you want unpacked in this directory, you can add something like this to your jtestr_config.rb:
Dir["test_lib/*/lib"].each do |dir|
  $LOAD_PATH << dir
end
And finally you can load the libraries you need:
require 'active_record'

Saturday, June 21, 2008

TheServerSide Java Symposium Europe is over

Well, I'm home from Prague, from another edition of TheServerSide Java Symposium. This year was definitely a few notches up from last year in Barcelona in my opinion. And being in beautiful Prague didn't really cause any trouble either. =)

I landed on Tuesday, and worked quite heavily on my talks. Due to the ThoughtWorks AwayDay I was really down to the last second with my two slide decks. But I still got to see parts of the city in the evening. Very nice.

I managed to sleep through the opening keynote, but dragged myself down to the main room to watch the session on Spring Dynamic Modules. This ended up being more about OSGi style things than really dynamic things, so I felt a bit cheated, and kept on working on my slides instead. Before lunch I sat in on Alex Popescu's talk about scripting databases with Groovy. Overall a very good overview of the database landscape from a Groovy point of view, ranging from just using the language to make the JDBC APIs more flexible, through building a builder-style DSL for working with SQL, to the full blown GORM framework. All in all quite nice. But the funniest part was definitely people's reaction to the SQL DSL, where most in the room preferred the real SQL to the Groovy version.

After lunch I had planned to see the session that compared different dependency injection frameworks, but the speaker never showed up, so I found myself listening to info about JSR-275, that provides support for units in a monetary system. Quite useful if you're working in that domain, but at the same time it felt like this would look so much cleaner in Ruby. Of course, that's how I react to most Java code nowadays.

Holly Cummins gave a very good talk about Java performance tooling. Of course it came with a slight IBM slant, but that's fair. The tools built around their JVM are actually really good for identifying several kinds of performance problems. So I'm actually of a mind to try JRuby on the IBM JVM and see if we can glean some more interesting information from that.

Geert gave his Terracotta talk about JVM clustering, and it's really interesting if you haven't seen it before. In this case I took the opportunity to listen while working on my slides.

And that was the end of day one.

Day two I was a good boy and was actually up in time for the keynote. This might have something to do with the fact that it was Neal Ford giving it, and he talked about Language-Oriented Programming. This is one of my favorite topics, and I'd only seen his slides to this talk before, not heard him give it. If you've been following the discussions about polyglot programming, the content made lots of sense. If you don't believe in polyglot programming, you might have been convinced.

After the keynote, it was time for breakfast, so I didn't see the sessions in that slot. After breakfast I sat in on Guillaume's Groovy in the Enterprise: Case Studies. While the presentation was good, he spent more than half of it just giving an introduction to Groovy. I'm not one to throw stones in glass houses, though, so I have to admit that this is something I can be found guilty of too. I'm trying to improve on this, though. It does a disservice to the audience if they have to sit through the same kind of intro they might already have seen to get to the actual meat. That's one of the reasons I tried to minimize introductory material in my testing session.

It was also in this session that a slide with the words "Groovy is the fastest dynamic language on the JVM" showed up. That's based only on the Alioth benchmarks, and it doesn't actually matter if it's true or not. It's a disservice to the audience. Especially in this case where, even if Groovy actually is faster than JRuby on average, we are talking a maximum of 1-2% on average. The speed differences aren't really why you would be interested in using such a language, and in my opinion Groovy has got lots of other interesting features you can use to sell and market it. In summary, it felt a bit unnecessary.

Directly after that session, Ted Neward, Guillaume and I were featured in a panel on the languages of the next generation. Eugene Ciurana, who was supposed to moderate, didn't show up, so John Davies and Kirk Pepperdine had to jump in instead. It ended up being quite fun, but there was no real heat in the discussion. In something like this, I think it would be useful to have someone with different views to spice it up. Ted, Guillaume and I just agree about these things way too much. But we got some nice Czech vodka. That was good. =)

After lunch I spent more time prepping my talk, and then it was finally time to give it. This was the JRuby on Rails introduction, and it ended up being quite nice. I had a good turnout, and interestingly enough, many in the audience had actually tried Ruby already.

After my session was up, I could relax, so I went to Kirk's talk about Concurrency and High Performance, which included many things to think about while working on the performance of an enterprise scale application. Very useful material.

Finally, at the end of the day it was time for the fireside chats, which is basically another word for BOFs. I sat in on the Zero Turnaround in Java Development session, which ended up not being as much discussion as I had expected, and more the three principals talking about their different approaches (RIFE, Grails and JavaRebel).

The Fireside Performance Clinic was good fun, with some useful material. In particular, knowing whether JRuby startup time is CPU or IO bound is something I have never thought about, and it might yield some interesting insights.

Day three felt a bit slower, as the last day usually does. The first session for me was Ted's Scala talk. I've seen it a few times before, but the most interesting part is actually the audience questions. As usual I wasn't disappointed. And Ted did his regular thing and weaved me into the examples. One of the funnier bits was when he was explaining the differences between var and val in Scala, and he decided that it might be good to be able to switch my surname. Then came the killer, where he said something like this: "well, and you might want to change the surname of Ola. Since Ola was just married, congratulations by the way, and he's from Sweden where the husband generally takes the surname of the wife, so we need to change his surname". At that point I had a hard time keeping it together.

The session on what's new and exciting in JPA 2 ended up not exciting me at all, so frankly I don't remember anything at all about that. I have vague blurry images of many at-signs.

Shashank Tiwari gave a presentation on how to choose your web framework, and this generated some discussion that was quite interesting. At this point I still wasn't finished with the examples for my testing session, though, so I had to work on them. And I finally managed to finish them. Because lo, at that time I did the presentation on testing with JRuby. I spent some time on the different Ruby testing frameworks, first showing off how you can test Ruby code with them. Then I switched the model to a Java class, and used basically the same tests again. The cutest example is probably my story about a Stack. Not a literary masterpiece, but it's still prose.

People seemed to like the session and get something out of it, and that feels great since this was the first time I showed JtestR to a larger group of people. My mocking domain consisting of Primates, Food and Factories also seemed to hit home. I got the expected laughs at the source code line where a Chimpanzee tries to eat Tuna and "throw new Up();".

Typesafe Embedded Java DSLs basically talked about how you can use the standard generic builder patterns to create DSLs that your IDE can help you quite a lot with. Sadly, my computer decided to give me a heart attack during this presentation, so I had to run out and give it CPR instead of sitting in on the rest of the session.

And that was TSSJS-E. For me, the first day was quite weak, but the content of the other two days was definitely extremely good. I can recommend it to anyone next year.

Wednesday, June 18, 2008

Testing programming language implementations

While writing the post yesterday about testing regular expressions, I realized that this problem is not really specific to regular expressions. I got a very good comment noting that testing any place that uses some kind of DSL is definitely prudent. SQL is another example.

But these examples are both about actually testing the usage of them, and the problem becomes that you have two languages, but you're mostly only testing the code written in the outer language. This is due to several reasons. One of the most obvious ones is that our tools really don't make it that easy to do.

Thinking about these issues made me start thinking about how we generally test languages. Having worked on several language implementations, on both new languages and implementations of existing languages, I've come to the conclusion that the whole area of testing languages is actually quite complicated, and that there are no real best practices for doing it.

First, there is a problem of terminology. Many language implementations have test suites that are really executable specifications of how the language should work. What's the difference? Well, testing the language according to such a spec, you are really only doing functional, black-box testing. I've looked at several of the open source language implementations, and I don't really see much usage of anything other than such language spec tests. This basically means that some parts of the implementation can be implemented wrongly, and by some freak chance they still work correctly in all the cases you have tests for, but might fail in other ways.

Unit tests for the actual implementation would help with this - it helps since you will be doing TDD on the unit level, and it helps because you make a conscious decision about the implementation and what it should be doing in these cases. It still doesn't make everything clear cut and simple, but it absolutely would help. So why don't most implementations do unit testing of the internals? I don't really know. Maybe it's because implementations can be extremely complicated. But that should be a reason for testing more, not testing less. One reason I have some sympathy for is that it makes larger changes quite hard. Large refactorings are one of the ways JRuby has used to get incredible performance improvements and new subsystems, but unit tests can sometimes act as inertia against these.

I'm totally disregarding the academic approaches here. Yeah, in some cases, for simple languages, you can actually prove that it does what you want it to do, and for small enough implementations using a suitable language, you can actually prove the same things about the implementation. The problem is that this approach doesn't scale.

And since a language almost always is Turing complete, that means you can't exhaustively test it. There is no way of testing all permutations - either manually or automatically. So what should a language spec do? The first thing that many languages do is to specify that whole areas of functionality result in undefined behavior. That makes it easier. But the real problems appear when you start combining different features which can interact in different ways.

At the end of the day, I have no idea how to actually do this well. I would like to know, though - how should I test the implementation, and how should I write an executable language specification? And these questions don't even touch on the question of testing the core libraries. Many of the same problems apply, but it gets even more complicated.

Local things in Emacs

This is just a small note, since this has bugged me for a while. Basically, I have lots of extra key bindings running around in my Emacs configuration. Now, I use local-set-key for many of these. The problem is I hadn't actually read the documentation for local-set-key carefully enough.

One example that annoyed me was this: I had some local key bindings for RSpec buffers, that differed from the regular Ruby buffers. My RSpec minor mode still uses the ruby-mode-map though. My assumption was that local-set-key did things exactly as all other things with "local" in their name, namely doing a buffer local modification only. I finally found out that this wasn't the case. Instead, when the RSpec minor mode was loaded for the first time, it ended up modifying the ruby-mode-map with its key bindings, which were then visible for all other Ruby buffers. Ouch.

So, if you use local-set-key, make sure you actually want to set that key in the current mode map, instead of only for the current buffer.

As far as I know, there is no way to set a real buffer local key binding without some acrobatics that unsets and resets the keys manually. I ended up solving my problem with the RSpec minor mode by having it clone the Ruby mode map and use its own mode map. Not an ideal solution, but it works for now.

Tuesday, June 17, 2008

Testing Regular Expressions

Something has been worrying me a bit lately. Being test infected and all, and working for ThoughtWorks, where testing is part of the life blood, I think more and more about these issues. And one thing I've started noticing is that regular expressions seem to be a total blind spot in many cases. I first started thinking about it when I changed a quite complicated regular expression in RSpec. Now, RSpec has coverage tests as part of its build, and if the test coverage is less than 100%, the build will fail. Since I had changed something to add new functionality, but hadn't added any tests for it, I instinctively assumed that it would be caught by the coverage tool.

Guess what? It wasn't. Of course, if I had changed the regexp to do something that the surrounding code couldn't support, one of the tests for surrounding lines of code would have caught it, but I got no mention from the coverage tool that I needed more tests to fully handle the regular expression. This is logical if you think about it. There is no way that a coverage tool could find all the regular expressions in your source code, and then make sure that all branches and alternatives of each particular regular expression were exercised. So that means that the coverage tool doesn't do anything with them at all.

OK, I can live with that, but it's still one of those points that would be very good to keep in mind. Every time you write a regular expression in your code, you need to take special care to actually exercise that part of the code with many inputs. What is many in this case? That's another part of the problem - it depends on the regular expression. It depends on how complicated it is, how long it is, how many special operators are used, and so on. There is no real way around it. To test a regular expression, you really need to understand how they work. The corollary is obvious - to use a regular expression in your code, you need to know how to test it. Conclusion - you need to understand regular expressions.

In many code bases I haven't seen any tests for regular expressions at all. In most cases they have been crafted by writing them outside the code, testing them by hand, and then putting them in the code. This is brittle to say the least. In the cases where there are tests, it's much more common that they only test positives, and not negatives. And I've seldom heard of code bases with enough tests for regular expressions. One of the problems is that in a language like Ruby they are so easy to use that you stick them in all over the place. A standard refactoring could help here, by extracting all literal regular expressions to constants. But then the problem becomes a different one - as soon as you use regular expressions to extract values from a string, it's a pain not to have the regular expression at the same place as where the extracted groups are used. Example:
PhoneRegexp = /(\d{3})-?(\d{4})-?(\d{4})/
# 200 lines of code
if phone_number =~ PhoneRegexp
  puts "phone number is: #$1-#$2-#$3"
end
If the regular expression had been at the same place as the usage of the $1, $2 and $3 it would have been easy to tie them to the parts of the string. In this case it would be easy anyway, but in more complicated cases it's more complicated. The solution to this is easy - the dollar numbers are evil: don't use them. Instead use an idiom like this:
area, number, extension = PhoneRegexp.match(phone_number).captures
In Ruby 1.9 you will be able to use named captures, which will make it even easier to make readable usage of the extracted parts of a string. But the fact is, the distance between the usage point and the definition point can still cause trouble. A way of getting around this would be to take any complicated regular expression and put it inside a specific class for only that purpose. The class would then encapsulate the usage, and would also allow you to test the regular expression more or less in isolation. In the example above, maybe creating a PhoneNumberParser would be a good idea.
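A hedged sketch of what that could look like (class and test names are made up for illustration), which also gives the regular expression an obvious home for both positive and negative tests:

# Hypothetical wrapper class that encapsulates the regular expression so it
# can be tested in isolation.
class PhoneNumberParser
  PhoneRegexp = /(\d{3})-?(\d{4})-?(\d{4})/

  # Returns [area, number, extension], or nil if the string doesn't match.
  def self.parse(phone_number)
    match = PhoneRegexp.match(phone_number)
    match && match.captures
  end
end

require 'test/unit'

class PhoneNumberParserTest < Test::Unit::TestCase
  def test_parses_dashed_number
    assert_equal %w(123 4567 8901), PhoneNumberParser.parse("123-4567-8901")
  end

  def test_rejects_non_numbers
    assert_nil PhoneNumberParser.parse("not a phone number")
  end
end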

At the end of the day, regular expressions are an extremely complicated feature, and in general we don't test our usage of them enough. So you should start. Begin by creating both positive and negative tests for them. Figure out the boundaries, and see where they can go wrong. Know regular expressions well enough to know what happens in these strange circumstances. Think about unicode characters. Think about whitespace. Think about greedy and lazy matching. As an example of something that took a long time to cause trouble: what's wrong with this regexp that tries to discern if a string is a select statement or not?
/^\s*\(*\s*SELECT\W+/i
And this example actually covers most of the ground already. It checks case insensitively. It checks for white space before any optional parenthesis, and for any white space after. It makes sure that the word SELECT isn't continued, by checking for at least one non-word character. So what's wrong with it? Well... It's the caret. Imagine if we had a string like this:
"INSERT INTO foo(a,b,c)\nSELECT * FROM bar"
The regular expression will in fact match this, even though it's not a select statement. Why? Well, it just so happens that the caret matches the beginning of lines, not the beginning of the string. The dollar sign works the same way, matching the end of lines. How do you solve it? Change the caret to \A and the dollar sign to \Z and it will work as expected. A similar problem can show up with the "." that matches any character. Depending on which language you are using, the dot might or might not match a newline. Always make sure you know which one you want, and which one you don't want.
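As a small sketch of the difference, here is the string from above run against both the original and the anchored version:

sql = "INSERT INTO foo(a,b,c)\nSELECT * FROM bar"

sql =~ /^\s*\(*\s*SELECT\W+/i    # matches at the line after the newline - wrong
sql =~ /\A\s*\(*\s*SELECT\W+/i   # => nil, correctly rejected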

Finally, these are just some thoughts I had while writing this. There is much more advice to give, but it can be condensed to this: understand regular expressions, and test them. The dot isn't as simple as it seems. Regular expressions are a full blown language, even though they're not Turing complete (in most implementations). That means that you can't test them completely, in the general case. This doesn't mean you shouldn't try to cover all eventualities.

How are you testing your regular expressions? How much?

Applications and libraries

In a recent discussion around one of Steve Yegge's blog posts, an incidental remark was that it's OK that a language makes things harder for a library creator than for an application developer. This point was made by David Pollak and Martin Odersky in relation to some of the complications that you need to handle when creating a Scala library that others can use intuitively without a full understanding of the Scala type system. Make no mistake, I have lots of respect for both Martin and David, it's just that in this case I think it's actually a quite damaging assumption to make. And they are not the only ones who reason like that either. Joshua Bloch's book Effective Java includes this assumption too, in many places.

So what's wrong with it then? Isn't there a difference between developing an application and a library? Yes, there is a difference, but it's definitely not as large as people make it out to be. And even more importantly: it _shouldn't_ be that much of a difference. The argument from David was that when creating a library in Scala, he needs to focus on and work with quite complicated parts of the type system so that the consumer gets a nice API to use the library through. This process is much harder than just using the library would be.

Effective Java contains much good advice, but most of it is from the perspective of someone who creates libraries for a living, and there are a few places where Josh explicitly says that his advice isn't necessarily applicable when writing an application, since he doesn't have that point of view.

Let's take a look at a fundamental question then. What is actually a library, and what is an application? In my opinion, a library is a module providing functionality of some kind, restricted to a specific domain. This can be a horizontal or vertical domain, that doesn't matter, but it's usually something that is usable in more than one circumstance. It's not uncommon for libraries to use other libraries to implement their functionality. An application is usually a collection of libraries that provide functionality to an end user. That end user can be a person, a program or another computer - that doesn't matter. But wait, aren't libraries usually also created to provide functionality to other pieces of code? And even though libraries have a tendency to contain more specific code, and less usage of other libraries, the line is extremely fuzzy.

The way most applications seem to be built now, most of the work is done to collect libraries, provide the missing functionality and glue them together in some way. But that doesn't mean that the code you write in the application won't be used as a library by another consumer. In fact, it's more and more common to try to reuse as much as possible, and especially when you extend an existing application, it's extremely important that you can consume the existing functionality in a sane way.

So why make the distinction? Doing that seems to me to be an excuse for writing bad code if it's in an application. Why won't we as programmers admit that we don't know if someone else will need to consume the code later, and write the best code we can, including creating usable and well thought out public APIs? Yes, the cost and time will be higher, but that's true for writing tests too. I don't see any value in arguing that libraries should be designed with more care than application code. In fact, I think that attitude is actively detrimental to the industry. And adding a feature to a language that is complicated, and then arguing that only "library developers" will need to understand it, is definitely not the right way to go. A responsible developer using a language needs to understand how that language works. Otherwise that developer will sooner or later cause a great mess. It's just a matter of time.

Sunday, June 15, 2008

JtestR 0.3 Released

JtestR allows you to test your Java code with Ruby frameworks.

Homepage: http://jtestr.codehaus.org
Download: http://dist.codehaus.org/jtestr

JtestR 0.3 is the current release of the JtestR testing tool. JtestR integrates JRuby with several Ruby frameworks to allow painless testing of Java code, using RSpec, Test/Unit, Expectations, dust and Mocha.

Features:
- Integrates with Ant, Maven and JUnit
- Includes JRuby 1.1, Test/Unit, RSpec, Expectations, dust, Mocha and ActiveSupport
- Customizes Mocha so that mocking of any Java class is possible
- Background testing server for quick startup of tests
- Automatically runs your JUnit and TestNG codebase as part of the build

Getting started: http://jtestr.codehaus.org/Getting+Started

The 0.3 release has focused on stabilizing Maven support, and adding new capabilities for JUnit integration.

New and fixed in this release:
JTESTR-47 Maven with subprojects should work intuitively
JTESTR-42 Maven dependencies should be automatically picked up by the test run
JTESTR-41 Driver jtestr from junit
JTESTR-37 Can't expect a specific Java exception correctly
JTESTR-36 IDE integration, possibility to run single tests
JTESTR-35 Support XML output of test reports

Team:
Ola Bini - ola.bini@gmail.com
Anda Abramovici - anda.abramovici@gmail.com

Tuesday, June 10, 2008

Ruby can't be good since I won't bother learning it...

Best quote this whole day, found in http://www.codinghorror.com/blog/archives/001131.html#comments.

If Ruby offered something new I would have learned it fine tbh... its just difficult enough to not be able to "pick up and run with" like almost everything else out there... but honestly, it wouldn't let me do anything I can't already do.

My brain almost exploded reading that.

Wednesday, June 04, 2008

Git completion in tcsh

So I've been a bit envious of the lovely git completion bash users have - but obviously I can't just switch to bash. Anyone who is in the same kind of situation might like the fact that I've started a project to provide this functionality for tcsh.

The first thing you need to do is download the source for tcsh 6.15, and apply the patch you can find here: http://bugs.gw.com/bug_view_advanced_page.php?bug_id=60. Without it it won't work. Compile and install the new tcsh version. The next step is to check out the project from github, at http://github.com/olabini/git_complete_tcsh. Make sure that git_complete is executable and on your path. You need to have Ruby installed for this, btw.

The final step is to modify your .cshrc to add something like this: complete git{,-*} 'p/*/`git_complete`/'.

Now git completion should work, although most of the commands aren't implemented yet. I'll get to them in time. The whole project is a port of the bash completion for git.

Tuesday, June 03, 2008

Fractal Programming

This is a continuation of my previous posts describing layers of code written in different programming languages. I have thought about the things involved for a while, and had several discussions with people about it. There were some parts that I didn't describe as well as I thought in my posts, and I will try to do better in this one.

The core of these ideas is based on polyglot programming, the idea that you should use several different languages in a project, based on which languages are better suited for different parts of it. Another term for this concept is language-oriented programming. So how do you organize a polyglot system? The most natural way for me is to divide it into layers. In most cases you will find that different categories of languages are better suited to different layers of the application.

In my original post I identified three layers that can be used to organize polyglot systems. These layers are the stable layer, the dynamic layer, and the domain layer. There are several reasons for organizing them this way, and I'll take a harder look at each of the layers further down. But first let me note that these layers are usually depicted in the form of a pyramid, with the stable layer being the base. That is definitely not how I think about it. In fact, I see it as an inverted pyramid, where the stable layer is the tip of the pyramid, providing the base. The dynamic layer is the middle part. The domain layer should be the largest part and will very often include more than one dynamic language. So in my mind I represent the different domain languages as smaller pyramids standing upside down, covering the base area. Now, the dynamic layer can also be divided into smaller parts like this, based on language or functionality. This is a bounded fractal representation, which is the reason for the title of this blog post.

This diagram shows how I think about it: [diagram of the inverted pyramids]. Of course, the smaller pyramids can all be the same language and system, or several different ones. It all depends on the application or system you are building. So you can for example use a combination of Ruby, Java and external or internal DSLs, or Clojure, Scala and JavaScript, or any other combination you can imagine. As long as the combination is what's best suited for the problem.

Let's take a look at the definitions of the different layers. There has been some discussion about the names I've chosen for them, so let me describe a little more what the responsibility of each part is, and why it's in that part of the system.

The Domain Layer
This layer is the simplest. This is where all the actual domain rules are defined. In general that means one or more domain specific languages. It doesn't really matter if they are internal or external; this model sees them as the same layer. This part of the system is what needs to be malleable enough that it should be possible to change rules in production, allow domain experts to do things with it, or just handle very complicated configuration. The languages used in this layer are mostly external DSLs, but can also include extremely DSL-friendly languages like Ruby, Python or Groovy.

The Dynamic Layer
Neal Ford argues that this layer isn't so much about being dynamic as it is about essence. That was never my intention. The problem is that even if you take a language like Scala, which is usually classified as an essential language, Scala requires compilation. To me, compilation is ceremony, which means that it's one extra thing you don't want to care about when writing most of your application code. That's why this layer needs to be dynamic. This is where languages like Ruby, Groovy, Python, JavaScript, Clojure and others live.

The Stable Layer
I view the stable layer as the core set of axioms, the hard kernel or the thin foundation that you can build the rest of your system on. There are definite advantages to having this layer written in an expressive language, but performance and static type checking are most interesting here. There is always a tradeoff in giving up static typing, and the point of having this layer is to make that tradeoff smaller. The dynamic layer runs on top of the stable layer, utilizing the resources and services it provides.

Another important feature of this layer is that this is where all interfaces are defined. By interfaces I mean external APIs. They need to be hard, so that other clients can trust them. But the implementations for them live in the dynamic layer, not in the stable one. By doing it this way you can take advantage of static type information for your APIs while still retaining full flexibility in the implementation of them. Languages in the stable layer can be Java, Scala or F#. It should be fairly small compared to the rest of the application, and just provide the base services needed for everything to function.
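As a small, hedged sketch of that split (the interface name is made up for illustration): a Java interface defined in the stable layer can be implemented from the dynamic layer in JRuby, so Java callers keep a statically typed API while the implementation stays flexible:

require 'java'

# Sketch only: com.example.ReportService is an imagined interface living in
# the stable (Java) layer. Including a Java interface is how a JRuby class
# implements it.
class RubyReportService
  include com.example.ReportService

  def generate(id)
    "report for customer #{id}"
  end
end

# A Java component can be handed RubyReportService.new and will only ever
# see it through the ReportService interface.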

The most common objection I hear from people about this strategy is the same as for the general polyglot programming idea: if we have a proliferation of languages in a system, it will be harder to find skilled programmers who can work with it.

This objection is true to a degree, but there are several ways around it. First, I have to say that I don't believe this is as big a problem as many others think. Programmers nowadays depend on their tool chains quite heavily, all of them including many advanced features that take lots of time to learn. But most programmers don't even view their languages as tools. In my mind, the programming language is the most important tool. And once we start using better languages for systems, many of the things we need other tools for will disappear or become less of a problem.

I tend to believe that programming languages are quite easy to learn once you understand the fundamental building blocks of programming languages. And if you don't have a fair understanding of those building blocks, I would say that you probably aren't using your current language as well as you should either. I see this as part of being a responsible programmer.

I also believe quite strongly that if we used better languages for our code, many code bases would be smaller, easier to understand, easier to maintain and cost less - which means you could afford to find a more skilled programmer to do the work for you. This would mean that both parties win - the programmer gets more interesting work and better code, while the client gets more value for their money in less time.

RailsConf 2008

I've landed and gotten mostly back in the right timezone without too many incidents (except running through SFO to board a very badly scheduled connection).

After allowing the impressions from the last 6-7 days to sink in a little, it's time to summarize RailsConf. I'll go through the sessions I saw and then do some concluding remarks.

The first day was tutorials. I had a good time in Neal Ford's and Pat Farley's tutorial on metaprogramming. I can't say I learned much from the sessions, but it was very good content, extremely well presented, and I got the impression that many in the room learned lots of crucial things. The kind of knowledge about internals you get from a talk like this allows you to understand how metaprogramming in Ruby actually works, which makes it easier to achieve the effects you want.

After that I sat around hacking in the Community Code Drive for the rest of the day, with lots of other people. I wasn't involved in gitjour (which by the way is incredibly cool), but I did manage to find a memory leak in iTerm's Bonjour handling thanks to gitjour. Neat. David Chelimsky and I paired on getting support for multiline plain text story arguments into RSpec, and by the end of the afternoon it was in.

Finally, we headed out to the JRuby hackfest, which ended up being overfull with people. That's a good problem to have. We had a great time, hacking on different things, helping people get started and debugging various problems. All in all it was a very productive day.

I began the Friday with Joel Spolsky's keynote. In contrast to many other people I didn't like it. There wasn't really any substance to it, just some humorous content and lots of jokes about naked women. I expect something a bit more profound from the first keynote of the conference, since it has a tendency to set the standard for the rest of the days.

After the keynote, John Lam showed off IronRuby running a few simple Rails requests. This is a great achievement, and I'm very impressed with their results. I have argued that IronRuby would probably never reach this point, and I'm very happy to admit I was wrong and offer my apologies to John Lam and the IronRuby team. That said, the fact that IronRuby runs a few different Rails requests is not the same thing as saying that IronRuby runs Rails. My personal definition of running Rails is more about having the Rails test suite run at a high percentage of success (something like 96-98% would be good enough for almost all Rails apps to work, provided they are the right 98%). (ED: Evan Phoenix just told me that MRI doesn't run the Rails test suite totally clean either, because of the way the Rails development process works. So a 100% is probably not a good measure of Rails compatibility.) I assume that this is going to be the next goal for the IronRuby team, and I wish them good luck.

I saw the Hosting talk after that, but I have to admit I was wrapped up in a seriously annoying JRuby bug at the time, so I didn't really pay attention.

The DataMapper talk was very full and gave a good overview of why DataMapper might be a better choice than AR in many cases. The presentation style could possibly have been a bit less dry, but the content was definitely delicious.

If the next two days were the JRuby days, the Friday was the day for all other alternative implementations. I sat in on the Rubinius talk by Evan Phoenix and friends, and then the much talked about MagLev presentation.

I first want to congratulate Rubinius on running several different Rails requests. It's very cool and a great milestone. The same caveats as for IronRuby apply, of course. But wow, the debugging features are awesome. First class meta objects are extremely powerful, and will provide many capabilities to the platform. The presentation was also extremely entertaining - one of the best presentations for the sheer fun everyone seemed to have. Props to Evan, Brian and Wilson for this.

So. The MagLev talk. First, there seems to be some misunderstandings about what MagLev actually is. It is not a hosting service. Gemstone might offer a hosting service around MagLev in the future, but that's not what is going on here. MagLev is a new virtual machine for Ruby, based on Gemstone/S. Basing it on a Smalltalk machine makes it very easy for Gemstone to implement a large subset of Ruby and having it running cleanly and with good performance. Exactly how much has been implemented at this point is not really clear, since no major applications run, and the RubySpecs have not been used on it yet. I assume that the implementation doesn't handle enough Ruby features yet to be able to run the mspec runner and other important machinery.

Was this presentation important? Yeah, sure. To a degree. It was a cool presentation, whetting people's appetite by showing something that might some day become a real Ruby platform with built in support for an incredible OODB. But it's still early days.

The Saturday began with Jeremy's keynote. He talked about the new things in Rails 2.1 and also showed the same app running in Ruby 1.8, 1.9, Rubinius and JRuby. Very cool.

I ended up in Nathaniel Talbott's 23 Hacks session, which was fun. Good stuff.

After that the JRuby day began in earnest with Nick's talk about deploying JRuby on Rails. This was mostly the same talk as given at JavaOne, but more geared towards Ruby programmers. Useful information.

Dan Manges and Zak Tamsen gave an extremely useful talk about how to test Rails applications correctly. Very good material. Exactly the strong kind of deep technical knowledge, gained by experience, that people go to conferences to get.

My talk about JRuby on Rails was generally well received. I had a fun time, and of course I managed to run out of time as usual. I wonder why I'm always afraid of running out of material. That has never happened when I'm talking about JRuby.

The final technical session of the day ended up being a walk-around to all the different presentations going on and taking a peek, and then ending up hacking in the speakers room.

The evening keynote was by Kent Beck, and as usual he is fantastic to listen to.

The Sunday started with the CS nerds anonymous session, held by Evan Phoenix. It ended up being a kind of lightning talk session, and had some nice points.

After that Ezra gave his talk - that had nothing to do with the session title. He presented Vertebra, which is a cloud computing control system, based on XMPP, Erlang and the actors model. Very cool stuff, although it might not be that useful for people who aren't in charge of a quite large number of computers. But if you have your own botnet, this might be the best way to control them all. =)

The final session of the day was the JRuby Q&A session, which basically flew by. The first ten minutes went in normal time, and then suddenly the session was over. I think we had good attendance, and the right level of questions. You can see all the points covered in Nick's blog, here.

And then it was over.

So, what was good? The technical level was definitely deeper and more rooted in experience. I have to say that this was probably the best Ruby conference I've been to, based on the depth and level of the presentations. Kudos to the scheduling people.

And what was bad? A little bit too much hype about MagLev, and everyone's tendency to use dark colors on black backgrounds in their presentations. Hey, they look good on your computer screen, but it's really not readable!

Thursday, May 29, 2008

Ruby doesn't have meta classes

OK. It's time to get rid of this terminology problem. Ruby does NOT have meta classes. You can define them yourself, but it's not the same thing as what is commonly called the meta class. That is more correctly called the eigen class. The singleton class is also better than meta class, but eigen class is definitely the most correct term.

So what is a meta class then? Well, it's a class that defines the behavior of other classes. You could define meta classes in Ruby if you wanted to, by defining a subclass of Class. Those classes would be meta classes.

Edit: Of course, if you actually try to define a subclass of Class you will find that Ruby doesn't allow you to do that, which means that you don't have any meta classes in Ruby. Period.
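A quick sketch of both points:

# Trying to create a real meta class - a subclass of Class - is rejected:
begin
  Class.new(Class)      # same thing as: class MyMeta < Class; end
rescue TypeError => e
  puts "not allowed: #{e.message}"
end

# What Ruby does give you is the eigen class (often mislabeled "meta class"):
obj = Object.new
eigen_class = class << obj; self; end
puts eigen_class        # something like #<Class:#<Object:0x...>>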

Ruby closures addendum - yield

This should probably have been part of one of my posts on closures or defining methods, but I'm just going to write this separately, because it's a very common mistake.

So, say that I want to have a class, and I want to pass a block when creating an instance of this class, and then be able to invoke that block later, by calling a method. The idiomatic way of doing this would be to use the ampersand and save away the block in an instance variable. But maybe you don't want to do this for some reason. An alternative would be to create a new singleton method that yields to the block. A first implementation might look like this:
class DoSomething
  def initialize
    def self.call
      yield
    end
  end
end

d = DoSomething.new do
  puts "hello world"
end

d.call
d.call
But this code will not work. Why not? Because, as I mentioned in my post about defining methods, "def" will never create a closure. Why is this important? Well, because the current block is actually part of the closure. The yield keyword will use the current frame's block, and if the method is not defined as a closure, the block invoked by the yield keyword will actually be the block sent to the "call" method. Since we don't provide a block to "call", things will fail.

To fix this is quite simple. Use define_method instead:
class DoSomething
  def initialize
    (class << self; self; end).send :define_method, :call do
      yield
    end
  end
end

d = DoSomething.new do
  puts "hello world"
end

d.call
d.call
As usual with define_method we need to open the singleton class, and use send. This will work, since the block sent to define_method is a real closure, that closes over the block sent to initialize.
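For comparison, the idiomatic ampersand approach mentioned at the top of this post would look something like this (just a minimal sketch):
require 'rubygems' if false # no dependencies needed, plain Ruby

class DoSomething
  # Capture the block as a Proc and store it in an instance variable
  def initialize(&block)
    @block = block
  end

  # Invoke the stored block on demand
  def call
    @block.call
  end
end

d = DoSomething.new do
  puts "hello world"
end

d.call
d.call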

torsdag, maj 22, 2008

JRuby RailsConf hackfest next Thursday

LinkedIn, Joyent and Sun Microsystems are sponsoring a JRuby hackfest in conjunction with RailsConf. It will happen next Thursday from 6:30 PM in Portland, and there will be some food and beer and so on. Oh, Charles, Nick, Tom and I will be there - bring your laptops and any and all questions/patches/bugs/ideas with regards to JRuby.

Read more in Charles's blog, here: http://headius.blogspot.com/2008/05/jruby-pre-railsconf-hackfest-on.html

Remember to RSVP to Charles if you're coming. Space is limited so RSVP as soon as possible.

onsdag, maj 21, 2008

How large is your .emacs?

I've been reading lots of blogs and opinions about emacs the last few days. What strikes me is all of these people who brag about how large their .emacs files have become. So let me make this very clear:

If your .emacs file is longer than a page YOU ARE DOING IT WRONG.

Why? Well. Unless you are a casual Emacs user, your .emacs should not be regarded as a configuration file. Rather, the .emacs file is actually the entry point to the source repository of your own version of Emacs. In effect, when you configure Emacs you create a fork, which has its own source that you need to maintain. This is not configuration. This is programming and you should approach it like you do all programming. What does that mean? Modularization. Clean code. Code comments. Source control. Tests. But modularization and source control are the ones that are most important for my Emacs configuration. I have loads of files in ~/emacs and every kind of extension I do has its own kind of file or directory to put it in. The ~/emacs directory is checked out from source control, and has got customizations for different platforms. That's why my .emacs file is 4-5 lines long. Two for setting customizations that are specific to this computer, and the rest to load the stuff inside of ~/emacs. And that's all.

So how do you handle modularization of Emacs Lisp code? This won't be a tutorial. Just a few pieces of advice that might make things easier.

In no specific order:
  • (load "file.el") will allow you to just load another file.
  • (require 'cl) will give you lots of nice functionality from Common Lisp
  • I recommend you have one place where you add all your load paths. Mine look something like this:
    (labels ((add-path (p)
               (add-to-list 'load-path
                            (concat emacs-root p))))
      (add-path "emacs/jde/lisp")
      (add-path "emacs/nxml")
      (add-path "emacs/own")  ;; Personal elisp code
      (add-path "emacs/lisp") ;; Various elisp code, just dumped here
      )
  • Why do it like this? Well, it gives you an easier way to add full paths to your load path without repeating lots of stuff. This depends on you defining emacs-root somewhere - do define it, it can be highly useful.
  • Set custom-file. (The custom-file is the file where Emacs saves customizations. If you don't set a specific file for this, you will end up getting all customizations saved into .emacs which you really don't want.) The code for this is simple. Just do (setq custom-file "the-file-name.el")
  • Use hooks and advice liberally. They allow you to attach new functionality without monkey patching.
  • If you ever edit XML, NEVER use Emacs builtin XML editor. Instead download the excellent NXML package.
  • Learn how to use Info and customizations
  • Use Ido mode
Feel free to add other good advice in the comments. These were just a small smattering of things I like, and they can improve your environment quite a bit. But the most important part is the whole thing about keeping your .emacs extremely small!

Addendum: As Phil just pointed out (and as was part of my plan from the beginning), autoloads should be used as much as possible. Also, make sure to byte-compile as much as possible.

måndag, maj 19, 2008

Break Java!

As some of you might have noticed I am not extremely fond of everything in the Java language. I have spent some time lately trying to figure out how I would change the language if I could. These changes are of course breaking, and would never be included in regular Java. I've had several names for it, but my current favorite is unJava. You can call it Java .314 or minijava if you want. Anyway, here's a quick breakdown of what I'd like to see done to make a better language out of Java without straying too far away from the current language:
  • No primitives. No ints, bytes, chars, shorts, floats, booleans, doubles or longs. They are all evil and should not be in the language.
  • No primitive arrays. Java's primitive arrays are not typesafe and are evil. With generics there is no real point in having them, especially since they interact so badly with generic collections. This point would mean that certain primitive collection types can't be implemented in the language itself. This is a price I'm willing to pay.
  • Scala style generics, with usage-defined contra/co-variance and more flexible type bounds.
  • No anonymous inner classes. There is no need for them with the next points.
  • First class methods.
  • Anonymous methods (these obviously need to be closures).
  • Interfaces that carry optional implementation.
  • No abstract classes - since you don't need them with the above.
  • Limited type inference, to avoid some typing. Scala or C# style is fine.
  • Annotations for many of the current keywords - accessibility specifically, but also things like transient, volatile and synchronized.
  • No checked exceptions.
  • No angle brackets for generics (they really hurt my eyes. are there XML induced illnesses? XII?). Square brackets look so much better.
  • Explicit separation of nullable from non-nullable values.
These points are probably quite substantial together, but I still don't think the language would be that different from Java in syntax and semantics. The usage patterns would be extremely different though. You wouldn't sacrifice any performance with these kinds of things - they wouldn't change the characteristics of the output that much, and I believe these changes could make the language smaller, cleaner, and easier to work with.

lördag, maj 17, 2008

ThoughtWorks comes to Sweden

A few months back I blogged about the possibility that ThoughtWorks would come to Sweden. Well, this is now reality. I have the extreme honor to be a part of this initiative together with Marcus Ahnve (who blogged about it here). If you read that blog you will know that Marcus will head the Swedish operation. I am immensely happy about having him as my new colleague and also boss. =)

People might ask what my role in this new office will be. That's a valid question. My main goal is to stay out of trouble - and trying my very best to not scare potential customers away. Marcus is extremely capable and will handle all challenges, which means that I'll do my best to bask in the glory of opening a new office. I might also have a hand in any billable work we do, and help out with recruitment and possibly even (shudder) marketing.

Not sure if you caught the meaning in that last sentence, but let me spell two points out. We will be selling work in Sweden from day one. I will be one of the consultants sold, and that means that if you have a Ruby or JRuby project you want to start up, this might be an excellent time to call your local ThoughtWorks office... =)

The recruitment point is simply this. We plan on accumulating the best people we can find - as we aim to do in every country we enter. If you feel like you could fit this bill, mail me and we can talk.

It's important to note that this operation will initially be very low profile. Don't expect center folds in DN's Economy pages. We will work mostly with word-of-mouth. So if you hear of someone that might need our help, don't hesitate to mention our name. And even though we are low profile, we will still have the resources of the whole company to draw on. A thousand ThoughtWorkers. That feels rather good, and it should feel even better for any prospective clients.

One thing I have been a bit worried about is my commitment to JRuby, and other open source projects. I assure everyone I'll do my best to live up to these commitments. Sleep be damned!

These are exciting times. I for one am looking forward to it very much. Marcus and I will officially start on this in June this year. Get in touch if you have any questions. It's my name separated with dots at gmail, or obini at the official thoughtworks domain.

torsdag, maj 15, 2008

Dynamically created methods in Ruby

There seems to be some confusion with regards to dynamically defining methods in Ruby. I thought I'd take a look at the three available methods for doing this and just quickly note why you'd use one method in favor of another.

Let's begin by a quick enumeration of the available ways of defining a method after the fact:
  • Using a def
  • Using define_method
  • Using def inside of an eval
There are several things to consider when you dynamically define a method in Ruby. Most importantly you need to consider performance, memory leaks and lexical closure. So, the first, and simplest way of defining a method after the fact is def. You can do a def basically anywhere, but it needs to be qualified if you're not immediately in the context of a module-like object. So say that you want to create a method that returns a lazily initialized value, you can do it like this:
class Obj
  def something
    puts "calling simple"
    @abc = 3*42
    def something
      puts "calling memoized"
      @abc
    end
    something
  end
end

o = Obj.new
o.something
o.something
o.something

As you can see, we can use the def keyword inside of any context. Something that bites most Ruby programmers at least once - and more than once if they used to be Scheme programmers - is that the second def of "something" will not do a lexically scoped definition inside the scope of the first "something" method. Instead it will define a "something" method on the metaclass of the currently executing self. This means that in the example of the local variable "o", the first call to "something" will first calculate the value and then define a new "something" method on the metaclass of the "o" local variable. This pattern can be highly useful.

Another variation is quite common. In this case you define a new method on a specific object, without that object being the self. The syntax is simple:
def o.something
  puts "singleton method"
end
This is deceptively simple, but also powerful. It will define a new method on the metaclass of the "o" local variable, constant, or result of method call. You can use the same syntax for defining class methods:
def String.something
  puts "also singleton method"
end
And in fact, this does exactly the same thing: since String is an instance of the Class class, this will define a method "something" on the metaclass of the String object. There are two other idioms you will see. The first one:
class << o
  def something
    puts "another singleton method"
  end
end
does exactly the same thing as
def o.something
  puts "another singleton method"
end

This idiom is generally preferred in two cases - first, when defining on the metaclass of self. In this case, using this syntax makes what is happening much more explicit. The other common usage of this idiom is when you're defining more than one singleton method. In that case this syntax provides a nice grouping.
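As a small illustration of that grouping (reusing the "o" object from above, with a second made-up method name):
class << o
  def something
    puts "first singleton method"
  end

  def something_else
    puts "second singleton method"
  end
end

o.something
o.something_else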

The final way of defining methods with def is using module_eval. The main difference here is that module_eval allows you to define new instance methods for a module-like object:
String.module_eval do
  def something
    puts "instance method something"
  end
end

"foo".something
This syntax is more or less equivalent to using the module or class keyword, but the difference is that you can send in a block which gives you some more flexibility. For example, say that you want to define the same method on three different classes. The idiomatic way of doing it would be to define a new module and include that in all the classes. But another alternative would be doing it like this:
block = proc do
  def something
    puts "Shared something definition"
  end
end

String.module_eval &block
Hash.module_eval &block
Binding.module_eval &block
The method class_eval is an alias for module_eval - it does exactly the same thing.
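For comparison, the idiomatic module-based approach mentioned above would look something like this (the module name is just made up for the example; include is private in 1.8, so we go through send, just like with define_method):
module SharedSomething
  def something
    puts "Shared something definition"
  end
end

# Mix the module into each class that should get the method
String.send :include, SharedSomething
Hash.send :include, SharedSomething
Binding.send :include, SharedSomething

"foo".something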

OK, so now you know when the def method can be used. Some important things to remember about it: def does _not_ use any enclosing scope. The method defined by def will not be a lexical closure, which means that you can only use instance variables from the enclosing running environment, and even those will be the instance variables of the object executing the method, not the object defining the method. My main rule is this: use def whenever you can. If you don't need lexical closures or a dynamically defined name, def should be your default option. The reason: performance. All the other versions are much harder - and in some cases impossible - for the runtimes to improve. In JRuby, using def instead of define_method will give you a large performance boost. The difference isn't that large with MRI, but that is because MRI doesn't really optimize the performance of general def either, so you get bad performance for both.
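If you want to see the difference on your own runtime, a rough micro-benchmark along these lines (using the standard Benchmark library; the method names are made up and the exact numbers will obviously vary between implementations) is one way to check:
require 'benchmark'

class Foo
  # Defined with the def keyword
  def with_def
    42
  end

  # Defined with define_method - a closure over the class body
  define_method(:with_define_method) do
    42
  end
end

foo = Foo.new
n = 1_000_000

Benchmark.bm(20) do |bm|
  bm.report("def:")           { n.times { foo.with_def } }
  bm.report("define_method:") { n.times { foo.with_define_method } }
end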

Use def unless you can't.

The next version is define_method. It's just a regular method that takes a block that defines the implementation of the method. There are some drawbacks to using define_method - the largest is probably that the defined method can't use blocks, although this is fixed in 1.9. Define_method gives you two important benefits, though. You can use a name that you only know at runtime, and since the method definition is a block this means that it's a closure. That means you can do something like this:
class Obj
  def something
    puts "calling simple"
    abc = 3*42
    (class << self; self; end).send :define_method, :something do
      puts "calling memoized"
      abc
    end
    something
  end
end

o = Obj.new
o.something
o.something
o.something
OK, let this code sample sink in for a while. It's actually several things rolled into one. They are all necessary though. First, note that abc is no longer an instance variable. It's instead a local variable to the first "something" method. Secondly, the funky looking thing (class << self; self; end) is the easiest way to get the metaclass of the current object. Unlike def, define_method will not implicitly define something on the metaclass if you don't specify where to put it. Instead you need to do it manually, so the syntax to get the metaclass is necessary. Third, define_method happens to be a private method on Module, so we need to use send to get around this. But wait, why don't we just open up the metaclass and call define_method inside of that? Like this:
class Obj
  def something
    puts "calling simple"
    abc = 3*42
    class << self
      define_method :something do
        puts "calling memoized"
        abc
      end
    end
    something
  end
end

o = Obj.new
o.something
o.something
o.something

Well, it's a good thought. The problem is that it won't work. See, there are a few keywords in Ruby that kill lexical closure. The class, module and def keywords are the most obvious ones. So, the reference to abc inside of the define_method block will actually not be a lexical closure to the abc defined outside, but will instead cause a runtime error since there is no such local variable in scope. This means that using define_method in this way is a bit cumbersome in places, but there are situations where you really need it.

The second feature of define_method is less interesting - it allows you to have any name for the method you define, including something random you come up with at runtime. This can be useful too, of course.
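A typical use of that is generating a whole family of methods from names you only have at runtime, something like this sketch (the class and attribute names are made up for the example):
class Person
  def initialize
    @data = {}
  end

  [:name, :email, :phone].each do |attr|
    # Reader built from the runtime name
    define_method(attr) do
      @data[attr]
    end

    # Writer built from the same runtime name
    define_method("#{attr}=") do |value|
      @data[attr] = value
    end
  end
end

person = Person.new
person.name = "Ola"
puts person.name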

Let's summarize. The method define_method is a private method so it's a bit problematic to call, but it allows you to define methods that are real closures, thus providing some needed functionality. You can use whatever name you want for the method, but this shouldn't be the deciding reason to use it.

There are two problems with define_method. The first one is performance. It's extremely hard to generally optimize the performance of invocation of a define_method method. Specifically, define_method invocations will usually be a bit slower than activating a block, since define_method also needs to change the self for the block in question. Since it's a closure it is harder to optimize for other reasons too, namely we can never be exactly sure about what local variables are referred to inside of the block. We can of course guess and hope and do optimistic improvements based on that, but you can never get define_method invocations as fast as invoking a regular Ruby method.

Since the block sent to define_method is a closure, it means it might be a potential memory leak, as I documented in an older blog post. It's important to note that most Ruby implementations keep around the original self of the block definition, as well as the lexical context, even though the original self is never accessible inside the block, and thus shouldn't be part of the closed environment. Basically, this means that methods defined with define_method could potentially leak much more than you'd expect.

The final way of defining a method dynamically in Ruby is using def or define_method inside of an eval. There are actually interesting reasons for doing both. In the first case, doing a def inside of an eval allows you to dynamically determine the name of the method, it allows you to insert any code before or after the actual functioning code, and most importantly, defining a method with def inside of eval will usually have all the same performance characteristics as a regular def method. This applies for invocation of the method, not definition of it. Obviously eval is slower than just using def directly. The reason that def inside of an eval can be made fast is that at runtime it will be represented in exactly the same way as a regular def-method. There is no real difference as far as the Ruby runtime sees it. In fact, if you want to, you can model the whole Ruby file as running inside of an eval. Not much difference there. In particular, JRuby will JIT compile the method if it's defined like that. And actually, this is exactly how Rails handles potentially slow code that needs to be dynamically defined. Take a look at the rendering of compiled views in ActionPack, or the route recognition. Both of these places use this trick, for good reasons.
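To make that a bit more concrete, here is a small sketch in the spirit of what's described above - generating a method whose name and body are only known at runtime (the class, method and argument names are all made up):
class Template
  def self.compile(name, body)
    # The def inside the eval'd string becomes a perfectly ordinary
    # method, so invoking it is as fast as invoking any def'ed method
    class_eval <<-CODE
      def render_#{name}
        "#{body}"
      end
    CODE
  end
end

Template.compile("greeting", "hello world")
puts Template.new.render_greeting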

The other one I haven't actually seen, and to be fair I just made it up. =) That's using define_method inside of an eval. The one thing you would gain from doing such a thing is that you have perfect control over the closure inside of the method defined. That means you could do something like this:
class BinderCreator
  def get
    abc = 123
    binding
  end
end

eval(<<EV, BinderCreator.new.get)
Object.send :define_method, :something do
  abc
end
EV

In this code we create a new method "something" on Object. This method is actually a closure, but it's an extremely controlled closure since we create a specific binding where we want it, and then use that binding as the context in which the define_method runs. That means we can return the value of abc from inside of the block. This solution will have the same performance problems as regular define_method methods, but it will let you control how much you close over at least.

So what's the lesson? Defining methods can be complicated in Ruby, and you absolutely need to know when to use which one of these variations. Try to avoid define_method unless you absolutely have to, and remember that def is available in more places than you might think.

Would Type Inference help Java

My former colleague Lars Westergren recently posted a blog (here) about type inferencing, posing the question whether type inference would actually be good for Java, and if it would provide any benefits outside of just "less typing".

In short: no. Type inferencing would probably not do much more than save you some typing. But how much typing it would save you could definitely vary depending on the type of type inference you added. The one version I would probably prefer is just a very simple hack to avoid writing out the generic type arguments. One simple way of doing that would be to allow an equals sign inside of the angle brackets. In that case you could do this:
List<=>      l  = new ArrayList<String>();
List<String> l2 = new ArrayList<=>();
Of course, you can do it on more complicated expressions:
List<Set<Map<Class<?>, List<String>>>> l = new ArrayList<=>();
This would save us some real pain in the definition of genericized types, and it wouldn't strip away much stuff you need for readability. In the above examples it would just strip away one duplication, and you don't need that duplication to read it correctly. The one case where it might be a little bit harder to read would be if you defined a variable and assigned it somewhere else. In that case the definition would need to carry the type information, so the instantiation would use the <=> syntax. I think that would be an acceptable price to reduce the verbosity of Java generics.

Another kind of type inference that would be somewhat useful is the kind added to C#, which is local to a scope. That means there will be no type inference of member variables, method parameters or return values. Of course, that's the crux of Lars's question, since this kind of type inference potentially removes ALL type information in the current text, since you can do:
var x = someValue.DoSomething();
At this point there is no easy way for you to know what the type of x actually is. Reading it like this, it looks a bit frightening if you're used to Java type tags, but in fact this is not what you would see. In most cases you have a small method - maybe 5-15 lines of code - where x is being used in some way or another. In many cases you will see methods called on x, or x used as an argument to method calls. Both of these usages give you clues about what it might be, but in fact you don't always need to know what type it is. You just need to know what you can do with it. And that's exactly what Java interfaces represent. So for example, do you know what class you get back from Collections.synchronizedMap()? No, and you shouldn't need to know. What you do know is that it's something that implements Map, and the documentation says that it is synchronized, but that is it. The only thing you know about it is that you can use it as a map.

So in practice, the kind of type inference C# adds is actually quite useful, clean, and doesn't cause too much trouble - especially if you have one of those fancy IDEs that do method completion... =)

From another angle, there are some things that type inference could possibly do, but that you will never see in Java. For example, say that you assign a variable to something, and later you assign that variable to some other value. If these two values are distinct types that don't overlap in the inheritance chain, you will usually get an error. But if you have an advanced type system, it will do unification for you. The basic versions will just find the most common supertype (the disjunction), but you can also imagine the compiler injecting a new type into your program that is the union of the two types in use. This will provide something similar to duck typing while still retaining some static type safety. If your type system allows multiple inheritance, the synthetic union type might even be a subclass of both the types in question.

So yeah. The long answer is that you can actually do some funky stuff with type inference that doesn't immediately translate to less typing. Although less typing and better abstractions are what programming languages are all about, right? Otherwise assembler provides everything we want.

onsdag, maj 14, 2008

A New Hope: Polyglotism

OK, so this isn't necessarily anything new, but I had to go with the running joke of the two blog posts this post is more or less a follow up to. If you haven't already read them, go read Yegge's Dynamic Languages Strikes Back, and Beust's Return Of The Statically Typed Languages.

So let's see. Distilled, Steve thinks that static languages have reached the ceiling for what's possible to do, and that dynamic languages offer more flexibility and power without actually sacrificing performance and maintainability. He backs this up with several research papers that point to very interesting runtime performance improvement techniques that really can help dynamic languages perform exceptionally well.

On the other hand Cedric believes that Scala is bad because of implicits and pattern matching, that it's common sense to not allow people to use the languages they like, that tools for dynamic languages will never be as good as the ones for static ones, that Java generics aren't really a problem, that dynamic language performance will improve but that this doesn't matter, that static languages really haven't failed at all and that Java is still the best language of choice, and will continue to be for a long time.

Now, these two bloggers obviously have different opinions, and it's really hard to actually see which parts are facts and which are opinions. So let me try to sort out some facts first:

Dynamic languages have been around for a long time - as long as statically typed languages, in fact. Lisp was the first one.

There have been extremely efficient dynamic language implementations. Some of the Common Lisp implementations are on par with C performance, and Strongtalk also achieved incredible numbers. As several commenters have noted, Strongtalk's performance did not come from the optional type tags.

None of the dynamic languages in wide use today are even on the same map with regards to performance. There are several approaches to fixing this, but we can't know how well they will work out in practice.

Java's type system is not very strong, and not very static, as these definitions go. From a type theoretic standpoint Java offers neither static type safety nor any complete guarantees.

There is a good reason for these holes in Java. In particular, Java was created to give lots of hints to the compiler so the compiler can catch errors where the programmer is inconsistent. This is one of the reasons that you very often find yourself writing the same type name twice, including the type name arguments (generics). If the programmer makes a mistake on one side, the compiler will be able to catch this error very easily. It is a redundancy in the syntax that makes Java programs very verbose, but helps against certain kinds of mistakes.

Really strong type systems like those Haskell and OCaml use provide extremely strong compile time guarantees. This means that if the compiler accepts your program, you will never see any runtime errors from the type system. This allows these compilers to generate very efficient code, because they know more about the state of the application at most points in time, compared to the compiler for Java, which knows some things, but not nearly as much as Haskell or OCaml.

The downside of really strong type systems is that they disallow some extremely common expressions - things you intuitively can imagine, but that can't be expressed within the constraints of such a type system. One solution to these problems is to add higher kinds, but these have a tendency to create more complexity and also suffer from some of the same problems.

So, we have three categories of languages here. The strongly statically checked ones, like Haskell. The weakly statically checked ones, like Java. And the dynamically checked ones, like Ruby. The way I look at these, they are good at very different things. They don't even compete in the same leagues. And comparing them is not really a valid point of reasoning. The one thing that I am totally sure of is that we need better tools. And the most important tool in my book is the language. It's interesting, many Java programmers talk so much about tools, but they never seem to think about their language as a tool. For me, the language is what shapes my thinking, and thus it's definitely much more important than which editor I'm using.

I think Cedric has a point in that dynamic language tool support will never be as good as that for statically typed languages - at least not when you're defining "good" to be the things that current Java tools are good at. Steve thinks that the tools will be just as good, but different. I'm not sure. To a degree I know that no tool can ever be completely safe and complete, as long as the language includes things like external configuration, reflection and so on. There is no way to include all dynamic aspects of Java, but using the common mainstream parts of the language will give you most of these. As always this is a tradeoff. You might get better IDE support for Java right now, but you will be able to express things in Ruby that you just can't express in Java because the abstractions will become too large.

This is the point where I'm going to do a copout. These discussions are good, to the degree that we are working on improving our languages (our tools). But there is a fuzzy line in these discussions, where you end up comparing apples and oranges. These languages are all useful, for different things. A good programmer uses his common sense to provide the best value possible. That includes choosing the best language for the job. If Ruby allows you to provide functionality 5 times faster than the equivalent functionality with Java, you need to think about whether this is acceptable or not. On the one hand, Java has IDEs that make maintainability easier, but with the Ruby codebase you will end up maintaining a fifth of the size of the Java code base. Is that trade off acceptable? In some cases yes, in some cases no.

In many cases the best solution is a hybrid one. There is a reason that Google allows more than one language (C++, Java, Python and JavaScript). This is because the languages are good at different things. They have different characteristics, and you can get a synergistic effect by combining them. A polyglot system can be greater than the sum of its parts.

I guess that's the message of this post. Compare languages, understand your most important tools. Have several different tools for different tasks, and understand the failings of your current tools. Reason about these failings in comparison to the tasks they should do well, instead of just comparing languages to languages.

Be good polyglot programmers. The world will not have a new big language again, and you need to rewire your head to work in this environment.