Sunday, May 29, 2011

Web Workers faster than (naive) C++ [edited]


[EDIT] The main problem with the C++ version was probably the platform (Cygwin). So I re-ran the same program on Linux inside VirtualBox on the same machine and the results are not so catastrophic. I also ran the Python program with PyPy and it's twice as fast as CPython on that problem (Thanks to Alex Gaynor for trying it first).

I don't mean that to stir the controversy. If someone with great C++ skills could explain what I did wrong, I'd be glad to learn. So here is the story.

Multi-language parallel processing comparisons

It all began last week with Google Code Jam problem B in round 1A. The problem is the following. Two guys are playing hangman. They have a common dictionary of allowed words. The guy who tries to guess has a very simple strategy. He has a fixed list of letters that he will propose in order. However, he will skip letters that cannot appear in the word in the current state of the game. Each time he proposes a letter that does not appear in the word, the other guy makes one point. Your goal is to find the word that will make the most points, knowing the dictionary and the ordered list of letters of the guy who guesses.
I have written a solution with the same algorithm in four languages: Python, Java, C++ and JavaScript. I also wanted to get some knowledge of the parallel processing capabilities of those languages. In Python, I used the multiprocessing module, in Java, package java.util.concurrent with ExecutorService, in C++, OpenMP and in JavaScript HTML5 web workers. Here are the running times.

Single
Multi
1 worker 4 workers 6 workers
Java 26" 27" 18.5" 17.5"
C++ Linux 33" 33" 20" 21"
C++ Cygwin 58" 59" 30'00" 28'00"
PyPy 1'04" 1'08" 27" 26.5"
Python 2'30" 2'35" 50" 44"
JS Chrome 44" 45" 15.3" 14"
JS Opera 1'17" 1'17" 1'21" 1'35"
JS IE 1'30" No web workers
JS Firefox 1'34" crash
1'14" for 9
crash crash

Best times are highlighted in green in each column, worst times in red, crashing or unavailable solutions in black.

The code is available in github, with extensive explanations on how to build and run it. The HTML5 web workers solution can be tested right in your browser. To run the program in non-web worker mode, just click the "Run without workers" button. Be aware that the browser will probably tell you every few seconds that a script is making the page unresponsive. You need to allow it to continue if you want it to finish.
To run the program with web workers, you need to create a pool first. When you do that, a row of green squares will appear, to track the activity of allocated workers. Then, launch the computation with "Run". The "Queue size" will move briefly to 10, workers will turn red when busy. The result will appear once the queue is empty and all workers have finished.

Observations

About the C++ program

My first surprise is that I could not make my C++ program run as fast as the Java solution. My C++ skills are rusty, for sure, but when I wrote some numerical things not too long ago, I had no problem making them go two or three times faster than Java. This time, there is a lot of string and map handling and I guess my performance problems might come from too much memory allocation. I used gprof to locate potential hotspots, but nothing meaningful appears. I would genuinely be interested if some C++ wizard could explain how to fix that. [At least on Linux it's better than with Cygwin]
C++ 4-workers CPU profile

The second suprise is that using parallel processing actually made things worse, much much worse! Read the numbers: 30 minutes for the 4-workers C++ when Chrome takes 15 seconds. Actually it reinforces the hypothesis that my performance problems come from memory allocation. The multiple threads of execution are fighting each other for memory and provoke contention. That would explain the fact that the program only consumes 25% CPU, a big part of which is system time. [Yet again, on Linux the multi-threaded version sees a good increase in performance. Cygwin probably has to do much more work.]

About Firefox 4

The sequential solution, using only the main JavaScript thread, has a very curious memory profile.
Firefox non-worker solution

It looks like memory is never reclaimed during the execution, only after the function has finished running.

When running the 4-worker solution, Firefox CPU profile is similar to the C++ one: only 40% used and at least three quarters of that is system time. Using Process Explorer, I would conclude that Firefox uses threads for workers and probably meets memory contention too.

What is more troublesome is that Firefox cannot reliably compute the solution using web workers. Firebug catches some exceptions during execution. But those exceptions never occur in other browsers, or even in Firefox when running without workers. Among the ten tasks that workers have to process, there is generally one or two that produce an exception. It is sometimes possible to solve nine of the ten cases with one worker, but not always. And with four workers, it cannot even finish four cases. I have really no idea what is wrong.

About Opera 11

Opera performs quite well without workers. With workers, it performs the same. In fact, the more workers you add, the worse it performs. I guess that it uses "green threads", not native ones. No new threads appear in Process Explorer when launching workers and CPU use is always around 25%. With more workers, there is more work for the internal scheduler and some more seralization of data to do. That's what explains the slight degradation of running times.

About Internet Explorer 9

IE9 does not support web workers, so I only tested the standard solution. The performance is reasonable. It is worse than Chrome and Opera, but a bit better than Firefox.

About Chrome 13

Chrome is impressive, no doubt about that. The standard JavaScript solution is already the fastest of the browser pack (twice faster than Firefox and IE), and it benefits greatly from added workers. Using separate OS processes (like Python multiprocessing module) makes it immune to memory contention.

About Java

Java is a still a bit faster than Chrome V8 JavaScript in single-threaded mode. However, the gain when using multiple workers is not as big as expected: only 1.5 times faster with four workers when Python or Chrome go three times faster with four processes. I don't know why this is so and I don't have a precise idea how to diagnose. It might be a kind of memory contention, as with other multi-threaded implementations, except it does not appear as system time, because the JVM manages memory by itself.

About Python

I initially wrote my solution in Python, because the language feels so good. When ran with PyPy (1.5.0-alpha0) it can perform twice as fast as with CPython. That's great news!

Conclusions and open questions

I welcome all comments, on why my C++ solution is so slow, why Firefox fails with workers, and anything else. In the meantime, I will finish and release my micro-library to provide a Python-like Pool API above web workers, knowing that today, only Chrome can use it for heavy lifting.

Thursday, May 26, 2011

Taking advantage of multiple cores with JavaScript web workers

I'm at it again! 

This time, I wanted to test the performance of JavaScript for solving the same problem I already described previously. And I chose to jump at the chance to perfect my knowledge of HTML5 web workers.

My main resource for learning how to use web workers was this introduction on HTML5Rocks. I used Chrome v13 for my tests. If you try this at home and point to local filesystem files, don't forget to launch a fresh Chrome instance with the --allow-file-access-from-files command-line switch, otherwise you will receive some strange security errors. The other possiblity is to run from a local web server.

At first, I couldn't seem to start a worker correctly. Or at least, my worker did not seem to receive messages. I was calling:
self.addEventListener('message', handleMessageFunction, false);

I had to change it to the following form to make it work:
self.addEventListener('message', function(e) { slave.handleMessage(e); }, false);

So, passing just a function reference as second argument did not seem to work, even though the function existed, was valid, was not a prototype method and had the right signature.

Once I had overcome that "glitch", everything went smoothly. I re-implemented my Python algorithm in pure JavaScript. I tried it directly (without workers). I had to acknowledge a few times that, indeed, a script on the page was taking some time to complete. And it finished in about 45". Nice performance for JavaScript!

To the core(s)

Having been spoilt by the nice Pool API from Python multiprocessing package, I embarked upon a journey to implement something similar with web workers. Basically, I created a protocol of messages to be exchanged between the workers and the pool driver. It is not polished, but it works well already. You can see the result here. Quick explanation:
  • "Create pool" creates the worker pool with the requested number of workers. Workers are materialized by colored squares with a number. When the square is red, the worker is busy; when green, it's idle.
  • "Run" sends all problem cases to the pool. The size of the queue should increase. It is the number of tasks not yet taken by a worker.
  • "Kill" kills all workers and drops the pool, so that you can create a new one with a different number of workers.

Web worker pool demo
The input field is preloaded with the large problem set. On Chrome with four workers, the runtime is divided by three, compared to the single-threaded version: the problem is solved in about 15". Chrome does not seem to impose limitations on the number of workers that can be started. I tried with ten, and the results was almost identical. My machine was, of course, 100% busy.

Opera performs nicely on this one too, with the same code. Single-threaded solution in 75", four-worker solution in 75". Opera correcly creates the requested number of workers, but it probably does not use OS-level threads. They might only be "green-threads", with an internal scheduler, which would explain why only 25% of the CPU is consumed.

Firefox 4 gives me some problems, maybe because of memory limits or the size of the input value. The program runs unmodified, but you have to change the first input line from 10 to 8. I don't know why it chokes on the ninth and tenth cases. The single-worker version runs in 76". The four-worker one does not consume all available CPU. It's stuck at about 50% and consumes a lot of system time. Only three workers are busy simultaneously. And it runs in 230". So much for the performance gain.

And IE9, with its native HTML5 support... just kidding.

tl;dr

Chrome is fast, has great support for web workers. Demo here.

Monday, May 23, 2011

C++ slower than Java and Python!

... when badly programmed

I was still trying to squeeze the most performance out of my solution to the hangman problem from round 1A of Google Code Jam 2011 and so I decided to give C++ a try. C++ is like a long forgotten friend to me. I started my paid carreer with it, almost twenty years ago. I learned some tricks with the STL, template methods and other fun things. I already had a humble background in C, so pointers, memory manipulation and the like were still fresh in my head.

Today I simply tried to port my Python algorithm to C++, as straightforwardly as possible - I don't know if "straightforwardly" is a very usual adverb, but I'm French, so please excuse me if it isn't. With some vague memories about memory allocation I tried to use reference types as much as possible, to avoid unnecessary copying. I used string, vector<>, map<> quite easily. I wrote very simple input and output with cin and cout. I had a valid C++ program in no time. I ran it on the small input set. It was immediate and gave the right result. "Everything's under control" I thought.

Then I ran it over the large set, and it took 3'30": one minute longer than the Python version! Can you imagine?!

gprof to the rescue. I tweaked one heavily used string manipulation function. 3'. Still not fantastic. I realized that map<> was based on ordering and offered log(n) access time: I switched to unordered_map<> to get a hashmap implementation. Around 2'30". Now it was as fast as Python (but still five times slower than the Java version). And now gprof did not give my anything interesting. A few mangled symbols appeared on top, especially with -O3 compilation. I tried to reduce the number of string allocations, using string const & as much as possible, but to no avail. I started to think that C++ was definitely not good at string handling. I thought that maybe using Cygwin g++ was suboptimal. I tried g++ on Linux. Still the same order of performance.

And then I saw the light

In the middle of my main algorithm loop, I was swapping big maps. Real map objects, not references. So each iteration of the loop would do a complete copy of the previous map. Arghh! C++ memory allocation is quirky, coming from managed languages like Java, Python or JavaScript. Replacing those maps by references gave the top rank back to C++: 17" for the single-threaded version, running on a single core and not even compiled with optimization. L'honneur est sauf.

Lessons learned

  • It's far less difficult to work with C++ and the STL than I expected, for problems like those from Code Jam.
  • However, manual memory management is something you always have to keep in mind.
  • My newbie usage of gprof did not allow me to track superfluous data structure copying. If you have any suggestions in that domain, I'll be glad to hear them.

Sunday, May 22, 2011

Taking advantage of multiple cores with Python and Java

The first round of Google Code Jam 2011 has just ended. I participated in round 1A, waking up at 3AM on a week-end day (to tell you how motivated I was). The first problem was not too difficult, but it still took me a good 45 minutes to solve. Then came the second one, The Killer Word: the one where you have to find the hardest word for your opponent to guess in a game of hangman. I implemented the solution the way it was described, but my first small attempt failed.

I dug into the code again, and found a rule I had not implemented. Alas, my next attempt was also a failure. I had to try harder. Another bug fix later, my third submission was still not right. Time was running low. About twenty minutes before the end my rank was around 900th and falling. If I could not submit another valid solution, I would probably fall behind the qualifying 1000th rank.

After another pass over the code, I found yet another subtlety that I had not taken into account. Clickety-clickety-click (I have a "Das Keyboard" that really clicks, if you want to know). Another submission. Fingers crossed. And... yes! Now it was right. The small problem set was now solved. My rank raised to 500th. I was saved. Time to download a large problem set and run the program over it. Eight minutes should be plenty. The dictionary size was multiplied by 100 and the list of letters by 10, but it should be ok... Except it was not. Not at all, by a long shot.

Sleeping over it

With the knowledge of my qualification and still the problems in my head, I went to bed. When I woke up, I thought that perhaps a multithreaded implementation could have saved my attempt at solving the large problem set. As a first step, to speed up the program, I rewrote it from Python to Java. Straightforward, if a little dull. It ran faster, but still it was impossible for that algorithm to succeed in less than eight minutes on the large set. So: back to thinking.

I implemented a solution another way: instead of computing the score of each dictionary word in turn, I processed all words at once, keeping only the best one at the end. I programmed that new algorithm in Python again, because I really love its "runnable pseudo-code" feeling: no need to write types, literal syntax for everyday structures (tuples, arrays, dictionaries). I ended up with a valid solution that managed to solve the large set in about 2'30". Not too bad for Python.

Going multicore

The Python solution was now adequate, but the idea to take advantage of the mutiple cores of my machines was still in my head. So I decided to crawl the web and the excellent reference documentation for Python. My first reflex was to look at the thread and threading modules. But, as you probably know, the standard CPython implementation relies on a Global Interpreter Lock (the GIL) that practically makes multithreading useless, except for handling concurrent blocking I/O. Fortunately, the documentation pointed me to the multiprocessing module. The basic principle is that instead of creating thread inside the Python VM, that module allows to create OS-level processes, each one running a distinct Python VM, but with communication channels established between them, so that you don't have to leave the comfy Python environment. And of course, it's as platform-independent as possible.

For my needs, I looked at the Process class, but rapidly learned about the Pool class. The API is perfect, almost magic. You create a Pool object (Python guesses the number of CPU for you). You assign it all the functions you want to run asynchronously. You close it, which means you have no more work for it to do. You join it (you wait for all workers to process the functions). And you only have to get the results and sort them. That's all. In code, it looks like that:

 def solve_with_pool(cases):  
   from multiprocessing import Pool  
   
   pool = Pool()  
   async_results = [pool.apply_async(solve_case, args=(case,)) 
                    for case in cases]  
   pool.close()  
   pool.join()  
     
   results = [ar.get() for ar in async_results]  
   results.sort()  
   return results  

It's difficult to get easier than that. solve_case() is the function that, as the name implies, solves a case and returns the result as a tuple whose first element is the case index. That's why the simple call to results.sort() works: it sorts all results by index. In fact, it's probably useless, because the results list is built in the right order... Anyway, with that in place, the program runs now in 50", which is three times faster than the single-threaded version, on my quad-core machine. The speedup factor is not four, because the division of work I have made is coarse-grained and there are only ten cases in the large set. So one of the processes probably got to work on two large cases and ended some time after the others had finished. Still, not too bad for ten lines of code.

And Java?

If you read the title of this post thoroughly, you've seen Java in there. So. I remembered Java 1.6 standard library included a part of Doug Lea fork-join framework that was previously distributed as a separate package. Except it does not. It's planned to be included in the upcoming JDK 1.7. So in the meantime, I'll have to resort to simpler constructs. But at least, I think java.util.concurrent will give me a higher level interface than direct use of Thread and synchronization.

The principles are the same as the Python version, but of course, Java is a bit more verbose, mainly because of explicit typing and also because it lacks first-class functions. So I have to create a FutureTask, built from an anonymous class derived from Callable. Otherwise, the API is at the same level: a pool of executors, each one is executed, the pool is shut down, we wait for all computations to end (for one day, because we are forced to provide a timeout value), and then we collect all results. The managed exceptions also add a bit to the verbosity: both awaitTermination() and get() can raise some exceptions that are of no interest in our case.


  public static Result[] solve_with_pool(Case[] cases) {
    ExecutorService pool = Executors.newFixedThreadPool(4);

    FutureTask<Result>[] fresults = new FutureTask[cases.length];
    for (int i = 0; i < cases.length; ++i) {
      final Case c = cases[i];
      fresults[i] = new FutureTask(new Callable<Result>() {
        public Result call() {
          return solve_case(c);
        }
        });
      pool.execute(fresults[i]);
    }
    pool.shutdown();
    try {
      pool.awaitTermination(1, TimeUnit.DAYS);
      Result[] results = new Result[cases.length];
      for (int i = 0; i < cases.length; ++i) {
        results[i] = fresults[i].get();
      }
      return results;
      
    } catch (InterruptedException e) {
    } catch (ExecutionException e) {
    }
    return null;
  }

The net gain from going multithread in my case is less impressive than in Python. The single-threaded program solved the large input set in 25" while the multi-threaded one took 18". I have not investigated why.

Conclusion

Well, turning a single-threaded problem solving program into a multi-process or multi-threaded version is not too difficult. The available APIs sit at the right level of abstraction, both in Python and Java. The runtime gains are significant. Of note, the Java program, when running full steam on all four cores tended to make the whole PC sluggish, with the mouse pointer moving less smoothly than it should.

Now I know. And hopefully you do too.