Monday, February 25, 2008

Carbide 1.3 End Game

We are in the last weeks of testing Carbide 1.3: the code is frozen, the test team is rechecking everything, and the beta testers are trying out the release candidate build. This is usually the time when some bug that has been around all along makes a surprise appearance in a really troubling way.

Last week we started getting some reports of the entire workbench locking up. People who were really beating on the tools all day said that often after a paste or save operation the UI would be unresponsive. Of course we couldn't reproduce the problem but we had two interesting clues: some people said the UI started responding again after a while and one beta tester sent in a screenshot of the frozen workbench.

Everyone has their own personal "time out" setting: the amount of time they are willing to wait for an unresponsive application to start working again before they kill it or restart their computer or chuck it over the cube wall. The fact that some people said the workbench came back to life after a while suggested we had some lengthy process blocking the UI thread, not that Carbide was frozen forever.

The screenshot gave us a few more clues: the progress indicator said the CDT indexer job was just starting and that the text the user was pasting hadn't yet appeared in the editor. I started trying to reproduce the problem by forcing a project to reindex and then repeatedly pasting code.

After a lot of pasting we found that the problem had nothing to do with the paste operation but with the use of the control key (as in ctrl-v to paste or ctrl-s to save). With the control key down, CDT was asked to produce a hyperlink. Normally this happens quickly because the file has already been indexed and the results cached. But when the file is dirty the cached results are stale, so CDT proceeds to do some indexing on the UI thread, which leaves the workbench completely unresponsive until it finishes.

It seemed like the right thing to do in this case was to just not do the indexing if the cache was stale. We turned to the CDT team to confirm this, and within a few hours we had changes in place to prevent this problem and a couple of other related issues too.
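The idea behind that fix can be sketched in plain Java. This is only an illustration, not the actual CDT change: `IndexCache`, `HyperlinkSketch`, and `detectHyperlink` are hypothetical names made up for this example. The point is that the code running on the UI thread checks whether the cache is stale and, if so, gives up right away instead of starting expensive work.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for cached index results; this is NOT the CDT API.
final class IndexCache {
    private final AtomicBoolean stale = new AtomicBoolean(true);

    boolean isStale() { return stale.get(); }

    // Assumed to be called by a background indexer job when it finishes.
    void markFresh() { stale.set(false); }

    // Assumed to be called whenever the file is edited.
    void markStale() { stale.set(true); }

    // Cheap lookup that is only valid while the cache is fresh.
    String lookup(String identifier) {
        return "declaration of " + identifier;
    }
}

final class HyperlinkSketch {
    // Meant to run on the UI thread, so it must never trigger indexing itself.
    static Optional<String> detectHyperlink(IndexCache cache, String identifier) {
        if (cache.isStale()) {
            // No hyperlink this time; the user can try again once indexing is done.
            return Optional.empty();
        }
        return Optional.of(cache.lookup(identifier));
    }

    public static void main(String[] args) {
        IndexCache cache = new IndexCache();
        System.out.println(detectHyperlink(cache, "main").isPresent());
        cache.markFresh();
        System.out.println(detectHyperlink(cache, "main").isPresent());
    }
}
```

The trade-off is that a ctrl-click on a freshly edited file may not produce a link right away, which seems a lot better than a workbench that stops responding.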

So now I'm starting this week like I did the last: hoping Carbide 1.3 can slide out the door without any serious issues popping up.

Wednesday, February 20, 2008

Looking at DSF

Our next major release of Carbide.c++ will be based on Ganymede, which will include the quickly maturing Debug Services Framework. The DSF team has been busy joining the framework with CDT 5.0 and beefing up the reference implementation that works with gdb on Linux.

The Carbide.c++ debugger uses the CDI APIs to let our debug engine provide services to the common C++ debug support in CDT. That has worked pretty well, but there are lots of new things we would like to do that would be a lot easier using DSF. So I'm going to start looking into bringing up our debugger in this new, more flexible, and highly asynchronous environment. I haven't looked at DSF in any depth since last September and there has been a lot of refactoring since then, but now that things have settled down with M5 I'm going to dive back in.

Monday, February 11, 2008

Bug Hunting in Big D

The people in our beta group have been really positive while sending plenty of constructive criticism about things we need to improve. It's been very satisfying working with them to track down some of the bugs that escaped us last time. One was a report about debugger performance: some people said that when debugging they found that "stepping through code was really slow."

These kinds of problems can be really difficult to reproduce and fix because there are so many variables. Only a few people were reporting the problem and it wasn't consistent. We couldn't reproduce it at all.

A lot of things can impact debugger stepping performance including the speed of the debug connection and the number of variables displayed each time, but none seemed to be a factor in this case.

As people ran into the problem we started to get more specific information: we heard that not only was stepping in the debugger slow but the entire Eclipse environment became really slow. Then someone noticed that everything on their computer was slow: every application, menu etc.

Next we heard from a developer on one of Nokia's software teams in Dallas. He could reproduce the problem but it only happened when debugging a particular project. I set up the same project and tried the same steps but couldn't reproduce the problem.

A trip to Dallas from our office here in Austin can easily occupy a long day, much of it spent in traffic, so first we tried a screen sharing session. That confirmed I was following the same repro steps with the project but getting different results: my debug session hummed along while my Dallas colleague's ground to a halt with the Java VM process sucking up 80-90% of the CPU. But the most interesting thing was that the application he was debugging (a music player) was sending a huge amount of text to the console view. The entire time a song was playing the console view was flooded with debug output. After playing a few songs the audio began to skip and the big slowdown began.

I pulled a few long songs from my music collection and while my console view was also flooded with various debug messages I didn't see any performance hit. Time for a road trip.

When I got to Dallas we quickly reproduced the problem and all the evidence pointed to the console output. The platform debug console has a setting that limits the amount of text in a console. My first theory was that we were overloading whatever method keeps the console under this limit. Maybe we were dumping in so much text it was thrashing about trying to keep the limit enforced?

Eclipse does so much cool stuff for you automatically that I assumed the console limit worked that way too. But that's built into the debug platform's ProcessConsole, and for various reasons ours is built on top of MessageConsole instead. So we didn't have any limit on the amount of text in a console. Over time all the text dumped into our console consumed more and more memory.
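The kind of limit our console was missing can be sketched as a buffer with a high and a low water mark. This is a simplified plain Java illustration, not Carbide's code: `BoundedConsoleBuffer` and its methods are hypothetical names made up for this example. As far as I know, the Eclipse console API offers the same idea through `IOConsole.setWaterMarks(low, high)`, which `MessageConsole` inherits.

```java
// Hypothetical sketch of a console text buffer with water marks; not Eclipse code.
final class BoundedConsoleBuffer {
    private final StringBuilder text = new StringBuilder();
    private final int lowWaterMark;   // characters kept after trimming
    private final int highWaterMark;  // maximum characters before trimming

    BoundedConsoleBuffer(int lowWaterMark, int highWaterMark) {
        if (lowWaterMark < 0 || lowWaterMark >= highWaterMark) {
            throw new IllegalArgumentException("need 0 <= low < high");
        }
        this.lowWaterMark = lowWaterMark;
        this.highWaterMark = highWaterMark;
    }

    void append(String s) {
        text.append(s);
        if (text.length() > highWaterMark) {
            // Drop the oldest text so only the newest lowWaterMark characters remain.
            text.delete(0, text.length() - lowWaterMark);
        }
    }

    int length() { return text.length(); }

    String contents() { return text.toString(); }
}

final class ConsoleLimitDemo {
    public static void main(String[] args) {
        BoundedConsoleBuffer console = new BoundedConsoleBuffer(8_000, 10_000);
        // Simulate a chatty debug target flooding the console.
        for (int i = 0; i < 100_000; i++) {
            console.append("debug: playing frame " + i + "\n");
        }
        // The length stays at or below the high water mark however much is written.
        System.out.println(console.length() <= 10_000);
    }
}
```

Trimming down to the low mark rather than just under the high mark means the buffer is cut back once in a while instead of on every single write, which matters when the output arrives as fast as it did here.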

Limiting the amount of text in the console view was simple enough and has fixed the problem. I'm still not sure why the effect of the big memory leak on my Dallas colleague's computer was so severe while my similarly configured laptop wasn't bothered. In any event the whole experience shows how initial descriptions of bugs can be misleading and that you need to work closely with the people using the tools all day in order to track down difficult-to-reproduce problems.