Programming Tales

Note: I'm retiring this web page in favour of a more general-purpose blog.

All opinions expressed in this webpage are my own. <- My employer made me say that. (Seriously!)

This page documents the programming "adventures" I have had since I joined my company. Nothing is too small to be listed. That is a lesson from my bad experience at my previous company: if you don't blow your own horn, you'll lose out in promotions and remuneration. It's not how good you are, but how good people think you are. I have learnt my lesson. Have you? :-)

These tasks are outside my immediate job scope, but still within the same project. Sometimes I take them on out of my own interest, sometimes I get "arrowed" into them. There is a tendency to settle into a comfort zone and become too specialized. I prefer to keep my skills broad-based, which is why you'll see a mix of tales below.

This is the sanitized version: no internal code names are mentioned, no confidential information is present, and no actual products are named. Also, you have no idea where I work and what I do, do you? :-)

Tracing why device was unable to enter low-power mode

Date: Dec 2008. Time: 1 day. Difficulty: Easy.

A bug report from the test lab: the old device was unable to enter low-power mode. Specifically, it exited the mode as soon as it entered it. However, the new device was able to do so — using the same firmware.

Seemed like a straightforward case of a hardware defect. Or was it?

I was puzzled why it failed on the old device; if anything, I would have expected the new one to fail instead. Then I realized the firmware had never been tested on the old device — it was meant for the new device, but was backward compatible.

I traced the source. After the device goes into low-power mode, one of the routines waits for an event to wake it up. Only two places set that event: one when the event really occurs, and one that restores some state after the device has woken up.

I knew the event didn't occur, because an event log was attached to it. Nothing was logged, hence the event didn't occur. So it had to be the other cause. It's as simple as that. But how?

The evidence had been staring at me all along, but I was confused by the event log (it wasn't in chronological order) and it took me a while to realize what was going on.

Suppose the event occurs once in low-power mode. The first thing the code does is wake up the rest of the system. This is asynchronous. It then sets a flag to say that it is awake.

In the wake-up part, the code sets the event if the device is still in low-power mode.

See the problem? If the wake-up part runs first, the bug occurs. If it runs later, the bug doesn't occur, because the flag has already been updated. It's a race condition!

Because the event has been set, the code will never wait properly again. It will find that it already has an event, hence it processes it straightaway.

Slow pulsing LED

Date: Sep 2008. Time: 3 days. Difficulty: Easy.

A binary on-off LED looks crude beside a slow pulsing LED.

How difficult is it to achieve a slow pulsing LED? Not very. You just have to simulate the duty cycle by toggling the LED very quickly. (We are talking about microseconds here.)

Update: a colleague later demonstrated the same functionality in an easier manner. First, he picked an isolated LED, unlike me: I had to untie my LED from UI control first. Then, he used the right timing functions, which I had overlooked in my haste.

Adding auto-sleep feature

Date: Jul 2008. Time: 3 days. Difficulty: Easy.

To save power, we need to put the hardware components to low-power mode when they are not in use. It was difficult to do so for one particular component due to the way it was used.

Doing it the conventional way was too troublesome. Since the component was always assumed to be on, clients just used it freely. It was easier for the component to auto-sleep when it was not needed.

It turned out to be easier than expected. I put some effort into a writeup to help convince the others, since it's not just what you do that's important, but also what others think.

Tracing memory corruption

Date: Jun 2008. Time: 2 days. Difficulty: Easy.

A memory location was corrupted; the periodic memory checker detected it and asserted. Unfortunately, by then the corrupting write was long past.

How do you trap on a write without memory protection? You look at the debug watchpoint feature.

I was able to put together the code to program the debug registers after reading the ARM Technical Reference Manual (a very terse read) and grepping the source code to see how to embed assembler code for our C compiler.

I got it to work within the first few tries. This would have been rated medium if I hadn't got it working so effortlessly.

This method only works if the same memory location is corrupted — our ARM processor has only two watchpoint registers. Luckily for us, this is the case.

Update: a colleague of mine finally enabled the memory protection a week later. Unfortunately, the bug had still not surfaced, so I wasn't able to score a point. The bug was caught one month later.

VM2

Date: Mar - Apr 2008. Time: 3 weeks. Difficulty: Easy.

After the successful implementation of VM, it's time to update it. VM2 is better than VM in every possible way: better designed, leaner, faster and more flexible.

Unfortunately, it is also not compatible with VM. However, I hope to make it 95% compatible at the script level.

VM2 took much less time to develop because only the core implementation was changed; there were very few functional changes. A huge chunk of the code was taken from VM with minimum changes. There are three main changes:

  • Improvement to the encoding format (simplifies decoding)
  • Improvement to vector implementation (not using C++ templates anymore)
  • More modular code (easier to add new functionality)

Random NVRAM corruption

Date: Mar 2008. Time: 1 week. Difficulty: Medium.

The NVRAM would be corrupted on powerup. This happened once in a blue moon. The obstacle was that it was hard to reproduce. Not for me: I powered the printer up and down for a few days to reproduce it.

When in doubt, use brute force.

I wasn't able to reproduce the corruption on demand, but it did occur. In time, I was able to increase the frequency of the corruption — although I didn't know why. It was interesting to make guesses and chase down blind alleys. (It was also interesting to "discover" how the NVRAM component stored its data. Such knowledge had been lost over time...)

I added more and more debugging messages until it showed that the corruption didn't occur at power up, but at power down — the printer powered down before it finished writing to the NVRAM.

Update: With this finding, I let another colleague take over. He didn't want to look at it before because he didn't think it was possible to reproduce it.

Extracting info from log

Date: Feb 2008. Time: 1 week. Difficulty: Easy.

The Ink Delivery System (IDS) component dumped data out in a haphazard manner. To help with tracing a bug, I wrote an awk script to collect the data so that it was easier to do analysis on them.

Sometimes, reformatting the data makes the problem much easier to solve. The script also revealed that the serial log had overrun due to too much data, making the log useless. (The script does sanity checking.)

I gave some suggestions on how to simplify the log while retaining the same data, but the response was that no one had time to change it. Sure, leave it as is and we'll face the same overrun again in the future.

Capturing an elusive defect in a log: extremely helpful.

Not able to use the log due to data loss: very annoying.

Not wanting to fix it: priceless.

Improve pen thermal logging

Date: Oct 2007. Time: 1 week. Difficulty: Easy.

The pen thermal log uses up a lot of space. By utilizing more efficient techniques at the expense of code complexity, I was able to reduce the space by 70% to 90%!

In these days of multi-GB memory, it is all too easy to use multi-dimensional arrays. However, it's still useful to know how to store data as efficiently as possible.

The main stumbling block was that we could not spend much time compacting the data, because it was received in a time-critical routine. That is just a mental block: store the data raw in the time-critical routine, then, as soon as it is over, process the data and store it efficiently.

Profile pen state

Date: Oct 2007. Time: 1 day. Difficulty: Easy.

Flushed with success from optimizing the APA timing, I decided to add profiling code to the pen state code. This allows us to identify any potential bottlenecks.

I don't like the way the pen code looks up the pen state — even though I was the one who wrote it! (It was okay for the initial implementation, but it was way overdue for improvement.) It does a linear search through the entire table every time. It doesn't cache or move frequently accessed states to the front.

Surprisingly, the lookup was pretty quick, so I didn't suggest making any changes.

I had suggested improving the pen state routine in the past. It was too fragile: every state change combination had to be in the table, or the printer would assert. Why not introduce one or more safe states, so that the printer can always switch from one state to another, either directly (if programmed), or through the safe states (if not handled)?

Improve APA timing

Date: Sep 2007. Time: 2 weeks. Difficulty: Easy.

I managed to reduce the APA timing from 126 seconds to 97 seconds. First, I profiled the code to find out the bottlenecks. Then, I tackled the two biggest bottlenecks.

After a quick round of (manual) profiling, I discovered there are three main bottleneck processes: generation of the rasters (60%), compressing the rasters (25%) and sending the rasters to Datapath (15%).

(Note: the breakdown is approximate.)

I had initially blamed the last process for the slow speed — because I knew it was inefficient. It turned out there were even bigger bottlenecks!

I improved the raster generation by better filtering of the objects to be generated. With this, I was able to filter out 60% of the objects most of the time. The breakdown changed to 20-50-30.

I then thought about how to improve the compression code. It made only one pass through the raster, which is about as fast as it can run. How about scanning four bytes at a time? The compression ratio dropped about 33%, but we gained 25% in time. A worthwhile tradeoff. The breakdown changed to 33-33-33.

What's the best that can be done? I got a timing of 80 seconds if I send the APA plot as a PCL print file. Thus, we can shave off another 17 seconds at most.

Out-of-box event log

Date: Apr 2007. Time: 1 week. Difficulty: Easy.

Created an out-of-box event log to help technical users diagnose what the printer is doing during out-of-box.

VM, take 2

Date: Oct - Dec 2006. Time: 3 months. Difficulty: Easy.

Implemented a Virtual Machine to allow APA to be scripted. Wiser and less ambitious than in my last attempt, I made the VM very simple and used BASIC scripts to make the VM assembler language more high-level.

It took a long time because the infrastructure and framework were defined and added as I went along. The core design has withstood the test of time.

Tracing an elusive assert

Date: Mar 2006. Time: 2 days. Difficulty: Easy.

Assigned to track down an elusive semaphore-related assert. It was rare, but it happened often enough during a multi-day stress test to be flagged.

The first thing I did was add a stack dump upon assert to get a clearer picture. I was able to convert it to a stack trace using the utility I wrote earlier (the map files have the same format even though this is the Sirius firmware).

Found the danger of specifying timeouts when taking semaphores and then not handling the timeouts. Units are a problem too: is the timeout in us or ms? Got to watch out.

Stack trace for ARM-based firmware

Date: Feb 2006. Time: 2 days. Difficulty: Easy.

Our new Sirius3 firmware enables the stack dump once again. And when you have the stack dump, it's natural to convert it to a stack trace. I repeated what I did the last time, except this time I wrote it as a shell/awk script instead of in C.

The effectiveness of inlining

Date: Oct 2005. Time: 2 days. Difficulty: Easy.

At long last, the project is coming to an end. Finally, I can take a look at optimization. There are very few parameters to tweak (the compiler does a good job), but one of them is inlining via the C++ inline keyword.

My aim was to inline very small functions, those that would reduce both space and time. I looked at the assembler output to evaluate the effectiveness. What I found surprised me. Remember the first rule of optimization: don't.

Virtual Machine

Date: Jul - Aug 2005. Time: 3 - 4 weeks. Difficulty: Medium.

Slightly wiser now, I tried again to create a VM-based scripting language. This time, I defined and created the VM first. The instruction set is a mixture of ColdFire and MIPS. I managed to write a simple (but slow) VM and an even simpler assembler (no compiler this time). After that, I had no more time to pursue this. I will follow up on it in the future.

Pass double quotes into makefile correctly

Date: Aug 2004. Time: 3 days. Difficulty: Easy.

After we switched to our new build environment, the following command no longer worked:

  makes DEPEND_MODULES="cpnt_a cpnt_b"

The reason is that makes is now a shell script and it strips away the double quotes while looking for its own options. The real makes sees this:

  makes DEPEND_MODULES=cpnt_a cpnt_b

It took me a while to find a solution, because I couldn't think of a way to preserve the double quotes while retaining the makes shell script's functionality. I persisted because I believe there's nothing you can't do with a shell script. :-)

Remote Monitor

Date: Mar 2004. Time: 2 weeks. Difficulty: Medium.

The existing minimon (printer shell) has several commands that act on memory: d to dump memory, e to edit memory, and dof to call a function. All of them are pretty useful, but dof turned out to be surprisingly useful for me, because many functions are not exposed otherwise. I'm now able to call them without recompiling and reflashing the firmware.

Unfortunately, these commands require a numeric address, so you have to look up the address in the symbol table and enter it manually. And if you enter it even slightly wrong, the program hangs. This is error-prone and repetitive. So, why not let the printer do the lookup for you?

I modified the commands to do a name-lookup via the network. Now you can enter the function or variable names directly!

In my implementation, the symbol-lookup program on the client (my Unix workstation) connects to the printer. It works this way because I don't want to hardcode the server in the printer.

The code was so far from production quality that I never checked it in. Very few people knew about this, and none of them were very impressed. I can't blame them: they thought of it as a debugging aid, and they don't need to debug this way. (They would rather change the source.)

But I had more in mind than this. Having access to all available functions means we can do remote scripting.

[Dec 2005] As it turned out, I did not have time to follow up on this again in my present company. I will try again in my next company.

VM-based scripting language

Date: Jan - Feb 2004. Time: 5 - 6 weeks. Difficulty: Medium.

I have always wanted to create a Virtual Machine: something small, fast, and register-based. I started by creating a compiler that compiles pseudo-C code to the VM assembler language. The compiler supports the usual stuff (variables, conditionals, loops, functions), but it remained essentially a toy program. Creating a VM from scratch is a lot of work!

Fixed make files

Date: Jan 2004. Time: 3 days. Difficulty: Easy.

Our make file dependencies were not correct in some cases. For example, we could not build just one C++ module, although we could do so for one C module. We needed to build all the modules, which was slow. No one bothered to fix it, until now.

Eliminating duplicate symbols during linking

Date: Jan 2004. Time: 4 days. Difficulty: Easy.

Why do people ignore compiler/linker warnings? As long as it works? I fixed this and got some shocking findings.

Threads and resources

Date: Jan 2004. Time: 2 days. Difficulty: Easy.

Can threads be started before their resources are allocated? Not surprisingly, the answer is yes. Somehow, I had the bad luck to encounter this logic bug, so it's up to me to fix it.

Finding null pointer accesses

Date: Dec 2003. Time: 6 weeks. Difficulty: Hard.

We have no memory protection, so null pointer accesses are undetected. How do we find such bugs then? How many are there? And can the CPU generate spurious null pointer accesses? The answers will surprise you.

We figured a way to generate an interrupt on a null pointer access (NPA). Then, we were able to track down all task-level NPAs and proved that our interrupt handlers were free from NPA. But before we could do all these, we had to track down spurious NPAs that were obscuring the real NPAs.

I had to use almost every trick I knew: hooking into interrupt handlers at run-time, hooking into the OS task-switch facility to monitor which task was active when an interrupt happened, disassembling object code, reverse engineering the interrupt stack to get the user-mode program counter (PC), and mapping the PC to the exact line in the source code (no debugger!).

Program with accessory card hangs during operation

Date: Nov 2003. Time: 3-1/2 weeks. Difficulty: Hard.

Our program hung sometimes when an accessory card was present. We suspected a hardware-related problem, but in the absence of hardware diagnostic tools, we had to use software methods to find the cause.

This would have been much simpler if the crash weren't fatal (we had to power-cycle), if one compilation cycle weren't 6-1/2 minutes, and if it didn't simply refuse to crash sometimes. This was also a lesson in doing post-crash analysis.

The hang occurred because an external chip kept interrupting the CPU when it caught an illegal access, and we weren't able to clear the chip's status bit. Or rather, we assumed we weren't able to. Later, I stared at the code, read the specs carefully, and saw that the status bit was being cleared wrongly! Still, the illegal access itself remained, and we had to solve it. That took another week, before we traced it to an improperly programmed DMA channel.

A look at code dependency

Date: Oct 2003. Time: . Difficulty: Easy.

Modules often get more tightly coupled as time goes by due to the lack of review and refactoring. Are our modules tightly coupled? We can argue about it all day, but I decided to go straight to the object files to find out. ddump and a few awk scripts do the trick.

Program hangs when upgraded

Date: Sep 2003. Time: 2 weeks. Difficulty: Medium.

We can upgrade our program while it is running. Sometimes it hangs. Why? We either ignored the problem or worked around it (although we didn't know exactly why the workaround worked). "What goes away on its own will come back on its own." Thus, it was important to nail it once and for all.

Before I fixed it, I found that a delay in a certain place seemed to make the hang go away. How sure were we? I wrote a script to upgrade the program every 3 minutes and ran it 3000 times. This was not the fix, but it was good enough at the time.

Although the delay eliminated the hang, I wasn't satisfied and spent more time to find the real cause. It paid off later when the hang occurred even with the delay in a new build.

User Interface module memory leak

Date: Sep 2003. Time: 2 days. Difficulty: Easy.

The User Interface module crashed under certain circumstances due to memory leaks. This caught my attention because I thought I had eliminated the last of the memory leaks.

We couldn't replicate the memory leak for a while, until I discovered how to do so by chance. It helped that my debugging tools were superior to the others' (with the TCP/IP CLI and so on). :-)

As an aside, due to the time pressure, the developer proposed clearing the User Interface module's memory pool when the pool is full. I didn't sound any warning, but I knew it would not be trivial due to the use of pointers all over the module's code. As I expected, it didn't survive the stress test.

Build process fails for new development systems

Date: Sep 2003. Time: 1 day. Difficulty: Easy.

One of our scripts checks for OS versions 9 and 10, but not the latest, version 11. Hence, the build failed on the new development systems, which run OS version 11. I found the cause, but I didn't correct it. I think the way we corrected it was to check for version 11 as well. Let's hope there is no version 12... :-)

The question here is not so much the poorly written script, but why no one made an effort to decipher the make file.

A note on troubleshooting in the field

Date: Sep 2003. Time: 3 days. Difficulty: Easy.

One thing I hate about troubleshooting in the field is the lack of diagnostic tools. Here is my wishlist, and how the tools would work.

Issues from mixing h/w and emulated floating point

Date: Aug 2003. Time: 1 week. Difficulty: Medium.

Playing tricks with the preprocessor is a dangerous game. Here, I found out how we abused it to use h/w and emulated floating point functions in different parts of the program.

Compiler optimization bug

Date: Aug 2003. Time: 1 week. Difficulty: Medium.

The compiler that we use emits the wrong instruction sequence under some circumstances. We worked around it with compilation flags. This time, I was determined to find the exact code that triggers it.

What is the effective clock rate?

Date: Jul 2003. Time: 3 days. Difficulty: Easy.

Our clock drifted because the clock frequency was programmed wrongly. We needed to compare it against an external reference clock to find the correct clock frequency. The question was not whether it could be done, but how to do it effortlessly.

Also, I found out about the lack of a RTC (real-time clock) and how it affected us. Who made the decision to leave it out? Did he fully understand the implications?

Without the RTC, we needed to get the real time from an external source. Once I found out how we did so, I questioned it immediately — it would never have passed on my watch. From then on, I always maintained that we need people with a Computer Science background so that we can develop proper algorithms.

Tracking down memory leak per run

Date: Jul 2003. Time: 2 weeks. Difficulty: Medium.

The free system memory decreased by 16 bytes after every run. Our program runs forever, so memory leaks are fatal. Rather than renaming every allocation function and recompiling the entire source code, I decided to hook to the allocation functions at run-time, with the help of an assembly language wrapper function.

I found that I really like the ColdFire instruction set.

A look at code size

Date: Jun 2003. Time: 3 days. Difficulty: Easy.

Our program is huge. What is the breakdown by module? A little ddump of the object files and some awk tell us.

Command-line interpreter over TCP/IP

Date: May - Jun 2003. Difficulty: Medium.

I dislike the design of the existing serial-based minimon (mini-monitor) -- although it is entirely functional -- and set out to write my own socket-based CLI (command-line interpreter). I like it, although it isn't really much better than minimon. Its most impressive feature is allowing remote diagnosis over the network; previously, we needed to be beside the product. To avoid rewriting the minimon commands, I redirected minimon to a socket -- without changing its source code! :-) Ah, knowing the OS inside out does pay off sometimes.

Stack dump to stack trace

Date: May 2003. Time: 3 days. Difficulty: Easy.

Our program can finally dump its stack on an assert(). But a raw stack dump is hard to read. When you have the symbol table, you can convert it to a stack trace with programmer-friendly function names. Somebody's got to do the work. :-)

Counting build warnings

Date: Apr 2003. Time: 2 days. Difficulty: Easy.

I sometimes stare idly as the compilation scrolls slowly off the screen. How many warnings are there? Which module contributes the most warnings? I wrote an awk script to find out.

(void *) &NHY;