What you don’t learn in grad school.

Developing robust, error-free software is one of the most important skills to learn in grad school, yet for a variety of reasons most of us never do, or at least that was my experience. Yours may be different. Here I will illustrate one context where a simple software engineering idea is very useful for eliminating hard-to-find bugs in code.

The Context

A project that I worked on while in grad school required me to program a Markov chain Monte Carlo sampler. I will not go into the technical details here, but the idea is that somewhere in the code you will have a loop that looks like the one shown below:

Iterate 10000 times:
    a = f(b, c)
    b = g(a, c)
    c = h(a, b)


The difficulty in troubleshooting the code given above is that in each iteration the value of each of the variables $$a$$, $$b$$ and $$c$$ depends on the values of the other two. Thus, if the code is not working as expected, the bug could be anywhere. One way to troubleshoot the code is to freeze the values of two of the variables (say $$a$$ and $$b$$) at known values and check whether the values of $$c$$ are as expected. I used to implement this idea with Boolean flags, as shown below:

Iterate 10000 times:
    If generate_a:
        a = f(b, c)
    If generate_b:
        b = g(a, c)
    If generate_c:
        c = h(a, b)


Using the above idea, I would test the code by setting the appropriate flags as shown below:

generate_a = False
generate_b = False
generate_c = True
Iterate 10000 times:
    If generate_a:
        a = f(b, c)
    If generate_b:
        b = g(a, c)
    If generate_c:
        c = h(a, b)
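
As a rough illustration, the flag-based test might look like this in Python. The bodies of f, g and h are made-up placeholders (simple Gaussian draws) so that the sketch runs; they are not the conditional distributions of any real sampler, and the "true" values 2.0 and 4.0 are likewise assumed.

```python
import random

# Hypothetical placeholder updates: each draws around a simple function
# of its arguments. A real sampler would use its own conditionals here.
def f(b, c):
    return random.gauss(b + c, 1.0)

def g(a, c):
    return random.gauss(a - c, 1.0)

def h(a, b):
    return random.gauss(0.5 * (a + b), 1.0)

def run_sampler(a, b, c, n_iter=10000,
                generate_a=True, generate_b=True, generate_c=True, seed=0):
    """Run the loop, skipping any update whose flag is False."""
    random.seed(seed)
    draws = []
    for _ in range(n_iter):
        if generate_a:
            a = f(b, c)
        if generate_b:
            b = g(a, c)
        if generate_c:
            c = h(a, b)
        draws.append((a, b, c))
    return draws

# Freeze a and b at the values used to simulate the data, then look at c only.
draws = run_sampler(a=2.0, b=4.0, c=0.0, generate_a=False, generate_b=False)
mean_c = sum(c for _, _, c in draws) / len(draws)
print(mean_c)  # should be close to 0.5 * (2.0 + 4.0) for this placeholder h
```

Note that the flags and the frozen starting values have to travel through the production function signature just so the test can use them.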


I always found the above very clunky and was never comfortable with the idea. The first problem is that when we run the code for ‘real’ every flag is set to ‘True’, so the Boolean variables are redundant outside of testing. The other issue is that I have to pass in the expected values for $$a$$ and $$b$$. I know these values when I am testing, because the tests are run on simulated data, but we do not know them for a real dataset.

The Software Engineering Idea: Method Stub

As the years went by, out of curiosity and interest, I started reading about software engineering ideas, and one of the ideas I encountered was that of a ‘Method Stub’. The idea is that in our test we substitute a ‘fake’ method that always performs as expected in place of the real method. Thus, instead of calling f(b, c), the test would call fake_f(b, c), which always returns the correct value for a. Implementing such a strategy has two benefits: one, we can test each function in isolation by faking the remaining functions, and two, we do not need any redundant flags in our actual code and do not need to pass in any values for $$a$$, $$b$$ and $$c$$. The testing strategy then becomes:

Write fake_f(b, c) such that it always returns the correct value for a.

Write fake_g(a, c) such that it always returns the correct value for b.

Tell the testing framework: When you encounter f(b, c), use fake_f(b, c) instead of f(b, c)

Tell the testing framework: When you encounter g(a, c), use fake_g(a, c) instead of g(a, c)

Iterate 10000 times:
    a = f(b, c)
    b = g(a, c)
    c = h(a, b)


We now have clean code that can be tested easily without introducing redundant Boolean flags or passing in actual values for $$a$$, $$b$$ and $$c$$.
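
One way to sketch the stub strategy in Python is with `patch.object` from the standard library's `unittest.mock`. This is only an illustration under stated assumptions: the `Sampler` class, its placeholder Gaussian updates, and the "true" values `TRUE_A` and `TRUE_B` are all hypothetical and exist just to make the example run. Wrapping f, g and h as methods of a class is a choice made here so that there is a single object to patch.

```python
import random
from unittest.mock import patch

TRUE_A, TRUE_B = 2.0, 4.0  # assumed values used to simulate the test data

class Sampler:
    """Toy sampler; the three updates are hypothetical placeholders."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def f(self, b, c):
        return self.rng.gauss(b + c, 1.0)

    def g(self, a, c):
        return self.rng.gauss(a - c, 1.0)

    def h(self, a, b):
        return self.rng.gauss(0.5 * (a + b), 1.0)

    def run(self, a, b, c, n_iter=10000):
        # The production loop: no flags, no test-only arguments.
        draws = []
        for _ in range(n_iter):
            a = self.f(b, c)
            b = self.g(a, c)
            c = self.h(a, b)
            draws.append((a, b, c))
        return draws

# Stubs that always return the known, correct values.
def fake_f(self, b, c):
    return TRUE_A

def fake_g(self, a, c):
    return TRUE_B

def test_h_in_isolation():
    # Inside the with-block, calls to f and g are routed to the stubs.
    with patch.object(Sampler, "f", fake_f), patch.object(Sampler, "g", fake_g):
        draws = Sampler(seed=1).run(a=0.0, b=0.0, c=0.0)
    assert all(a == TRUE_A and b == TRUE_B for a, b, _ in draws)
    mean_c = sum(c for _, _, c in draws) / len(draws)
    # For this placeholder h, c should average about 0.5 * (TRUE_A + TRUE_B).
    assert abs(mean_c - 0.5 * (TRUE_A + TRUE_B)) < 0.1
    return mean_c

mean_c = test_h_in_isolation()
```

The real `f` and `g` are restored automatically when the `with` block exits, so the stubs never leak into production runs.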