Contrary to what you may be thinking, this is not a tale of an inexperienced coder pretending to know what they’re doing. I have something even better for you.
It all begins in the dead of night, at my workplace. In front of me is a typical programmer’s desk - two computers, three monitors (one of which isn’t even plugged in), a mess of storage drives, SD cards, 2FA keys, and an arbitrary RPi 4, along with a host of items that most certainly don’t belong on my desk, and a tangle of cables that would give even a rat a migraine. My dev laptop is sitting idle on the desk, while I stare intently at the screen of a system running a battery of software tests. In front of me is the logs of a failed script run.
Generally when this particular script fails, it gives me some indication as to what went wrong. There are thorough error catching measures (or so I thought) throughout the code, so that if anything goes wrong, I know what went wrong and where. This time though, I’m greeted by something like this:
$ systemctl status test-sh.service
test-sh.service - does testing things
...
May 20 23:00:00 desktop-pc systemd[1]: Starting test-sh.service - does testing things
May 20 23:00:00 desktop-pc systemd[1]: test-sh.service: Failed with result ‘exit-code’.
May 20 23:00:00 desktop-pc systemd[1]: Failed to start test-sh.service.
I stare at the screen in bewilderment for a few seconds. No debugging info, no backtraces, no logs, not even an error message. It’s as if the script simply decided it needed some coffee before it would be willing to keep working this late at night. Having heard the tales of what happens when you give a computer coffee, I elected to try a different approach.
$ vim /usr/bin/test-sh
1 #!/bin/bash
2 #
3 # Copyright 2024 ...
4 set -u;
5 set -e;
Before I go into what exactly is wrong with this picture, I need to explain a bit about how Bash handles the ultimate question of life, “what is truth?”
(RED ALERT: I do not know if I’m correct about the reasoning behind the design decisions I talk about in the rest of this article. Don’t use me as a reference for why things work like this, and please correct me if I’ve botched something. Also, a lot of what I describe here is simplified, so don’t be surprised if you notice or discover that things are a bit more complex in reality than I make them sound like here.)
Bash, as many of you probably know, is primarily a “glue” language - it glues applications to each other, it glues the user to the applications, and it glues one’s sanity to the ceiling, far out of the user’s reach. As such, it features a bewildering combination of some of the most intuitive and some of the least intuitive behaviors one can dream up, and the handling of truth and falsehood is one of these bewildering things.
Every command you run in Bash reports back whether or not what it did “worked”. (“Worked” is subjective and depends on the command, but for the most part if a command says “It worked”, you can trust that it did what you told it to, at least mostly.) This is done by means of an “exit code”, which is nothing more than a number between 0 and 255. If a program exits and hands the shell an exit code of 0, it usually means “it worked”, whereas a non-zero exit code usually means “something went wrong”. (This makes sense if you know a bit about how programs written in C work - if your program is written to just “do things” and then exit, it will default to exiting with code zero.)
Because zero = good and non-zero = not good, it makes sense to treat zero as meaning “true” and non-zero as meaning “false”. That’s exactly what Bash does - if you do something like “if command; then commandIfTrue; else commandIfFalse; fi
”, Bash will run “commandIfTrue
” if “command
” exits with 0, and will run “commandIfFalse
” if “command
” exits with 1 or higher.
Now since Bash is a glue language, it has to be able to handle it if a command runs and fails. This can be done with some amount of difficulty by testing (almost) every command the script runs, but that can be quite tedious. There’s a (generally) easier way however, which is to tell the script to immediately exit if any command exits with a non-zero exit code. This is done by using the command “set -e
” at or near the top of the script. Once “set -e
” is active, any command that fails will cause the whole script to stop.
So back to my script. I’m using “set -e
” so that if anything goes wrong, the script stops. What could go wrong other than a failed command? To answer that question, we have to take a look at how some things work in C.
C is a very different language than Bash. Whereas Bash is designed to take a bunch of pieces and glue them together, C is designed to make the pieces themselves. You can think of Bash as being a glue gun and C as being a 3d printer. As such, C does not concern itself nearly as much with things like return codes and exiting when a command fails. It focuses on taking data and doing stuff with it.
Since C is more data- and algorithm-oriented, true and false work significantly differently here. C sees 0 as meaning “none, empty, all bits set to 0, etc.” and thus treats it as meaning “false”. Any number greater than 0 has a value, and can be treated as “on” or “true”. An astute reader will notice this is exactly the opposite of how Bash works, where 0 is true and non-zero is false. (In my opinion this is a rather lamentable design decision, but sadly these behaviors have been standardized for longer than I’ve been alive, so there’s not much point in trying to change them. But I digress.)
C also of course has features for doing math, called “operators”. One of the most common operators is the assignment operator, “=
”. The assignment operator’s job is to take whatever you put on the right side of it, and store it in whatever you put on the left side. If you say “a = 0
”, the value “0” will be stored in the variable “a” (assuming things work right). But the assignment operator has a trick up its sleeve - not only does it assign the value to the variable, it also returns the value. Basically what that means is that the statement “a = 0
” spits out an extra value that you can do things with. This allows you to do things like “a = b = 0
”, which will assign 0 to “b”, return zero, and then assign that returned zero to "a”. (The assignment of the second zero to “a” also returns a zero, but that simply gets ignored by the program since there’s nothing to do with it.)
You may be able to see where I’m going with this. Assigning a value to a variable also returns that value… and 0 means “false”… so “a = 0
” succeeds, but also returns what is effectively “false”. That means if you do something like “if (a = 0) { ... } else { explodeComputer(); }
”, the computer will explode. “a = 0
” returns “false”, thus the “if” condition does not run and the “else” condition does. (Coincidentally, this is also a good example of the “world’s last programming bug” - the comparison operation in C is “==
”, which is awfully easy to mistype as the assignment operator, “=
”. Using an assignment operator in an “if” statement like this will almost always result in the code within the “if” being executed, as the value being stored in the variable will usually be non-zero and thus will be seen as “true” by the “if” statement. This also corrupts the variable you thought you were comparing something to. Some fear that a programmer with access to nuclear weapons will one day write something like “if (startWar = 1) { destroyWorld(); }
” and thus the world will be destroyed by a missing equals sign.)
“So what,” you say. “Bash and C are different languages.” That’s true, and in theory this would mean that everything here is fine. Unfortunately theory and practice are the same in theory but much different in practice, and this is one of those instances where things go haywire because of weird differences like this. There’s one final piece of the puzzle to look at first though - how to do math in Bash.
Despite being a glue language, Bash has some simple math capabilities, most of which are borrowed from C. Yes, including the behavior of the assignment operator and the values for true and false. When you want to do math in Bash, you write “(( do math here... ))
”, and everything inside the double parentheses is evaluated. Any assignment done within this mode is executed as expected. If I want to assign the number 5 to a variable, I can do “(( var = 5 ))
” and it shall be so.
But wait, what happens with the return value of the assignment operator?
Well, take a guess. What do you think Bash is going to do with it?
Let’s look at it logically. In C (and in Bash’s math mode), 0 is false and non-zero is true. In Bash, 0 is true and non-zero is false. Clearly if whatever happen within math mode fails and returns false (0), Bash should not misinterpret this as true! Things like “(( 5 == 6 ))
” shouldn’t be treated as being true, right? So what do we do with this conundrum? Easy solution - convert the return value to an exit code so that its semantics are retained across the C/Bash barrier. If the return value of the math mode statement is false (0), it should be converted to Bash’s concept of false (non-zero), therefore the return value of 0 is converted to an exit code of 1. On the other hand, if the return value of the math mode statement is true (non-zero), it should be converted to Bash’s concept of true (0), therefore the return value of anything other than 0 is converted to an exit code of 0. (You probably see the writing on the wall at this point. Spoiler, my code was weighed in the balances and found wanting.)
So now we can put all this nice, logical, sensible behavior together and make a glorious mess with it. Guess what happens if you run “(( var = 0 ))
” in a script where “set -e
” is enabled.
“0” is assigned to “var”.
The statement returns 0.
Bash dutifully converts that to a 1 (false/failure).
Bash now sees the command as having failed.
“
set -e
” says the script should immediately stop if anything fails.The script crashes.
You can try this for yourself - pop open a terminal and run “set -e; (( var = 0 ));
” and watch in awe as your terminal instantly closes (or otherwise shows an indication that Bash has exited).
So back to the code. In my script, I have a function that helps with generating random numbers within any specified bounds. Basically it just grabs the value of “$RANDOM
” (which is a special variable in Bash that always returns an integer between 0 and 32767) and does some manipulations on it so that it becomes a random number between a “lower bound” and an “upper bound” parameter. In the guts of that function’s code I have many “math mode” statements for getting those numbers into shape. Those statements include variable assignments, and those variable assignments were throwing exit codes into the script. I had written this before enabling “set -e
”, so everything was fine before, but now “set -e
” was enabled and Bash was going to enforce it as ruthlessly as possible.
While I will never know what line of code triggered the failure, it’s a fairly safe bet that the culprit was:
88 (( _val = ( _val % ( _adj_upper_bound + 1 ) ) ));
This basically takes whatever is in “_val” , divides it by “_adj_upper_bound + 1”, and then assigns the remainder of that operation to “_val”. This makes sure that “_val” is lower than “_adj_upper_bound + 1”. (This is typically known as a “getting the modulus”, and the “%” operator here is the “modulo operator”. For the math people reading this, don’t worry, I did the requisite gymnastics to ensure this code didn’t have modulo bias.) If “_val” happens to be equal to “_adj_upper_bound + 1”, the code on the right side of the assignment operator will evaluate to 0, which will become an exit code of 1, thus exploding my script because of what appeared to be a failed command.
Sigh.
So there’s the problem. What’s the solution? Turns out it’s pretty simple. Among Bash’s feature set, there is the profoundly handy “logical or operator”, “||
”. This operator lets us say “if this OR that is true, return true.” In other words, “Run whatever’s on the left hand of the ||. If it exits 0, move on. If it exits non-zero, run whatever’s on the right hand of the ||. If it exits 0, move on and ignore the earlier failure. Only return non-zero if both commands fail.” There’s also a handy command in Bash called “true
” that does nothing except for give an exit code of 0. That means that if you ever have a line of code in Bash that is liable to exit non-zero but it’s no big deal if it does, you can just slap an “|| true
” on the end and it will magically make everything work by pretending that nothing went wrong. (If only this worked in real life!) I proceeded to go through and apply this bandaid to every standalone math mode call in my script, and it now seems to be behaving itself correctly again. For now anyway.
tl;dr: Faking success is sometimes a perfectly valid way to solve a computing problem. Just don’t live the way you code and you’ll be alright.