Searching for a Variable String with Spaces in AWK

by Barron Bichon


I had a hell of a time figuring out how to get an awk command to work today. Here’s what happened.

I have a text file that looks like this:

Number       Peak Force         Energy  Mean Pressure  Name
     1           14.692        117.728        587.977  01 Cylinder Flange
     2          536.018         43.994        587.956  02 Bottle Cap
     3          534.897         42.768        587.955  03 Bottle Cap
     4           21.231         21.175        587.966  04 Bottle Nozzle
     5           11.211         18.024        588.062  05a Half Lateral
     6           14.246         23.892        587.966  05b Quarter Lateral
     7            1.948          5.834        596.190  06 Tee
     8            7.698          0.698        596.248  07 Header

And I want to write a bash script that sums the Energy over any subset of the test points; for reasons that aren’t important here (just trust me), the test points must be specified by Name. You can imagine that if you were using awk directly from the command line and wanted to know the Energy at a given test point, you could simply do something like:

awk '/02 Bottle Cap/ {print $3}' file

However, I won’t know the names in advance. They’ll be passed as arguments to the script, which means I have variable strings and, worse, strings with spaces in them. For instance, if I wanted to sum the Energy for the test points with Numbers 4 through 6, I’d execute something like this from the command line:

sum_energy.sh "04 Bottle Nozzle" "05a Half Lateral" "05b Quarter Lateral"

I spent a long time fighting with all sorts of combinations of single quotes, double quotes, brackets, braces, and dollar signs to get this to work, but I’ll save you the headache. Here is the answer[1]:

sum=0
for i in "$@"; do
    # Close the single-quoted awk program, splice in "$i", then reopen it
    energy=$(awk '/'"$i"'/ {print $3}' "$file")
    sum=$(echo "scale=4; $sum + $energy" | bc)
done
echo "$sum"
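
For what it’s worth, awk can also take the string through its -v option, which sidesteps the quote gymnastics entirely. Here is a sketch of that variant (same loop, same $file as above); index() does a literal substring match rather than treating the name as a regular expression:

sum=0
for i in "$@"; do
    # Pass the name in as an awk variable instead of splicing it into the program text
    energy=$(awk -v name="$i" 'index($0, name) { print $3 }' "$file")
    sum=$(echo "scale=4; $sum + $energy" | bc)
done
echo "$sum"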

I was surprised at how little help I could find on the internet for working with variables in awk commands. Eventually, I found a page on Heiner’s SHELLdorado, which led me to the answer above. So this blog post serves two purposes: a) it will help me find the solution and re-solve this problem should it arise again in the future, and b) hopefully this additional link to SHELLdorado will help that page appear more prominently in future Google searches for everyone.


  1. I am almost positive this could be converted to an awk one-liner, but I really don’t care. I know I could have awk do the summation, and probably execute the loop, too, but this is readable and, really, by the time I got this working, I’d had enough.  ↩
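
For the curious, a rough sketch of that all-awk version (untested; it joins the script’s arguments into one alternation pattern and lets awk keep the running sum itself):

pattern=$(IFS='|'; echo "$*")
awk -v pat="$pattern" '$0 ~ pat { sum += $3 } END { printf "%.4f\n", sum }' "$file"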


NFL Stance

by Barron Bichon


Today, Dan Lewis, in his great Now I Know daily email[1] wrote about a strange occurrence in the broadcast of a famous Super Bowl game. As usual, it was an interesting story, but my favorite part was the Bonus Fact:

Super Bowl XXVII took place on January 31, 1993, in Pasadena, California. But it was originally planned to take place in Tempe, Arizona. In March of 1990, the NFL awarded the game to Tempe, under the condition that the state of Arizona begin to celebrate Martin Luther King, Jr.’s birthday. But later in that year, voters rejected the holiday via statewide referendum (by overwhelming majority — 76% against), and the NFL made good on its threat.

It’s fascinating to me that a) the NFL made this request, b) they had the guts to stick with it and actually move the Super Bowl when the request wasn’t met, and c) the people of Arizona had such a problem with this holiday, and apparently with MLK himself; things get even weirder when you read more about it[2].


  1. Subscribe. No, really. Thank me later.  ↩

  2. That link was also provided by Dan. Seriously, subscribe.  ↩


CCC, See?

by Barron Bichon


This morning I read through Dr. Drang’s detailed write-up on how he uses a combination of shell scripts, plist files, launchd, Lingon, and Super Duper to perform nightly backups. It is an interesting read and a creative solution, but mostly it just made me happy that I use Carbon Copy Cloner instead of Super Duper.[1] My setup is much simpler.

The central problem is that the good doctor and I prefer to keep the backup drive unmounted, mount it immediately before the backup starts, and unmount it immediately after it completes.[2] Apparently this is not easy to do with Super Duper. With CCC, though, it is all but done for you.

First, “CCC will attempt to mount your source and destination volumes before the backup task begins” automatically, with no need to set anything or call a script. Second, when scheduling a task, in the “Before & After” tab there is a pre-canned option to “Unmount the destination volume” after the backup completes. Select this and you’re done. You can see these options highlighted in the screenshot below.


So now some more detail on how I use CCC. On my Mac Pro (not the computer using the setup shown below) I have three hard drives: the boot drive, a Time Machine backup drive, and a clone drive created by CCC. This gives me hourly[3] backups of incremental changes via Time Machine, plus a bootable exact clone that is created nightly. This has me covered in case a) I stupidly delete a file I wish I could get back or b) I have a drive failure. I also have another nightly clone made to a remote drive in case the building burns down.

For my laptop (and currently my main computer), I have to do something different because I obviously don’t have two additional drives installed. This computer is an 11″ MacBook Air connected to a Thunderbolt display. The display then has the backup drive connected to it (also via Thunderbolt; I bought this one). So I only have one drive, but I’d like to have both a bootable clone and incremental backups. This brings me to another reason I love CCC: when defining the backup job to run, there is an option to archive older versions of updated files. These are placed alongside the rest of the clone drive’s contents in a “_CCC Archives” directory. Within it are date-stamped directories, each holding the older versions of the files with their complete directory structure intact. If I need to restore a file, I only need to mount the drive, navigate through those directories to find it, and copy it back. Because this is how I get incremental backups, the job has to run more often than nightly. You can see in the screenshot[4] below that it runs every three hours, mimicking (to the extent I can) my Mac Pro[5] setup.


  1. I’ve never understood why Super Duper is so much more popular than CCC. They solve essentially the same problem, but CCC has been rock solid, is easy to work with, is well supported, and for a long time it was free (though I’m glad to see it isn’t anymore; I donated long ago).  ↩

  2. There are myriad reasons for this, but the reason this is an absolute necessity for me is to prevent the App Store app from thinking I need to update software that it finds on the backup drive, which I find incredibly annoying.  ↩

  3. For what I do, hourly backups are overkill and I was annoyed by the additional drive thrashing, so I use TimeMachineEditor to set a pace better suited to my use case and personal proclivities.  ↩

  4. Yes, the computer is named Tyrion. It is a small, but powerful, computer and I am a Game of Thrones fan.  ↩

  5. The Mac Pro is named MacGyver.  ↩


Falling in Love with Python

by Barron Bichon


As I mentioned in a previous post, I know my way around Python, but I’m still learning the right way to use it. I’m still exploring the strengths and benefits of the language, how it simplifies procedures that are tedious or difficult in other languages, and so on. To accelerate this learning, I’m forcing myself to reach for Python first when confronted with a new coding task. It’s going really well.

In fact, I’m falling in love with Python. And like anyone falling madly in love, I want to scream it from the rooftops and blabber on about all the beautiful features that make up this object of my affection to anyone and everyone who will listen. This is the first of such posts; I’m sure there will be more.

To be clear, there is nothing groundbreaking here. This is Python 101 stuff. If you are already familiar with the language, you might even find this boring (though I weep for your hardened heart that you can no longer find joy in these niceties). These are the little things that have really stood out to me that will keep me coming back to Python over and over. On to the code.

Recently, I’ve had cause to work with a colleague to create a DOE (design of experiments) for a client. The details aren’t important (which is good, because I couldn’t tell you anyway), but once we have the list of experiments we want to perform, we need to schedule the work.

Imagine we are trying to understand how variations in a cookie recipe affect the taste. We want to look at many different variables: oven temperature, cook time, salt quantity, number of chocolate chips, cookie size, flour type, and maybe a few other things. There are lots of different combinations of these parameters, so we have a lot of cookies to make. Whatever we can do to group these into batches will really help us execute the DOE more quickly. An obvious idea is to bake as many cookies together as possible, so it would be useful to group them by the unique combinations of oven temperature and cook time (regardless of the other parameters).

We’ve created a recipes NumPy array such that each row describes a single unique recipe and each column lists the particular value for each of the parameters to be used in a given recipe. For example, recipes[0,0] is the oven temperature (column 0) for the first recipe (row 0) and recipes[-1,1] is the cook time (column 1) for the last recipe (row “-1”[1]).

So how do we find all the unique combinations of oven temperature and cook time so that we can create our batches? You can envision the complexity of loops and if statements that might be required to solve this in a lot of languages. But in Python (mon amour), we need only this:

batches = sorted(set(zip(recipes[:,0],recipes[:,1])))

It’s that easy. Let’s consider this from the inside-out. zip(recipes[:,0],recipes[:,1]) creates a list of tuples from the first two columns (“0” and “1”) over all rows (“:”) of recipes. For our case of oven temps and cook times, this might look like:

[(375, 25), (350, 22), (375, 22), (400, 25), (350, 22), (400, 25), ... ]

where we have one tuple for each recipe. The set function then reduces this to just the unique tuples. The addition of sorted isn’t strictly necessary, but it makes the list a little easier to read and it’s just so easy, why not? Now, if we print batches we get:

[(350, 22), (350, 25), (375, 22), (375, 25), (400, 22), (400, 25)]
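
If you want something runnable to poke at, here is a tiny, made-up stand-in for the recipes array (the real one is much larger, so don’t expect it to reproduce the exact lists above):

import numpy as np

# A made-up stand-in for the real recipes array:
# columns are oven temp, cook time, salt (tsp), chocolate chips
recipes = np.array([
    [375, 25, 1.0, 40],
    [350, 22, 0.5, 60],
    [375, 22, 1.0, 60],
    [400, 25, 0.5, 40],
    [350, 22, 1.0, 40],
    [400, 25, 1.0, 60],
])

batches = sorted(set(zip(recipes[:, 0], recipes[:, 1])))
print(batches)   # the four unique (temp, time) pairs in this toy data, sorted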

Boom. Awesome. The simplicity (and readability) of using that one line of code to perform this complex series of operations blew me away. This was enough to hook me, but we need to go further. We now know how we can batch the recipes, but we haven’t yet determined which recipes go in each batch, which we’ll need so we can schedule the baking. If we wanted to print a list of recipes that can be baked together in each batch, we might do something like:

for batch in batches:
    bake_together = []
    for r in recipes:
        if (r[0],r[1]) == batch:
            bake_together.append(r[2:])
    print batch, bake_together

This is perfectly serviceable code and it is technically correct, but it’s a good example of what I mean by not coding Python the right way. Once you learn how, Python makes this a lot easier through the use of a list comprehension. Lines 2–5 of the above can be replaced with a single line:

for batch in batches:
    bake_together = [r[2:] for r in recipes if (r[0],r[1]) == batch]
    print batch, bake_together

It’s easy to see how these are equivalent and, once you become accustomed to it, easier to read. To make a bad pun, this type of syntactic sugar is just so, so sweet.
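
As an aside, if you want every batch’s grouping in one pass (rather than rescanning recipes once per batch), a plain dictionary does the trick too. A quick sketch, reusing the made-up recipes array from above:

from collections import defaultdict

# Collect each recipe's remaining parameters under its (temp, time) key in a single pass
groups = defaultdict(list)
for r in recipes:
    groups[(r[0], r[1])].append(r[2:])

for batch in sorted(groups):
    print("{}: {}".format(batch, groups[batch]))

For a handful of batches it hardly matters which you use; it’s just another example of how much the language gives you for free.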


  1. How cool is it that you can use negative indexes to access entries from the end of a list? I could do a whole post just on this.  ↩


Think Stats

by Barron Bichon


Recently O’Reilly ran a sale on “data science” books to celebrate the Hadoop World conference, and I picked up several. I’d been meaning to buy Python for Data Analysis and Machine Learning for Hackers; when I saw them for 50% off, it was a no-brainer. And at that price, I couldn’t help but impulse-buy a few others. Among them was Think Stats by Allen Downey. Because I expected this book to be largely review material for me, I started there.

Think Stats serves as an able introduction to a broad range of statistical topics, from the very basics of computing the mean and variance of a data set to more complicated topics like Bayesian estimation. The author’s style is to teach through analogy and example, which suits my learning style well. I’d rather jump in and get my hands dirty with some code to solve a toy problem than spend pages reading through the equations that describe the theory mathematically.

The coding examples are heavy in the early chapters and, to be quite honest, I struggled with them a bit. The author suggests working the examples in Python and provides his solutions via his website.[1] I know the basics of Python, but I’d say I’m still in the process of learning the right way to code in it. So I hacked together my perfectly serviceable solutions, went to compare them to Downey’s code, and was lost. It looked nothing like what I had written. I’m quite certain my implementation is the inferior one, but regardless, this caused a bit of a hiccup for me in processing the material. That said, if your first impulse when writing code to read and process data is to create a new type of data object to store it, implement all the operations as methods of that new object, and then write a function that performs everything you need so that main can be a single line, you’ll feel perfectly at home.[2]

But that’s my only gripe. The book is really quite effective and the examples, stories, and analogies the author uses make it an enjoyable read. Some of my favorite parts:

  • The Monty Hall problem. I don’t know how I had never heard of this before; it is apparently a classic probability puzzle. It’s a great example of how probabilities can be grossly non-intuitive but nonetheless correct.
  • Histograms are not the paragon of data visualization. The book details how to create histograms and PMFs, but makes it clear why they aren’t always sufficient and how the CDF can be a better option. I see histograms a lot in situations where they don’t belong, so I was happy to see this discussion.
  • Bayesian drug screens. I found the drug screen example, with the breakdown between the sensitivity and the specificity of the test, to be a particularly effective way to teach Bayes’s theorem (a quick worked sketch follows this list). I’ll be stealing this for sure.
  • Matplotlib. As expected, matplotlib is used throughout the book for plotting, but especially interesting were the various ways to create a scatter plot and how they can influence the viewer’s interpretation of the data. I hadn’t seen matplotlib’s hexbin capability before.
  • Correlations. The author does a great job making it clear that Pearson’s correlation only describes the strength of linear relationships (but is independent of the slope of that relationship). These details are often missed or glossed over and can be tricky to teach (I’ve tried).
  • Wikipedia. There are many, many links to Wikipedia throughout the book (these are especially useful in the ebook version). It is, as Downey says, “the bugbear of librarians everywhere”, but there is no denying its usefulness. I found these links to additional information very helpful.
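
Since it’s worth seeing the sensitivity/specificity math in action, here is a quick sketch with made-up numbers (these are not the book’s figures): say 5% of the tested population uses the drug, the test flags 95% of users (its sensitivity), and it correctly clears 90% of non-users (its specificity).

# Made-up numbers, not the ones from the book
prevalence  = 0.05   # P(user)
sensitivity = 0.95   # P(test positive | user)
specificity = 0.90   # P(test negative | non-user)

# Bayes's theorem: P(user | positive) = P(positive | user) * P(user) / P(positive)
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_user_given_positive = sensitivity * prevalence / p_positive
print(p_user_given_positive)   # 0.0475 / 0.1425, about 0.333

Even with a test that accurate, only about a third of the people who test positive are actually users, which is exactly the kind of non-intuitive result that makes the example stick.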

If you are looking for an introduction to statistics or you want to brush up on what a p-value is because you haven’t thought about it in a long time, Think Stats is a great option. I’ll definitely be keeping it on my shelf.[3]


  1. Is there any doubt that the internet has completely changed education?  ↩

  2. And who knows? Maybe after I dig in and work through books like Python for Data Analysis this will be my predilection as well.  ↩

  3. So to speak. I bought the ebook version, so I guess it’s on a virtual shelf.  ↩