Rum Boogie Café

Written by Pete Corey on Nov 6, 2017.

Recently I ran into an interesting problem while working on a project with a Memphis-based client. This interesting problem led to several hours of sleuthing through HTTP headers, combing over hex dumps, and pouring through the source of several packages.

At the end of a long day, I came to the conclusion that character encodings are important, if often overlooked things, and that assumptions do indeed make asses out of you and me.

So buckle up, and I’ll tell you a story about Rum Boogie Café! Or is it Rum Boogie Caf�? Or… RUM BOOGIE CAFÉ,"D?

Invalid JSON

The project in question involves streaming massive JSON documents into a Node.js application from a proxy service which in turn pulls the original JSON documents from an external, third-party service.

Once streamed in, the Node.js application parses the incoming JSON with the JSONStream package, and Does Things™ with the resulting data.

This process was working beautifully for several relatively small JSON documents, but when it came time to parse larger, wilder JSON documents served by the third-party service, bugs started to crawl out of the woodwork.

The first sign of trouble was this exception:


Error: Invalid JSON (Unexpected "D" at position 2762 in state STOP)
    at Parser.proto.charError (/project/node_modules/jsonparse/jsonparse.js:90:16)
    at Parser.proto.write (/project/node_modules/jsonparse/jsonparse.js:154:23)
    at Stream.<anonymous> (/project/node_modules/JSONStream/index.js:23:12)
    ...

Well, that seems like an obvious problem. The JSON must be corrupt.

But after taking a look at the raw JSON served from the external service, we can see that the section of the document in question is perfectly well formed:


...,"DESC":"RUM BOOGIE CAFÉ","DEF":"",...

So what gives?

Invalid UTF-8?

Before spending too much time with this issue, I wanted to get more data. Was this a problem with this report specifically, or all larger reports?

I tried to process another similarly large JSON document served by the external service.

This similarly large document resulted in a similar exception:


Error: Invalid JSON (Invalid UTF-8 character at position 3832 in state STRING1)
    at Parser.proto.write (/project/node_modules/jsonparse/jsonparse.js:171:31)
    at Stream.<anonymous> (/project/node_modules/JSONStream/index.js:23:12)
    ...

This time around, the jsonparse package (a dependency of the JSONStream package we’re using) is complaining about an invalid UTF-8 character.

Interesting!


At this point, I have a hunch. Is the data being returned by our proxy service utf-8 encoded?

To find out, I fired up Postman and made a request to the proxy server to pull down the first of the large JSON documents. Interestingly, the HTTP response wasn’t specifying a character encoding, but it was returning a Content-Type of application/json which implies a default encoding of utf-8.

Let’s put this implication to the test.

We can use xxd to dump the raw hex of the JSON document being returned by the proxy service (after saving it to disk):


0074f60: 2042 4f4f 4749 4520 4341 46c9 222c 2244   BOOGIE CAF.","D

Our JSON parser is failing at the D at the end of this line. To verify that this is actually utf-8 encoded text, we’ll copy the relevant hex values for the line into a buffer in a new Node.js program:


let buffer = Buffer.from([
    0x43, // 'C'
    0x41, // 'A'
    0x46, // 'F'
    0xc9, // 'É'
    0x22, // '"'
    0x2c, // ','
    0x22, // '"'
    0x44, // 'D'
]);

Next, we can print the buffer, decoding it as utf-8:


console.log(buffer.toString('utf-8'));

This gives us:


CAF�

The wrong character…

Digging deeper, I realized that the proxy service was mangling the response headers of the external service it was proxying for. Thankfully, this was an easy fix.

Soon the Content-Type header of the newly-fixed proxy service revealed that the JSON documents were encoded with ISO-8859-1 (which Node.js refers to as latin1).


console.log(buffer.toString('latin1'));

Decoding our buffer with latin1 gives us…


CAFÉ

The right character! Victory!

Well, not really; our application is still broken. At least we know that we’re dealing with latin1 encoded text, not utf-8.

Going Spelunking

So now we know that the stream we’re passing into JSONStream, and ultimately jsonparse is latin1 encoded, not utf-8 encoded.

Why is this a problem?

Taking a look at the jsonparse source, we can see quite a few places where the code is making the assumption that any data streamed in will be utf-8 encoded.

Let’s trace through this code and find out what happens when it processes our latin1-encoded É character (remember, É has a hex value of 0xc9 and a decimal value of 201).

Let’s assume we’re in the process of working through a streamed in buffer. Our É character is within a JSON string, so we’d be in the STRING1 state when we encounter it. Let’s also assume we have no bytes_remaining for now.

The value of É is greater than 128 (201), so we’d fall into the “parse multi byte” block. Because É (201) is greater than 194 and less than 223, bytes_in_sequence would be set to 2. A few lines later, this 2 in bytes_in_sequence prompts jsonparse to swallow the next two bytes (",) from the buffer and include them as part of the current string.

Unfortunately, one of the characters that’s mistakenly swallowed is the JSON string’s terminating quote. The parser happily continues on until it finds another quote, the opening quote for the "DEF" string, and uses that as the closing quote for the current string.

At this point, the parser expects a valid starting character like a comma or a closing bracket. Instead, it finds our ill-fated D and throws a familiar exception:


Error: Invalid JSON (Unexpected "D" at position 2762 in state STOP)

Interestingly, the value of the wrongly-encoded character affects the behavior of this bug.

For example, if we were to pass in a non-breaking space with a latin1 character code of 160, an Invalid UTF-8 exception would be thrown.

Similarly, a character like ð with a latin1 character code of 240 would result in four characters being swallowed by the parser.

Fixing the Issue

Now that we know what the problem is, fixing it is simple. We’re streaming ISO-8859-1, or latin1 encoded data from our proxy service, but our streaming JSON parser expects the data to be utf-8 encoded.

We’ll need to re-encode our data into utf-8 before passing it into our parser.

This sounds daunting, but thankfully libraries like iconv-lite make it a very simple process. Especially when you’re using streams.

Assuming that our original setup looks something like this:


getDocumentStream(...)
    .pipe(JSONStream.parse('*'));

We can easily pipe our document stream through a conversion stream before handing it off to our parser:


getDocumentStream(...)
    .pipe(iconv.decodeStream('ISO-8859-1'))
    .pipe(JSONStream.parse('*'));

And with that, all is right in the world. Everything works as expected, and we can get back to bigger and better things.

Final Thoughts

So in hindsight, was this a bug?

Probably not.

Neither the JSONStream documentation nor the jsonparse documentation make it explicit that your stream needs to be utf-8 encoded, but this seems like a reasonable assumption on their end.

Instead, I think this was a complicated set of misunderstandings and faulty assumptions that led to some bizarre, but technically correct behavior.

The moral of the story is that if you’re dealing with strings, you need to know how they’re encoded. Most developers keep string encodings out of sight and out of mind, but when things go wrong they can lead to time consuming and confusing bugs.

Next time you’re in Memphis, be sure to stop by RUM BOOGIE CAFÉ,"D!

Grokking the Y Combinator with Elixir

Written by Pete Corey on Oct 30, 2017.

I’ve been re-reading one of my favorite programming books, The Little Schemer, and I had an epiphany in one of the final chapters.

Among many other mind-bending topics, the book goes through a derivation of the Y combinator. In the past, any technical explanations of the Y combinator have left me with a brain folded into a pretzel and a throbbing headache. This time, however, something clicked.

But with the help of the quirky guiding dialog of The Little Schemer, I feel like I finally understand the Y combinator.

I want to share that understanding with you, my Elixir-loving friend!

In this article we’ll briefly go over what exactly the Y combinator is, and then dive deep into how we can derive it using anonymous Elixir functions. This is definitely a brain-stretcher of an article, and I hope you find it as interesting as I do.

But Why? (Pun Intended)

Why is it important that you know about the Y combinator? Honestly, unless you’re a professional computer scientist, it’s not. When it comes to day-to-day web development, the applicable benefits of knowing what the Y combinator is and how it works are practically nonexistent.

Putting practicality aside, the Y combinator is one of the most interesting objects in all of computer science.

In the most basic sense, the Y combinator allows computer scientists to define recursive functions in Lambda Calculus, a language that fundamentally doesn’t support recursion. The way in which it does this is one of the most clever and intricately beautiful solutions to any programming problem I’ve ever seen.

For me, the Y combinator is something like the Mandelbrot set, or Conway’s Game of Life. It’s something to be marveled at; something to be revered.

Deriving An Example

Let’s start our dive into the world of the Y combinator by coming up with a recursive function we want to build without using recursion!

We’ll use a simple classic:


def length([]),    do: 0
def length([h|t]), do: 1 + length(t)

Defined as an anonymous function, length computes the length of a list:


length([])     # 0
length([1])    # 1
length([1, 1]) # 2, etc...

When length is passed an empty list, we return a length of 0. When its passed a non-empty list, we return 1 plus the length of the tail of the list. We’re able to create this type of recursive definition because when defining functions using def and defp, Elixir is kind enough to allow the defined function to make direct references and calls to itself.

However, what if we were to try to write a recursive anonymous function?


length = fn
  []    -> 0
  [h|t] -> 1 + length.(t)
end

Here we’re binding length to an anonymous function. Within that anonymous function we attempt to recurse by calling length. Unfortunately, at this point length is not in scope, and so Elixir blows up with a compilation error:

** (CompileError) iex:9: undefined function length/0
    (stdlib) lists.erl:1354: :lists.mapfoldl/3
    (stdlib) lists.erl:1355: :lists.mapfoldl/3

Is it possible to write a recursive anonymous function? Put another way, can you write a recursive function without being able to directly refer to yourself?

Yes! Behold the power of the Y combinator!

Down the Rabit Hole

Before we start our deep dive, let’s pull back and simplify things a bit. While our recursive call to our anonymous length function doesn’t work, part of our length function is still correct.

Let’s pull out the recursive piece to highlight what we mean:


length = fn
  []    -> 0
  [h|t] -> 1 + (fn _ -> raise "oops" end).(t)
end

Even with the recursive piece removed, we’re able to calculate the length of empty lists:


length.([])   # 0
length.([1])  # oops

We could add support for lists of length 1 by substituting the function that raises an exception (our “raising function”) with another identical copy of our length function:


length = fn
  []    -> 0
  [h|t] -> 1 + (fn
                  []    -> 0
                  [h|t] -> 1 + (fn _ -> raise "oops" end).(t)
                end).(t)
end

Now our function works for lists of length 0 and 1, but blows up on lists two or more elements long:


length.([])     # 0
length.([1])    # 1
length.([1, 2]) # oops

We could add support for even longer lists by continuously replacing the raising function with additional copies of our length function:


length = fn
  []    -> 0
  [h|t] -> 1 + (fn
                  []    -> 0
                  [h|t] -> 1 + (fn
                                  []    -> 0
                                  [h|t] -> 1 + (
                                                  ...
                                               ).(t)
                                end).(t)
                end).(t)
end

While this works, it’s obviously infeasible. If we wanted to calculate the length of a one thousand element list, we’d need to chain at least one thousand function definitions together.

Not only that, but we’ll always have an upper bound using this technique. At some point, our chain will end and our raising function will throw an exception.

We need to do better.

Don’t Repeat Yourself

If you look at the code above, you’ll notice lots of repetition. At each level of “recursion” we define an anonymous function that looks very similar to our original length function.

Every level is identical except for the function we’re calling into in the next level. Sometimes we call into another length-like function, but at the very end we call into our raising function.

We can cut down on all that repeated code by writing a function that makes each level of this length stack. Our make_length function takes the bit that changes in each level, the next function to call into, as an argument:


make_length = fn
  length ->
    fn
      []    -> 0
      [h|t] -> 1 + length.(t)
    end
end

Now we can use make_length to write a length function that works on empty lists:


make_length.(fn _ -> raise "oops" end)

And similarly, we can use make_length to write a length function that works on lists of length zero or one:


make_length.(make_length.(fn _ -> raise "oops" end))

This works because we create our zero-length function and pass it into make_length, which wraps it in another application of our length-like function.

We can even use Elixir pipes to make things a bit cleaner:


(fn _ -> raise "oops" end)
|> (make_length).()
|> (make_length).()

Without using the make_length binding, we could define a totally anonymous version of length that works on lists up to three elements long:


(fn
  make_length ->
    (fn _ -> raise "oops" end)
    |> (make_length).() # support for length 1
    |> (make_length).() # support for length 2
    |> (make_length).() # support for length 3
end).(fn
  length ->
    fn
      []    -> 0
      [h|t] -> 1 + length.(t)
    end 
end)

Now we’re getting somewhere!

While we’ve managed to cut out most of the repetition of our original solution, we’re still facing the problem of being restricted to a finite number of recursions.

Building the Road as we Go

In our current solution, we need to define, in advance, the maximum list length our anonymous length function will support.

Every repeated application of make_length adds another layer to our growing stack of functions. But if we’re passed a long enough list that chews through that stack, the raising function is called and our length calculation grinds to a screeching halt.

To solve the last piece of this puzzle, we need to think very carefully about what it is we’re doing…


Every time our length-like function is called it decides whether or not it needs to recurse. If it does, it calls the length function tied to its closure. If it doesn’t, it simply returns a value which trickles its way back up the call stack.

The danger is that the length function might actually point to our raising function.

Instead of throwing an exception when we hit that last function in the stack, we’d really like to add another layer of our length-like function and call that instead. Really, we only ever need two functions in our stack. The current function, and the next function we want to call into.

Each layer of our length-like function is created by passing make_length into make_length. Let’s simplify our stack creation by pulling out our raising function and simply returning the next layer of our length-like function:


(fn
  make_length ->
    make_length.(make_length) # Add another layer
end).(fn
  length ->
    fn
      []    -> 0
      [h|t] -> 1 + length.(t)
    end 
end)

At this point we could even rename the length argument in our second anonymous function to make_length to remind ourselves of what’s not being passed into it.


(fn
  make_length ->
    make_length.(make_length)
end).(fn
  make_length -> # We're being passed make_length
    fn
      []    -> 0
      [h|t] -> 1 + make_length.(t)
    end 
end)

But there’s a problem here.

Now we’re trying to pass the tail of our list (t) into make_length. This doesn’t make much sense. The make_length function expects a function as an argument, and returns a function that accepts a list.

What could we pass into make_length to build the next layer of our recursion? Take a minute, and remember that at this point make_length represents the entirety of our second anonymous function:


(fn
  make_length ->
    fn
      []    -> 0
      [h|t] -> 1 + make_length.(t)
    end 
end)

Well, couldn’t we just pass make_length into make_length? Yes!

Passing make_length into make_length returns another layer of our length-like function that accepts a list as an argument:


(fn
  make_length ->
    make_length.(make_length)
end).(fn
  make_length ->
    fn
      []    -> 0
      [h|t] -> 1 + (make_length.(make_length)).(t)
    end 
end)

This new layer of our length-like function still has access to make_length, so if it decides to recurse, it has everything it needs to build and apply the next layer of length. We don’t need to build our entire tower of length functions up-front. We can construct each layer as we go!

Amazing… Does your brain hurt yet?

Each generation would be both content and vessel, an echo in a self-sustaining reverberation. — Ted Chiang in Stories of Your Life and Others

Reclaiming Our Length Function

As it stands now, our solution is fully functional. We can calculate the length of any list, no matter its size:


(fn
  make_length ->
    make_length.(make_length)
end).(fn
  make_length ->
    fn
      []    -> 0
      [h|t] -> 1 + (make_length.(make_length)).(t)
    end 
end).([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...])

The only problem with this solution is that we’ve lost sight of the structure and shape of our original function:


fn
  []    -> 0
  [h|t] -> 1 + length.(t)
end

Instead, we’ve had to modify that function to include strange implementation-specific calls to make_length:


fn
  []    -> 0
  [h|t] -> 1 + (make_length.(make_length)).(t)
end

If we could reclaim the original shape of our length function, we could theoretically factor it out and create a generic utility that could be used to make any function recursive without modification.

The first step towards our goal is refactoring the code that makes our next application of length ((make_length.(make_length))) into a function of its own:


fn
  t ->
    (make_length.(make_length)).(t)
end

This function accepts t (the tail of our list) as an argument and applies t to the next layer of length after it’s been created.

The result of this function is actually length! We can write another closure that accepts length as an argument and pass this new function into it. The result is slightly longer, but we’ve managed to reclaim the structure of our original length function:


(fn
  make_length ->
    make_length.(make_length)
end).(fn
  make_length ->
    (fn
      length ->
        fn
          []    -> 0
          [h|t] -> 1 + length.(t)
        end
    end).(fn
      t ->
        (make_length.(make_length)).(t)
    end)
end)

Our reclaimed length function doesn’t make any references to make_length, so we’re free to untangle it from this rat’s nest of anonymous functions.

Let’s wrap the entire block in one more closure that accepts our length function as a parameter:


(fn
  length ->
    (fn
      make_length ->
        make_length.(make_length)
    end).(fn
      make_length ->
        length.(fn
          t ->
            (make_length.(make_length)).(t)
        end)
    end)
end).(fn
  length ->
    fn
      []    -> 0
      [h|t] -> 1 + length.(t)
    end 
end)

With length factored out, let’s refactor the meat of our solution to use slightly more traditional, albeit significantly less readable argument names:


y = fn
  f ->
    (fn
      x ->
        x.(x)
    end).(fn
      x ->
        f.(fn
          t ->
            (x.(x)).(t)
        end)
    end)
end

What’s especially amazing is that we can use this helper function, y, to make any function recursive! Here’s a recursive function that calculates the Nth number in the Fibonacci series:


y.(fn
  fib ->
    fn
      0 -> 0
      1 -> 1
      n -> fib.(n-1) + fib.(n - 2)
    end 
end)

As you may have guessed, this thing that we’ve come up with has a name. It’s the Y combinator! It’s my opinion that the Y combinator is one of the most beautiful structures in all of computer science.

Final Thoughts and More Resources

I hope this article helped you understand the Y combinator in all of its recursive, fractal glory.

To be honest, I still find the Y combinator to be a difficult thing to understand. I understand each step of the derivation, but when confronted with the Y combinator in its condensed, Lambda Calculus form, I find myself unable to succinctly sum up what it does in plain English:

λf. (λx. f (x x))(λx. f (x x))

It’s all Greek to me.

If you need more help wrapping your brain around this slippery idea, I highly recommend you check out The Little Schemer by Daniel P. Friedman and Matthias Felleisen. This book opened my eyes to the beauty of the Y combinator and many other equally mind-blowing topics. Highly recommended.

If books aren’t your thing, check out this gentle introduction to the Y combinator on Computerphile, or read through this in-depth article to deepen your understanding.

Formatting with Vim Scripts

Written by Pete Corey on Oct 16, 2017.

Over the years my brain has become wired to think in Vim when it comes to editing text. For example, when I want to change an if guard, I don’t want to click, <Del><Del><Del><Del><Del><Del>. Instead I simply change (c) inside (i) parentheses (.

The Vim Way™ of doing things has become so completely ingrained into me that I find myself reaching for Vim when it comes to any and every task involving editing text.

I’ve even incorporated Vim into my writing workflow. Check out how I use Vim scripts to format markdown and Git logs into the articles on this site.

Formatting Markdown

I do the majority of my writing in Ulysses. While it’s a fantastic writing application, but it’s very stubborn about its Markdown formatting. Combined with my even more stubborn Jekyll setup, I usually have to extensively massage every article before it’s formatted well enough to post.

This usually means doing things like decorating inline code blocks with language declarations, wrapping paragraph-style code blocks in <pre><code> tags, and other general formatting changes.

Originally, I did all of this work manually. Needless to say, that was a very time consuming and error-prone process.

While researching ways of automating the process, I had a recurring thought:

I know how to make all of these changes in Vim. Can’t I just script them?

It turns out I could! A Vim “script” can be written by writing Vim commands directly into a file, and running that script file against the file you want to modify. Armed with the global and normal Vim commands, anything is possible.


As a quick introduction to this idea, imagine we have a script file called script.vim:

:%s/cat/dog/g

And now imagine we have a file we want to modify:

I love cats!
Don't you love cats?

Our script will replace all instances of cat with dog. We can run this script against our file by executing the following command and saving the resulting Vim buffer:

vim -s script.vim text

Now that we have the basics under our belt, let’s take things up a notch. Here’s the Vim script I use to transform the markdown produced by Ulysses into a format accepted by Jekyll:

:%s/\t/    /g
:%s/’/'/g
:%s/“/"/g
:%s/”/"/g
:g/^\n    /exe "normal! o<pre class='language-vim'><code class='language-vim'>\<Esc>"
:g/pre class/exe "normal! }O</code></pre>"
:g/pre class/exe "normal! V}2<"
:%s/`.\{-}`{:.language-vim}/\0{:.language-vim}/g

The first line replaces all tab characters with spaces. The next three lines replace unicode quotes with their standard ascii equivalents.

The fifth line is especially interesting. It matches on all lines who’s next line starts with four spaces. It goes into normal mode, opens a line below the matched line and inserts my <pre ...><code ...> tags.

The next line in the script finds where each code block ends and closes those tags.

If this script is saved in a file called clean-post.vim, I can run it on a markdown file with the following command:

vim -s ./clean-post.vim ./_posts/post.markdown

As an added benefit, I can review the results of the transformation in Vim before saving the changes.

Formatting Git Logs

Similarly, I have a script dedicated to formatting my literate commit posts. This script is different.

Instead of formatting Markdown generated by Ulysses, it formats an entire git log (specifically git log --reverse --full-diff --unified=1 -p .), into a human readable article:

Go
g/^Author:/d
:g/^Date:/d
:g/^commit/exe "normal! dwdwjjwwi[\<Esc>A](/commit/\<Esc>pA)\<Esc>"
:g/^---/d
:g/^index/d
:g/^diff/d
:g/^new file/d
:g/^\\ No newline/d
:g/^@@.*@@$/d
:g/^@@/exe "normal! dt@..dwc$ ...\<Esc>"
:g/^+++/exe "normal! dwd2li\<CR><pre class='language-vimDiff'><p class='information'>\<Esc>"
:g/pre class/exe "normal! A</p><code class='language-vimDiff'>\<Esc>"
:g/^    /normal! 0d4l
:g/^ # \[/normal! dl
:g/^\n    /exe "normal! o<pre class='language-vim'><code class='language-vim'>\<Esc>V}2<"
:g/pre class/exe "normal! }O</code></pre>"
gg:%s/`.\{-}`{:.language-vim}/\0{:.language-vim}/g

While this script seems intimidating, the majority of the work its doing is simple string replacements. Combined with some fancy work done in normal mode, it transforms the output of a full Git log into a beautifully formatted markdown article.

Talk about power!

Final Thoughts

Vim scripts are not pretty by any means, but they get the job done. I continually find myself amazed by the power and utility of Vim when it comes to nearly every aspect of text editing.

If you use Vim and you find yourself repeatedly making similar edits to files, I encourage you to consider scripting those edits. We should always be striving to improve our process, and writing Vim scripts is an incredibly powerful tool that we often overlook.