
Crawling for Cash with Affiliate Crawler

Written by Pete Corey on Nov 20, 2017.

Several weeks ago, I released a monster of an article titled Learning to Crawl - Building a Bare Bones Web Crawler with Elixir. I mentioned early on in that post that I was working on a small side-project that involved web crawling.

After a weekend of furious coding, the side project I mysteriously alluded to is now ready to be released to the world!

Without further ado, check out Affiliate Crawler!

Affiliate Crawler is a tool designed to help bloggers and content creators monetize their writing through affiliate and referral links. You give the tool a starting URL and it crawls through your website (following internal links), looking for links to external products and services that can be monetized through affiliate or referral programs (Amazon’s Affiliate Program, etc…).
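To make the internal versus external distinction concrete, here is a rough sketch in JavaScript. It is not how Affiliate Crawler itself is implemented; `classifyLinks` and the example URLs are hypothetical, and the only assumption is that a link counts as internal when it shares an origin with the starting URL.

```javascript
// Hypothetical helper: split the links found on a page into internal links
// (to keep crawling) and external links (candidates for monetization).
const classifyLinks = (startUrl, hrefs) => {
    const origin = new URL(startUrl).origin;
    const internal = [];
    const external = [];
    for (const href of hrefs) {
        // Resolve relative links against the starting URL.
        const url = new URL(href, startUrl);
        (url.origin === origin ? internal : external).push(url.href);
    }
    return { internal, external };
};

const result = classifyLinks("https://example.com/blog/", [
    "/about",
    "https://www.amazon.com/dp/EXAMPLE"
]);
console.log(result);
```

A crawler would keep following the internal list and check the external list against known affiliate programs.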

The project largely came out of my search for inoffensive ways of monetizing the hundreds of articles I’ve produced over the years (without making me feel sleazy). Monetization through affiliate and referral linking seems like the best solution.

After spending hours researching affiliate programs and manually grepping through my posts for monetizable links, I realized that there might be value in automating that process. And with that blast of inspiration, Affiliate Crawler was born!

Affiliate Crawler run against this blog.

As I mentioned, the first version of Affiliate Crawler only took a weekend to build. That short turnaround time was intentional. After spending nearly six months on my last project, Inject Detect, I’ve learned the value of validating ideas before pouring your soul into their development.

That said, I’m starting small with a limited feature set and a small collection of affiliate and referral programs. My goal with this release is to see if there’s any potential here. If you’re curious about the nuts and bolts behind the project, I’ve open sourced the entire thing.

Are you a blogger or a writer? Have you explored affiliate programs as a source of potential revenue? Do you think this type of tool is valuable? Let me know! I’d love to hear from you.

Being John Malkovich on Twitter

Written by Pete Corey on Nov 13, 2017.

Twitter is a weird place.

I’ll be the first to admit that expressing yourself through text is hard enough as it is. Adding a one hundred forty, now two hundred eighty, character restriction to the mix seems to inspire the worst in people. Based on how I see people treat one another on Twitter, it seems that hate is the most compressible emotion.

Maybe it was having someone explain the importance of building inclusive safe spaces, while in the same breath discussing the necessity of shared-blocking and network-based blocking (i.e. “Oh, this person follows Milo? Kick them in the shins!”). Maybe it was my fiancé reading passages of The Bhagavad Gita as I drifted off to sleep. Or maybe I’ve just watched Being John Malkovich one too many times.

Whatever the cause, I was inspired.

What if our day-in-day-out Twitter experience was forcibly injected with a healthy dose of empathy? After a few minutes of tinkering, I had a script that would literally put yourself in the e-shoes of everyone you encounter on Twitter:

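setInterval(() => {
    // Read your own name, handle, and avatar from the dashboard profile card.
    var fullname = $(".DashboardProfileCard-name").text().trim();
    var username = $(".DashboardProfileCard-screennameLink .u-linkComplex-target").text().trim();
    var pic = $(".DashboardProfileCard-avatarImage").attr("src");
    // Overwrite everyone else's details with your own.
    $(".fullname").text(fullname); // ".fullname" is an assumed selector for display names
    $(".js-user-profile-link b, .ProfileCard-screennameLink .u-linkComplex-target").text(username);
    $(".js-action-profile-avatar, .js-user-profile-link").attr("src", pic);
}, 250);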

The script simply replaces everyone’s full name, screen name, and profile picture with your own. The setInterval is a stopgap solution to make sure this replacement happens for any new tweets or modals that appear on screen during your browsing.

If you’re hesitant to run random scripts in your console, as you should be, check out this short demo:

As silly as this change might seem, browsing through Twitter like this has had a real, visceral effect on me.

What I think is especially interesting is that this script seems to (temporarily) trick my brain into asking “why did I tweet that?” instead of “why did they tweet that?”, which seems to trigger a wholly different set of criteria for evaluation and judgement.

I strongly encourage you to try it out for yourself, even if just for a few minutes.

When he sees all being as equal
in suffering or in joy
because they are like himself,
that man has grown perfect in yoga.
The Bhagavad Gita

Rum Boogie Café

Written by Pete Corey on Nov 6, 2017.

Recently I ran into an interesting problem while working on a project with a Memphis-based client. This interesting problem led to several hours of sleuthing through HTTP headers, combing over hex dumps, and poring over the source of several packages.

At the end of a long day, I came to the conclusion that character encodings are important, if often overlooked, things, and that assumptions do indeed make asses out of you and me.

So buckle up, and I’ll tell you a story about Rum Boogie Café! Or is it Rum Boogie Caf�? Or… RUM BOOGIE CAFÉ,"D?

Invalid JSON

The project in question involves streaming massive JSON documents into a Node.js application from a proxy service which in turn pulls the original JSON documents from an external, third-party service.

Once streamed in, the Node.js application parses the incoming JSON with the JSONStream package, and Does Things™ with the resulting data.

This process was working beautifully for several relatively small JSON documents, but when it came time to parse larger, wilder JSON documents served by the third-party service, bugs started to crawl out of the woodwork.

The first sign of trouble was this exception:

Error: Invalid JSON (Unexpected "D" at position 2762 in state STOP)
    at Parser.proto.charError (/project/node_modules/jsonparse/jsonparse.js:90:16)
    at Parser.proto.write (/project/node_modules/jsonparse/jsonparse.js:154:23)
    at Stream.<anonymous> (/project/node_modules/JSONStream/index.js:23:12)

Well, that seems like an obvious problem. The JSON must be corrupt.

But after taking a look at the raw JSON served from the external service, we can see that the section of the document in question is perfectly well formed:

...,"DESC":"RUM BOOGIE CAFÉ","DEF":"",...

So what gives?

Invalid UTF-8?

Before spending too much time with this issue, I wanted to get more data. Was this a problem with this report specifically, or all larger reports?

I tried to process another similarly large JSON document served by the external service.

This similarly large document resulted in a similar exception:

Error: Invalid JSON (Invalid UTF-8 character at position 3832 in state STRING1)
    at Parser.proto.write (/project/node_modules/jsonparse/jsonparse.js:171:31)
    at Stream.<anonymous> (/project/node_modules/JSONStream/index.js:23:12)

This time around, the jsonparse package (a dependency of the JSONStream package we’re using) is complaining about an invalid UTF-8 character.


At this point, I have a hunch. Is the data being returned by our proxy service utf-8 encoded?

To find out, I fired up Postman and made a request to the proxy server to pull down the first of the large JSON documents. Interestingly, the HTTP response wasn’t specifying a character encoding, but it was returning a Content-Type of application/json which implies a default encoding of utf-8.
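As a rough illustration of that header check, the sketch below reads the charset parameter from a Content-Type value and falls back to utf-8 when none is present. The helper name `charsetOf` and the sample header values are hypothetical, not part of the project.

```javascript
// Hypothetical helper: pull the charset parameter out of a Content-Type
// header value, falling back to utf-8 when none is given.
const charsetOf = (contentType, fallback = "utf-8") => {
    const match = /;\s*charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || "");
    return match ? match[1].toLowerCase() : fallback;
};

console.log(charsetOf("application/json"));
console.log(charsetOf("application/json; charset=utf-16"));
```

With no charset parameter, the first call falls back to utf-8, which is the same assumption the rest of the pipeline was silently making.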

Let’s put this implication to the test.

We can use xxd to dump the raw hex of the JSON document being returned by the proxy service (after saving it to disk):

0074f60: 2042 4f4f 4749 4520 4341 46c9 222c 2244   BOOGIE CAF.","D

Our JSON parser is failing at the D at the end of this line. To verify that this is actually utf-8 encoded text, we’ll copy the relevant hex values for the line into a buffer in a new Node.js program:

let buffer = Buffer.from([
    0x43, // 'C'
    0x41, // 'A'
    0x46, // 'F'
    0xc9, // 'É' (in latin1)
    0x22, // '"'
    0x2c, // ','
    0x22, // '"'
    0x44, // 'D'
]);

Next, we can print the buffer, decoding it as utf-8:
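A minimal sketch of that step, assuming the same eight bytes as above (rebuilt here under the hypothetical name `bytes` so the snippet runs on its own):

```javascript
// Same bytes as the buffer above, rebuilt so this snippet is self-contained.
const bytes = Buffer.from([0x43, 0x41, 0x46, 0xc9, 0x22, 0x2c, 0x22, 0x44]);

// 0xc9 announces a two-byte utf-8 sequence, but 0x22 is not a valid
// continuation byte, so the decoder substitutes U+FFFD for 0xc9.
const asUtf8 = bytes.toString("utf8");
console.log(asUtf8);
```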


This gives us:

CAF�","D


The wrong character…

Digging deeper, I realized that the proxy service was mangling the response headers of the external service it was proxying for. Thankfully, this was an easy fix.

Soon the Content-Type header of the newly-fixed proxy service revealed that the JSON documents were encoded with ISO-8859-1 (which Node.js refers to as latin1).
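To check that claim against the bytes from the hex dump, a minimal sketch (the name `rawBytes` is only for illustration) decodes them with Node.js's latin1 encoding:

```javascript
// The same eight bytes taken from the hex dump.
const rawBytes = Buffer.from([0x43, 0x41, 0x46, 0xc9, 0x22, 0x2c, 0x22, 0x44]);

// latin1 (ISO-8859-1) maps every byte to exactly one character,
// so 0xc9 decodes directly to 'É'.
const asLatin1 = rawBytes.toString("latin1");
console.log(asLatin1);
```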


Decoding our buffer with latin1 gives us…

CAFÉ","D


The right character! Victory!

Well, not really; our application is still broken. At least we know that we’re dealing with latin1 encoded text, not utf-8.

Going Spelunking

So now we know that the stream we’re passing into JSONStream, and ultimately jsonparse, is latin1 encoded, not utf-8 encoded.

Why is this a problem?

Taking a look at the jsonparse source, we can see quite a few places where the code is making the assumption that any data streamed in will be utf-8 encoded.

Let’s trace through this code and find out what happens when it processes our latin1-encoded É character (remember, É has a hex value of 0xc9 and a decimal value of 201).

Let’s assume we’re in the process of working through a streamed-in buffer. Our É character is within a JSON string, so we’d be in the STRING1 state when we encounter it. Let’s also assume we have no bytes_remaining for now.

The value of É (201) is greater than 128, so we’d fall into the “parse multi byte” block. Because 201 falls between 194 and 223, bytes_in_sequence would be set to 2. A few lines later, this 2 in bytes_in_sequence prompts jsonparse to swallow the next two bytes (",) from the buffer and include them as part of the current string.

Unfortunately, one of the characters that’s mistakenly swallowed is the JSON string’s terminating quote. The parser happily continues on until it finds another quote, the opening quote for the "DEF" string, and uses that as the closing quote for the current string.

At this point, the parser expects a comma or a closing bracket to follow the string. Instead, it finds our ill-fated D and throws a familiar exception:

Error: Invalid JSON (Unexpected "D" at position 2762 in state STOP)

Interestingly, the value of the wrongly-encoded character affects the behavior of this bug.

For example, if we were to pass in a non-breaking space with a latin1 character code of 160, an Invalid UTF-8 exception would be thrown.

Similarly, a character like ð with a latin1 character code of 240 would result in four characters being swallowed by the parser.
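The three cases above follow from how utf-8 lead bytes are classified. The sketch below is not jsonparse's code; `sequenceLength` is a hypothetical helper that applies the standard utf-8 lead-byte ranges to show why each latin1 character code behaves differently.

```javascript
// Sequence length implied by a byte when it is read as the start of a
// utf-8 character. Returns null when the byte cannot start a character.
const sequenceLength = byte => {
    if (byte < 128) return 1;                  // plain ASCII
    if (byte >= 194 && byte <= 223) return 2;  // two-byte lead
    if (byte >= 224 && byte <= 239) return 3;  // three-byte lead
    if (byte >= 240 && byte <= 244) return 4;  // four-byte lead
    return null;                               // continuation byte or out of range
};

// latin1 character codes mentioned in the text.
const latin1Codes = { "É": 201, "non-breaking space": 160, "ð": 240 };
for (const [name, code] of Object.entries(latin1Codes)) {
    console.log(name, code, sequenceLength(code));
}
```

A null result corresponds to the Invalid UTF-8 exception, while 2 and 4 correspond to the parser treating following bytes as part of one character.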

Fixing the Issue

Now that we know what the problem is, fixing it is simple. We’re streaming ISO-8859-1 (latin1) encoded data from our proxy service, but our streaming JSON parser expects the data to be utf-8 encoded.

We’ll need to re-encode our data into utf-8 before passing it into our parser.

This sounds daunting, but thankfully libraries like iconv-lite make it a very simple process. Especially when you’re using streams.

Assuming that our original setup looks something like this:


We can easily pipe our document stream through a conversion stream before handing it off to our parser:
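As a dependency-free sketch of that idea, the example below swaps in Node.js's built-in Transform stream where a conversion stream from iconv-lite would sit, and uses JSON.parse as a stand-in for JSONStream. The names `latin1ToUtf8`, `createLatin1ToUtf8Stream`, and `source` are hypothetical, and the sample document is made up from the fragment shown earlier.

```javascript
const { Readable, Transform } = require("stream");

// Re-encode one latin1 chunk as utf-8. latin1 is a single-byte encoding,
// so a character can never be split across chunk boundaries.
const latin1ToUtf8 = chunk => Buffer.from(chunk.toString("latin1"), "utf8");

// Hypothetical stand-in for a conversion stream from iconv-lite.
const createLatin1ToUtf8Stream = () =>
    new Transform({
        transform(chunk, encoding, callback) {
            callback(null, latin1ToUtf8(chunk));
        }
    });

// Hypothetical stand-in for the proxy response: latin1 bytes in two chunks.
const source = Readable.from([
    Buffer.from('{"DESC":"RUM BOOGIE CAF', "latin1"),
    Buffer.from('\u00c9","DEF":""}', "latin1")
]);

// The original setup is this same pipeline without the conversion step.
const chunks = [];
source
    .pipe(createLatin1ToUtf8Stream())
    .on("data", chunk => chunks.push(chunk))
    .on("end", () => {
        // A real application would pipe into JSONStream instead.
        console.log(JSON.parse(Buffer.concat(chunks).toString("utf8")));
    });
```

The parser downstream only ever sees utf-8 bytes, so the É arrives as a proper two-byte sequence rather than a lone 0xc9.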


And with that, all is right in the world. Everything works as expected, and we can get back to bigger and better things.

Final Thoughts

So in hindsight, was this a bug?

Probably not.

Neither the JSONStream documentation nor the jsonparse documentation makes it explicit that your stream needs to be utf-8 encoded, but this seems like a reasonable assumption on their end.

Instead, I think this was a complicated set of misunderstandings and faulty assumptions that led to some bizarre, but technically correct behavior.

The moral of the story is that if you’re dealing with strings, you need to know how they’re encoded. Most developers keep string encodings out of sight and out of mind, but when things go wrong they can lead to time-consuming and confusing bugs.

Next time you’re in Memphis, be sure to stop by RUM BOOGIE CAFÉ,"D!