I’ve been re-reading one of my favorite programming books, The Little Schemer, and I had an epiphany in one of the final chapters.

Among many other mind-bending topics, the book goes through a derivation of the Y combinator. In the past, any technical explanations of the Y combinator have left me with a brain folded into a pretzel and a throbbing headache. This time, however, something clicked.

With the help of the quirky guiding dialog of The Little Schemer, I feel like I finally understand the Y combinator.

I want to share that understanding with you, my Elixir-loving friend!

In this article we’ll briefly go over what exactly the Y combinator is, and then dive deep into how we can derive it using anonymous Elixir functions. This is definitely a brain-stretcher of an article, and I hope you find it as interesting as I do.

## But Why? (Pun Intended)

Why is it important that you know about the Y combinator? Honestly, unless you’re a professional computer scientist, it’s not. When it comes to day-to-day web development, the applicable benefits of knowing what the Y combinator is and how it works are practically nonexistent.

Putting practicality aside, the Y combinator is one of the most interesting objects in all of computer science.

In the most basic sense, the Y combinator allows computer scientists to define recursive functions in Lambda Calculus, a language whose functions can’t directly refer to themselves. The way in which it does this is one of the most clever and intricately beautiful solutions to any programming problem I’ve ever seen.

For me, the Y combinator is something like the Mandelbrot set, or Conway’s Game of Life. It’s something to be marveled at; something to be revered.

## Deriving An Example

Let’s start our dive into the world of the Y combinator by coming up with a recursive function we want to build without using recursion!

We’ll use a simple classic:

``````
def length([]),    do: 0
def length([h|t]), do: 1 + length(t)
``````

Defined with `def`, `length` computes the length of a list:

``````
length([])     # 0
length([1])    # 1
length([1, 1]) # 2, etc...
``````

When `length` is passed an empty list, we return a length of `0`. When it’s passed a non-empty list, we return `1` plus the `length` of the tail of the list. We’re able to create this type of recursive definition because when defining functions using `def` and `defp`, Elixir is kind enough to allow the defined function to make direct references and calls to itself.

However, what if we were to try to write a recursive anonymous function?

``````
length = fn
  []    -> 0
  [h|t] -> 1 + length.(t)
end
``````

Here we’re binding `length` to an anonymous function. Within that anonymous function we attempt to recurse by calling `length`. Unfortunately, at this point `length` is not in scope, and so Elixir blows up with a compilation error:

``````
** (CompileError) iex:9: undefined function length/0
(stdlib) lists.erl:1354: :lists.mapfoldl/3
(stdlib) lists.erl:1355: :lists.mapfoldl/3
``````

Is it possible to write a recursive anonymous function? Put another way, can you write a recursive function without being able to directly refer to yourself?

Yes! Behold the power of the Y combinator!

## Down the Rabbit Hole

Before we start our deep dive, let’s pull back and simplify things a bit. While our recursive call to our anonymous `length` function doesn’t work, part of our `length` function is still correct.

Let’s pull out the recursive piece to highlight what we mean:

``````
length = fn
  []    -> 0
  [h|t] -> 1 + (fn _ -> raise "oops" end).(t)
end
``````

Even with the recursive piece removed, we’re able to calculate the length of empty lists:

``````
length.([])   # 0
length.([1])  # oops
``````

We could add support for lists of length `1` by substituting the function that raises an exception (our “raising function”) with another identical copy of our `length` function:

``````
length = fn
  []    -> 0
  [h|t] -> 1 + (fn
    []    -> 0
    [h|t] -> 1 + (fn _ -> raise "oops" end).(t)
  end).(t)
end
``````

Now our function works for lists of length `0` and `1`, but blows up on lists two or more elements long:

``````
length.([])     # 0
length.([1])    # 1
length.([1, 2]) # oops
``````

We could add support for even longer lists by continuously replacing the raising function with additional copies of our `length` function:

``````
length = fn
  []    -> 0
  [h|t] -> 1 + (fn
    []    -> 0
    [h|t] -> 1 + (fn
      []    -> 0
      [h|t] -> 1 + (
        ...
      ).(t)
    end).(t)
  end).(t)
end
``````

While this works, it’s obviously infeasible. If we wanted to calculate the length of a one thousand element list, we’d need to chain at least one thousand function definitions together.

Not only that, but we’ll always have an upper bound using this technique. At some point, our chain will end and our raising function will throw an exception.

We need to do better.

## Don’t Repeat Yourself

If you look at the code above, you’ll notice lots of repetition. At each level of “recursion” we define an anonymous function that looks very similar to our original `length` function.

Every level is identical except for the function we’re calling into in the next level. Sometimes we call into another `length`-like function, but at the very end we call into our raising function.

We can cut down on all that repeated code by writing a function that makes each level of this `length` stack. Our `make_length` function takes the bit that changes in each level, the next function to call into, as an argument:

``````
make_length = fn
  length ->
    fn
      []    -> 0
      [h|t] -> 1 + length.(t)
    end
end
``````

Now we can use `make_length` to write a `length` function that works on empty lists:

``````
make_length.(fn _ -> raise "oops" end)
``````

And similarly, we can use `make_length` to write a `length` function that works on lists of length zero or one:

``````
make_length.(make_length.(fn _ -> raise "oops" end))
``````

This works because we create our zero-length function and pass it into `make_length`, which wraps it in another application of our `length`-like function.

We can even use Elixir pipes to make things a bit cleaner:

``````
(fn _ -> raise "oops" end)
|> make_length.()
|> make_length.()
``````

Without using the `make_length` binding, we could define a totally anonymous version of `length` that works on lists up to two elements long:

``````
(fn
  make_length ->
    (fn _ -> raise "oops" end)
    |> make_length.() # support for length 0
    |> make_length.() # support for length 1
    |> make_length.() # support for length 2
end).(fn
  length ->
    fn
      []    -> 0
      [h|t] -> 1 + length.(t)
    end
end)
``````

Now we’re getting somewhere!

While we’ve managed to cut out most of the repetition of our original solution, we’re still facing the problem of being restricted to a finite number of recursions.

## Building the Road as we Go

In our current solution, we need to define, in advance, the maximum list length our anonymous `length` function will support.

Every repeated application of `make_length` adds another layer to our growing stack of functions. But if we’re passed a long enough list that chews through that stack, the raising function is called and our length calculation grinds to a screeching halt.

To solve the last piece of this puzzle, we need to think very carefully about what it is we’re doing…

Every time our `length`-like function is called it decides whether or not it needs to recurse. If it does, it calls the `length` function tied to its closure. If it doesn’t, it simply returns a value which trickles its way back up the call stack.

The danger is that the `length` function might actually point to our raising function.

Instead of throwing an exception when we hit that last function in the stack, we’d really like to add another layer of our `length`-like function and call that instead. Really, we only ever need two functions in our stack. The current function, and the next function we want to call into.

Each layer of our `length`-like function is created by passing `make_length` into `make_length`. Let’s simplify our stack creation by pulling out our raising function and simply returning the next layer of our `length`-like function:

``````
(fn
  make_length ->
    make_length.(make_length)
end).(fn
  length ->
    fn
      []    -> 0
      [h|t] -> 1 + length.(t)
    end
end)
``````

At this point we could even rename the `length` argument in our second anonymous function to `make_length` to remind ourselves of what’s now being passed into it:

``````
(fn
  make_length ->
    make_length.(make_length)
end).(fn
  make_length -> # We're being passed make_length
    fn
      []    -> 0
      [h|t] -> 1 + make_length.(t)
    end
end)
``````

But there’s a problem here.

Now we’re trying to pass the tail of our list (`t`) into `make_length`. This doesn’t make much sense. The `make_length` function expects a function as an argument, and returns a function that accepts a list.

What could we pass into `make_length` to build the next layer of our recursion? Take a minute, and remember that at this point `make_length` represents the entirety of our second anonymous function:

``````
(fn
  make_length ->
    fn
      []    -> 0
      [h|t] -> 1 + make_length.(t)
    end
end)
``````

Well, couldn’t we just pass `make_length` into `make_length`? Yes!

Passing `make_length` into `make_length` returns another layer of our `length`-like function that accepts a list as an argument:

``````
(fn
  make_length ->
    make_length.(make_length)
end).(fn
  make_length ->
    fn
      []    -> 0
      [h|t] -> 1 + (make_length.(make_length)).(t)
    end
end)
``````

This new layer of our `length`-like function still has access to `make_length`, so if it decides to recurse, it has everything it needs to build and apply the next layer of `length`. We don’t need to build our entire tower of `length` functions up-front. We can construct each layer as we go!

Amazing… Does your brain hurt yet?

> Each generation would be both content and vessel, an echo in a self-sustaining reverberation. — Ted Chiang, Stories of Your Life and Others

## Reclaiming Our Length Function

As it stands now, our solution is fully functional. We can calculate the length of any list, no matter its size:

``````
(fn
  make_length ->
    make_length.(make_length)
end).(fn
  make_length ->
    fn
      []    -> 0
      [h|t] -> 1 + (make_length.(make_length)).(t)
    end
end).([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...])
``````
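To convince ourselves it really works, we can bind the result and try it on a couple of concrete lists (a quick sketch; the example lists are arbitrary):

```elixir
# The same self-applying construction, bound to a variable so we can call it.
length = (fn
  make_length ->
    make_length.(make_length)
end).(fn
  make_length ->
    fn
      []     -> 0
      [_h|t] -> 1 + (make_length.(make_length)).(t)
    end
end)

IO.inspect(length.([]))           # 0
IO.inspect(length.([:a, :b, :c])) # 3
```

Each recursive step builds the next layer of `length` on demand, so no list is too long.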

The only problem with this solution is that we’ve lost sight of the structure and shape of our original function:

``````
fn
  []    -> 0
  [h|t] -> 1 + length.(t)
end
``````

Instead, we’ve had to modify that function to include strange implementation-specific calls to `make_length`:

``````
fn
  []    -> 0
  [h|t] -> 1 + (make_length.(make_length)).(t)
end
``````

If we could reclaim the original shape of our `length` function, we could theoretically factor it out and create a generic utility that could be used to make any function recursive without modification.

The first step towards our goal is refactoring the code that makes our next application of `length` (`make_length.(make_length)`) into a function of its own:

``````
fn
  t ->
    (make_length.(make_length)).(t)
end
``````

This function accepts `t` (the tail of our list) as an argument, builds the next layer of `length`, and applies it to `t`.

This wrapper function actually behaves exactly like `length`! We can write another closure that accepts `length` as an argument and pass this new function into it. The result is slightly longer, but we’ve managed to reclaim the structure of our original `length` function:

``````
(fn
  make_length ->
    make_length.(make_length)
end).(fn
  make_length ->
    (fn
      length ->
        fn
          []    -> 0
          [h|t] -> 1 + length.(t)
        end
    end).(fn
      t ->
        (make_length.(make_length)).(t)
    end)
end)
``````

Our reclaimed `length` function doesn’t make any references to `make_length`, so we’re free to untangle it from this rat’s nest of anonymous functions.

Let’s wrap the entire block in one more closure that accepts our `length` function as a parameter:

``````
(fn
  length ->
    (fn
      make_length ->
        make_length.(make_length)
    end).(fn
      make_length ->
        length.(fn
          t ->
            (make_length.(make_length)).(t)
        end)
    end)
end).(fn
  length ->
    fn
      []    -> 0
      [h|t] -> 1 + length.(t)
    end
end)
``````

With `length` factored out, let’s refactor the meat of our solution to use slightly more traditional, albeit significantly less readable argument names:

``````
y = fn
  f ->
    (fn
      x ->
        x.(x)
    end).(fn
      x ->
        f.(fn
          t ->
            (x.(x)).(t)
        end)
    end)
end
``````

What’s especially amazing is that we can use this helper function, `y`, to make any function recursive! Here’s a recursive function that calculates the Nth number in the Fibonacci series:

``````
y.(fn
  fib ->
    fn
      0 -> 0
      1 -> 1
      n -> fib.(n - 1) + fib.(n - 2)
    end
end)
``````
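Binding the result of `y.(...)` gives us a perfectly ordinary function we can call, which we can sanity-check against a few known Fibonacci numbers:

```elixir
# Our derived combinator from above.
y = fn
  f ->
    (fn x -> x.(x) end).(fn
      x ->
        f.(fn t -> (x.(x)).(t) end)
    end)
end

# An anonymous, "recursive" Fibonacci function.
fib = y.(fn
  fib ->
    fn
      0 -> 0
      1 -> 1
      n -> fib.(n - 1) + fib.(n - 2)
    end
end)

IO.inspect(fib.(0))  # 0
IO.inspect(fib.(10)) # 55
```

Note that `fib` never refers to itself by name; the combinator threads the self-reference through for us.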

As you may have guessed, this thing that we’ve come up with has a name. It’s the Y combinator! It’s my opinion that the Y combinator is one of the most beautiful structures in all of computer science.

## Final Thoughts and More Resources

I hope this article helped you understand the Y combinator in all of its recursive, fractal glory.

To be honest, I still find the Y combinator to be a difficult thing to understand. I understand each step of the derivation, but when confronted with the Y combinator in its condensed, Lambda Calculus form, I find myself unable to succinctly sum up what it does in plain English:

λf. (λx. f (x x))(λx. f (x x))

It’s all Greek to me.

If you need more help wrapping your brain around this slippery idea, I highly recommend you check out The Little Schemer by Daniel P. Friedman and Matthias Felleisen. This book opened my eyes to the beauty of the Y combinator and many other equally mind-blowing topics. Highly recommended.

If books aren’t your thing, check out this gentle introduction to the Y combinator on Computerphile, or read through this in-depth article to deepen your understanding.

Over the years my brain has become wired to think in Vim when it comes to editing text. For example, when I want to change an `if` guard, I don’t want to `click`, `<Del><Del><Del><Del><Del><Del>`. Instead I simply change (`c`) inside (`i`) parentheses `(`.

The Vim Way™ of doing things has become so completely ingrained into me that I find myself reaching for Vim when it comes to any and every task involving editing text.

I’ve even incorporated Vim into my writing workflow. Check out how I use Vim scripts to format markdown and Git logs into the articles on this site.

## Formatting Markdown

I do the majority of my writing in Ulysses. While it’s a fantastic writing application, it’s very stubborn about its Markdown formatting. Combined with my even more stubborn Jekyll setup, I usually have to extensively massage every article before it’s formatted well enough to post.

This usually means doing things like decorating inline code blocks with language declarations, wrapping paragraph-style code blocks in `<pre><code>` tags, and other general formatting changes.

Originally, I did all of this work manually. Needless to say, that was a very time consuming and error-prone process.

While researching ways of automating the process, I had a recurring thought:

> I know how to make all of these changes in Vim. Can’t I just script them?

It turns out I could! A Vim “script” can be written by putting Vim commands directly into a file and then running that script against the file you want to modify. Armed with the `:global` and `:normal` Vim commands, anything is possible.

As a quick introduction to this idea, imagine we have a script file called `script.vim`:

``````
:%s/cat/dog/g
``````

And now imagine we have a file we want to modify:

``````
I love cats!
Don't you love cats?
``````

Our script will replace all instances of `cat` with `dog`. We can run this script against our file (saved here as `text`) by executing the following command and saving the resulting Vim buffer:

``````
vim -s script.vim text
``````

Now that we have the basics under our belt, let’s take things up a notch. Here’s the Vim script I use to transform the markdown produced by Ulysses into a format accepted by Jekyll:

``````
:%s/\t/    /g
:%s/’/'/g
:%s/“/"/g
:%s/”/"/g
:g/^\n    /exe "normal! o<pre class='language-vim'><code class='language-vim'>\<Esc>"
:g/pre class/exe "normal! }O</code></pre>"
:g/pre class/exe "normal! V}2<"
:%s/`.\{-}`{:.language-vim}/\0{:.language-vim}/g
``````

The first line replaces all tab characters with spaces. The next three lines replace Unicode quotes with their standard ASCII equivalents.

The fifth line is especially interesting. It matches on all lines whose next line starts with four spaces. It goes into normal mode, opens a line below the matched line, and inserts my `<pre ...><code ...>` tags.

The next line in the script finds where each code block ends and closes those tags.

If this script is saved in a file called `clean-post.vim`, I can run it on a markdown file with the following command:

``````
vim -s ./clean-post.vim ./_posts/post.markdown
``````

As an added benefit, I can review the results of the transformation in Vim before saving the changes.

## Formatting Git Logs

Similarly, I have a script dedicated to formatting my literate commit posts. This script is different.

Instead of formatting Markdown generated by Ulysses, it formats an entire `git log` (specifically `git log --reverse --full-diff --unified=1 -p .`), into a human readable article:

``````
:g/^Author:/d
:g/^Date:/d
:g/^commit/exe "normal! dwdwjjwwi[\<Esc>A](/commit/\<Esc>pA)\<Esc>"
:g/^---/d
:g/^index/d
:g/^diff/d
:g/^new file/d
:g/^\\ No newline/d
:g/^@@.*@@$/d
:g/^@@/exe "normal! dt@..dwc$ ...\<Esc>"
:g/^+++/exe "normal! dwd2li\<CR><pre class='language-vimDiff'><p class='information'>\<Esc>"
:g/pre class/exe "normal! A</p><code class='language-vimDiff'>\<Esc>"
:g/^    /normal! 0d4l
:g/^ # \[/normal! dl
:g/^\n    /exe "normal! o<pre class='language-vim'><code class='language-vim'>\<Esc>V}2<"
:g/pre class/exe "normal! }O</code></pre>"
gg:%s/`.\{-}`{:.language-vim}/\0{:.language-vim}/g
``````

While this script seems intimidating, the majority of the work it’s doing is simple string replacements. Combined with some fancy work done in `normal` mode, it transforms the output of a full Git log into a beautifully formatted markdown article.

## Final Thoughts

Vim scripts are not pretty by any means, but they get the job done. I continually find myself amazed by the power and utility of Vim when it comes to nearly every aspect of text editing.

If you use Vim and you find yourself repeatedly making similar edits to files, I encourage you to consider scripting those edits. We should always be striving to improve our process, and writing Vim scripts is an incredibly powerful tool that we often overlook.

I’ve been cooking up a side project recently that involves crawling through a domain, searching for links to specific websites.

While I’m keeping the details of the project shrouded in mystery for now, building out a web crawler using Elixir sounds like a fantastic learning experience.

Let’s roll up our sleeves and dig into it!

## Let’s Think About What We Want

Before we start writing code, it’s important to think about what we want to build.

We’ll kick off the crawling process by handing our web crawler a starting URL. The crawler will fetch the page at that URL and look for links (`<a>` tags with `href` attributes) to other pages.

If the crawler finds a link to a page on the same domain, it’ll repeat the crawling process on that new page. This second crawl is considered one level “deeper” than the first. Our crawler should have a configurable maximum depth.

Once the crawler has traversed all pages on the given domain up to the specified depth, it will return a list of all internal and external URLs that the crawled pages link to.

To keep things efficient, let’s try to parallelize the crawling process as much as possible.

## Hello, Crawler

Let’s start off our crawler project by creating a new Elixir project:

``````
mix new hello_crawler
``````

While we’re at it, let’s pull in two dependencies we’ll be using in this project: HTTPoison, a fantastic HTTP client for Elixir, and Floki, a simple HTML parser.

We’ll add them to our `mix.exs` file:

``````
defp deps do
  [
    {:httpoison, "~> 0.13"},
    {:floki, "~> 0.18.0"}
  ]
end
``````

And pull them into our project by running a Mix command from our shell:

``````
mix deps.get
``````

Now that we’ve laid out the basic structure of our project, let’s start writing code!

## Crawling a Page

Based on the description given above, we know that we’ll need some function (or set of functions) that takes in a starting URL, fetches the page, and looks for links.

Let’s start sketching out that function:

``````
def get_links(url) do
  [url]
end
``````

So far our function just returns a list holding the URL that was passed into it. That’s a start.

We’ll use `HTTPoison` to fetch the page at that URL (being sure to specify that we want to follow `301` redirects), and pull the resulting HTML out of the `body` field of the response:

``````
with {:ok, %{body: body}} <- HTTPoison.get(url, [], follow_redirect: true) do
  [url]
else
  _ -> [url]
end
``````

Notice that we’re using a `with` block and pattern matching on a successful call to `HTTPoison.get`. If a failure happens anywhere in our process, we abandon ship and return the current `url` in a list.

Now we can use `Floki` to search for any `<a>` tags within our HTML and grab their `href` attributes:

``````
with {:ok, %{body: body}} <- HTTPoison.get(url, [], [follow_redirect: true]),
     tags                 <- Floki.find(body, "a"),
     hrefs                <- Floki.attribute(tags, "href") do
  [url | hrefs]
else
  _ -> [url]
end
``````

Lastly, let’s clean up our code by replacing our `with` block with a new `handle_response` function with multiple function heads to handle success and failure:

``````
def handle_response({:ok, %{body: body}}, url) do
  [url | body
         |> Floki.find("a")
         |> Floki.attribute("href")]
end

def handle_response(_response, url) do
  [url]
end

def get_links(url) do
  options = [follow_redirect: true]
  url
  |> HTTPoison.get([], options)
  |> handle_response(url)
end
``````

Hello, Crawler.

This step is purely a stylistic choice. I find `with` blocks helpful for sketching out solutions, but prefer the readability of branching through pattern matching.

Running our `get_links` function against a familiar site should return a handful of internal and external links:

``````
["http://www.east5th.co/", "/", "/blog/", "/our-work/", "/services/", "/blog/",
"/our-work/", "/services/",
"https://www.cigital.com/", "http://www.tapestrysolutions.com/",
"http://www.surgeforward.com/", "/blog/", "/our-work/", "/services/"]
``````

Quick, someone take a picture; it’s crawling!

## Crawling Deeper

While it’s great that we can crawl a single page, we need to go deeper. We want to scrape links from our starting page and recursively scrape any other pages we’re linked to.

The most naive way of accomplishing this would be to recurse on `get_links` for every new list of links we find:

``````
[url | body
       |> Floki.find("a")
       |> Floki.attribute("href")
       |> Enum.map(&get_links/1) # recurse on every link we find
       |> List.flatten]
``````

While this works conceptually, it has a few problems:

• Relative URLs, like `/services`, don’t resolve correctly in our call to `HTTPoison.get`.
• A page that links to itself, or a set of pages that form a loop, will cause our crawler to enter an infinite loop.
• We’re not restricting crawling to our original host.
• We’re not counting depth, or enforcing any depth limit.

Let’s address each of these issues.

## Handling Relative URLs

Many links within a page are relative; they don’t specify all of the necessary information to form a full URL.

For example, on `http://www.east5th.co/`, there’s a link to `/blog/`. Passing `"/blog/"` into `HTTPoison.get` will return an error, as `HTTPoison` doesn’t know the context for this relative address.

We need some way of transforming these relative links into full-blown URLs given the context of the page we’re currently crawling. Thankfully, Elixir’s standard library ships with the fantastic `URI` module that can help us do just that!

Let’s use `URI.merge` and `to_string` to transform all of the links scraped by our crawler into well-formed URLs that can be understood by `HTTPoison.get`:

``````
def handle_response({:ok, %{body: body}}, url) do
  [url | body
         |> Floki.find("a")
         |> Floki.attribute("href")
         |> Enum.map(&URI.merge(url, &1)) # Merge our URLs
         |> Enum.map(&to_string/1)        # Convert the merged URL to a string
         |> List.flatten]
end
``````

Now our link to `/blog/` is transformed into `http://www.east5th.co/blog/` before being passed into `get_links` and subsequently `HTTPoison.get`.
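A quick sketch of `URI.merge` at work in `iex`, using the same example URLs (note that absolute links pass through unchanged):

```elixir
# A relative link, resolved against the page being crawled:
relative =
  "http://www.east5th.co/"
  |> URI.merge("/blog/")
  |> to_string()

IO.inspect(relative) # "http://www.east5th.co/blog/"

# An absolute link is returned as-is:
absolute =
  "http://www.east5th.co/"
  |> URI.merge("https://www.cigital.com/")
  |> to_string()

IO.inspect(absolute) # "https://www.cigital.com/"
```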

Perfect.

## Preventing Loops

While our crawler is correctly resolving relative links, this leads directly to our next problem: our crawler can get trapped in loops.

The first link on `http://www.east5th.co/` is to `/`. This relative link is translated into `http://www.east5th.co/`, which is once again passed into `get_links`, causing an infinite loop.

To prevent looping in our crawler, we’ll need to maintain a list of all of the pages we’ve already crawled. Before we recursively crawl a new link, we need to verify that it hasn’t been crawled previously.

We’ll start by adding a new, private, function head for `get_links` that accepts a `path` argument. The `path` argument holds all of the URLs we’ve visited on our way to the current URL.

``````
defp get_links(url, path) do
  options = [follow_redirect: true]
  url
  |> HTTPoison.get([], options)
  |> handle_response(url, path)
end
``````

We’ll call our new private `get_links` function from our public function head:

``````
def get_links(url) do
  get_links(url, []) # path starts out empty
end
``````

Notice that our `path` starts out as an empty list.

While we’re refactoring `get_links`, let’s take a detour and add a third argument called `context`:

``````
defp get_links(url, path, context) do
  url
  |> HTTPoison.get(context.headers, context.options)
  |> handle_response(url, path, context)
end
``````

The `context` argument is a map that lets us cleanly pass around configuration values without having to drastically increase the size of our function signatures. In this case, we’re passing in the `headers` and `options` used by `HTTPoison.get`.

To populate `context`, let’s change our public `get_links` function to build up the map by merging user-provided options with defaults provided by the module:

``````
def get_links(url, opts \\ []) do
  context = %{
    headers: Keyword.get(opts, :headers, @default_headers),
    options: Keyword.get(opts, :options, @default_options)
  }
  get_links(url, [], context)
end
``````

If the user doesn’t provide a value for `headers`, or `options`, we’ll use the defaults specified in the `@default_headers` and `@default_options` module attributes:

``````
@default_headers []
@default_options [follow_redirect: true]
``````

This quick refactor will help keep things clean going down the road.

Whenever we crawl a URL, we’ll add that URL to our `path`, like a breadcrumb marking where we’ve been.


Next, we’ll filter out any links we find that we’ve already visited, and append the current `url` to `path` before recursing on `get_links`:

``````
[url | body
       |> Floki.find("a")
       |> Floki.attribute("href")
       |> Enum.map(&URI.merge(url, &1))
       |> Enum.map(&to_string/1)
       |> Enum.reject(&Enum.member?(path, &1)) # Avoid loops
       |> Enum.map(&get_links(&1, [&1 | path], context))
       |> List.flatten]
``````

It’s important to note that this doesn’t completely prevent the repeated fetching of the same page. It simply prevents the same page being refetched in a single recursive chain, or path.

Imagine if page A links to pages B and C. If page B or C link back to page A, page A will not be refetched. However, if both page B and page C link to page D, page D will be fetched twice.

We won’t worry about this problem in this article, but caching our calls to `HTTPoison.get` might be a good solution.
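One hedged sketch of that idea: a URL-keyed cache built on an `Agent`, where the fetching function is only invoked on a miss. The `PageCache` name and `fetch/2` interface are hypothetical, not part of `HelloCrawler`:

```elixir
defmodule PageCache do
  # Stores fetch results keyed by URL so repeated crawls of the same
  # page (e.g. page D reached via both B and C) hit the cache instead
  # of the network.
  def start_link do
    Agent.start_link(fn -> %{} end, name: __MODULE__)
  end

  def fetch(url, fetch_fun) do
    case Agent.get(__MODULE__, &Map.fetch(&1, url)) do
      {:ok, cached} ->
        cached

      :error ->
        result = fetch_fun.(url)
        Agent.update(__MODULE__, &Map.put(&1, url, result))
        result
    end
  end
end
```

With something like this in place, the crawler could call `PageCache.fetch(url, &HTTPoison.get(&1, context.headers, context.options))` rather than hitting the network directly.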

## Restricting Crawling to a Host

If we try to run our new and improved `HelloCrawler.get_links` function, we’ll notice that it takes a long time to run. In fact, in most cases, it’ll never return!

The problem is that we’re not limiting our crawler to crawl only pages within our starting point’s domain. If we crawl `http://www.east5th.co/`, we’ll eventually get linked to Google and Amazon, and from there, the crawling never ends.

We need to detect the host of the starting page we’re given, and restrict crawling to only that host.

Thankfully, the `URI` module once again comes to the rescue.

We can use `URI.parse` to pull out the `host` of our starting URL, and pass it into each call to `get_links` and `handle_response` via our `context` map:

``````
def get_links(url, opts \\ []) do
  url = URI.parse(url)
  context = %{
    ...
    host: url.host # Add our initial host to our context
  }
end
``````

Using the parsed `host`, we can check if the `url` passed into `get_links` is a crawlable URL:

``````
defp get_links(url, path, context) do
  if crawlable_url?(url, context) do # check before we crawl
    ...
  else
    [url]
  end
end
``````

The `crawlable_url?` function simply verifies that the host of the URL we’re attempting to crawl matches the host of our initial URL, passed in through our `context`:

``````
defp crawlable_url?(%{host: host}, %{host: initial}) when host == initial, do: true
defp crawlable_url?(_, _), do: false
``````

This guard prevents us from crawling external URLs.
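Given the article’s `crawlable_url?` clauses, here’s a small sketch of the behavior (the `HostCheck` module name is purely illustrative):

```elixir
defmodule HostCheck do
  # The same-host guard described above, wrapped in a module so we can call it.
  def crawlable_url?(%{host: host}, %{host: initial}) when host == initial, do: true
  def crawlable_url?(_, _), do: false
end

context = %{host: "www.east5th.co"}

# A page on our starting host is crawlable; an external one is not.
IO.inspect(HostCheck.crawlable_url?(URI.parse("http://www.east5th.co/blog/"), context)) # true
IO.inspect(HostCheck.crawlable_url?(URI.parse("https://www.cigital.com/"), context))    # false
```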

Because we passed in the result of `URI.parse` into `get_links`, our `url` is now a `URI` struct, rather than a string. We need to convert it back into a string before passing it into `HTTPoison.get` and `handle_response`:

``````
if crawlable_url?(url, context) do
  url
  |> to_string # convert our %URI to a string
  |> HTTPoison.get(context.headers, context.options)
  |> handle_response(path, url, context)
else
  [url]
end
``````

Additionally, you’ve probably noticed that our collected list of links is no longer a list of URL strings. Instead, it’s a list of `URI` structs.

After we finish crawling through our target site, let’s convert all of the resulting structs back into strings, and remove any duplicates:

``````
def get_links(url, opts \\ []) do
  ...
  |> Enum.map(&to_string/1) # convert URI structs to strings
  |> Enum.uniq # remove any duplicate urls
end
``````

We’re crawling deep now!

## Limiting Our Crawl Depth

Once again, if we let our crawler loose on the world, it’ll most likely take a long time to get back to us.

Because we’re not limiting how deep our crawler can traverse into our site, it’ll explore all possible paths that don’t involve retracing its steps. We need a way of telling it not to continue crawling once it’s already followed a certain number of links and reached a specified depth.

Due to the way we structured our solution, this is incredibly easy to implement!

All we need to do is add another guard in our `get_links` function that makes sure that the length of `path` hasn’t exceeded our desired depth:

``````
if continue_crawl?(path, context) and crawlable_url?(url, context) do
``````

The `continue_crawl?` function verifies that the length of `path` hasn’t grown past the `max_depth` value in our `context` map:

``````
defp continue_crawl?(path, %{max_depth: max}) when length(path) > max, do: false
defp continue_crawl?(_, _), do: true
``````
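To see the depth guard in action, here’s a tiny sketch (again wrapping the clauses in an illustrative module):

```elixir
defmodule DepthCheck do
  # Stop crawling once the breadcrumb path outgrows max_depth.
  def continue_crawl?(path, %{max_depth: max}) when length(path) > max, do: false
  def continue_crawl?(_, _), do: true
end

context = %{max_depth: 3}

IO.inspect(DepthCheck.continue_crawl?(["a", "b", "c"], context))      # true
IO.inspect(DepthCheck.continue_crawl?(["a", "b", "c", "d"], context)) # false
```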

In our public `get_links` function we can add `max_depth` to our `context` map, pulling the value from the user-provided `opts` or falling back to the `@default_max_depth` module attribute:

``````
@default_max_depth 3
``````
``````
context = %{
  ...
  max_depth: Keyword.get(opts, :max_depth, @default_max_depth)
}
``````

As with our other `opts`, the default `max_depth` can be overridden by passing a custom `max_depth` value into `get_links`:

``````
get_links("http://www.east5th.co/", max_depth: 5)
``````

If our crawler tries to crawl deeper than the maximum allowed depth, `continue_crawl?` returns `false` and `get_links` returns the `url` being crawled, preventing further recursion.

Simple and effective.

## Crawling in Parallel

While our crawler is fully functional at this point, it would be nice to improve upon it a bit. Instead of waiting for every path to be exhausted before crawling the next path, why can’t we crawl multiple paths in parallel?

Amazingly, parallelizing our web crawler is as simple as swapping our map over the links we find on a page with a parallelized map:

``````
|> Enum.map(&Task.async(fn -> get_links(&1, [&1 | path], context) end))
|> Enum.map(&Task.await/1)
``````

Instead of passing each scraped link into `get_links` and waiting for it to fully crawl every sub-path before moving onto the next link, we pass all of our links into asynchronous calls to `get_links`, and then wait for all of those asynchronous calls to return.

For every batch of crawling we do, we really only need to wait for the slowest page to be crawled, rather than waiting one by one for every page to be crawled.

Efficiency!

## Final Thoughts

It’s been a process, but we’ve managed to get our simple web crawler up and running.

While we accomplished everything we set out to do, our final result is still very much a bare bones web crawler. A more mature crawler would come equipped with more sophisticated scraping logic, rate limiting, caching, and more efficient optimizations.

That being said, I’m proud of what we’ve accomplished. Be sure to check out the entire `HelloCrawler` project on GitHub.

I’d like to thank Mischov for his incredibly helpful suggestions for improving this article and the underlying codebase. If you have any other suggestions, let me know!