blog/src/content/posts/2023-07-04-learn-by-implementing-nginx.md
2023-08-30 22:47:22 -05:00

title: Learn by implementing Nginx's reverse proxy
date: 2023-07-04
tags: web, learn-by-implementing
draft: true

Nginx is a powerful tool but also comes with many knobs, which may make it intimidating for lots of newcomers. In this post, let's rewrite its core functionality using a few lines of code to understand what it's doing.

To begin, what's a reverse proxy?

  • A proxy (or forward proxy) lets you reach a site through a gateway, typically because your client sits behind an intercepting firewall
  • A reverse proxy lets others reach a site through a gateway, typically because the server hosting that site sits behind a firewall

As a middleman, it sees every request and can inspect the headers and body, which means it can:

  • Serve multiple domains on the same server / port
  • Wrap unencrypted services using HTTPS
  • Perform load balancing
  • Perform some basic routing
  • Apply authentication
  • Serve raw files without a server program

I'm going to implement this using Deno.

💡 This is a literate document. I wrote a small utility to extract the code blocks out of markdown files, and it should produce a working example for this file. If you have the utility, then running the following should get you a copy of all the code extracted from this blog post:

markout --lang ts path/to/posts/2023-07-04-learn-by-implementing-nginx.md > program.ts

It can then be executed with Deno:

deno run --allow-net program.ts

Imports

import { serve } from "https://deno.land/std@0.192.0/http/mod.ts";
const PORT = 8314;

Deno implements an HTTP server for us. At a very high level, this means it starts listening for TCP connections and, once it receives one, reads the request headers and parses them. It then exposes methods for us to read the headers and decide how to receive the body. All we need to do is provide an async function that handles the request and returns a response. Something like this:

// @ts-ignore
async function handlerImpl1(request: Request): Promise<Response> {
  // code goes here...
}

// Later,
// serve(handlerImpl1, { port: PORT });

One of the primary uses of something like Nginx is virtual hosts. In networking, a host is the machine that runs the server program. With virtual hosts, the same machine can run multiple server programs behind a single address and port. This is possible because the client sends the domain it's trying to access in the HTTP request headers:

Host: example.com

Using this info, the server can decide to route the request differently on a per-domain basis. Something like SSH is not able to do this because nowhere during the handshake process does the client ever request a particular domain. You would have to wrap SSH with something else that's knowledgeable about that.
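To make the Host header a little less abstract, here is a minimal sketch of pulling it out of the raw text of an HTTP/1.1 request. The function name hostFromRawRequest and the sample request are made up for illustration; Deno does this parsing for us, so nothing later in the post depends on it.

```typescript
// Hypothetical helper: find the Host header in the raw head of an HTTP/1.1
// request. A real parser handles many more edge cases than this sketch.
function hostFromRawRequest(raw: string): string | null {
  // The first line is the request line ("GET / HTTP/1.1"); headers follow,
  // one per CRLF-terminated line.
  const headerLines = raw.split("\r\n").slice(1);
  for (const line of headerLines) {
    if (line === "") break; // a blank line marks the end of the headers
    const colon = line.indexOf(":");
    if (colon === -1) continue; // not a "Name: value" line, skip it
    // Header names are case-insensitive, so compare in lowercase.
    const name = line.slice(0, colon).trim().toLowerCase();
    if (name === "host") return line.slice(colon + 1).trim();
  }
  return null;
}

const sampleRequest =
  "GET /index.html HTTP/1.1\r\nHost: example.com\r\nAccept: */*\r\n\r\n";
console.log(hostFromRawRequest(sampleRequest)); // example.com
```

Only the first colon splits the name from the value, so a Host that includes a port (like localhost:8314) comes back intact.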

In our reverse-proxy example, we want to forward the request internally to a different server, and then serve the response back to the client transparently, so it never realizes it went through a middleman!

So let's say we get some kind of config from the server admin, saying where to send each request. It looks like this:

interface Config1 {
  /** An object mapping a particular domain to a destination URL */
  [domain: string]: string;
}

Let's wrap our function with another function, where we can take in the config and make it accessible to the handler.

Why?

serve here is what's called a higher-order function. This means that rather than passing it just data, we're passing it a function, which it stores and calls whenever it decides to. A common example of a higher-order function is Array.prototype.map, which takes a function and applies it to every element of the array.
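As a quick illustration of the idea (not part of the proxy itself), here is map being handed a function. The names double and doubled are just for this example.

```typescript
// We hand `double` to `map` as a value; `map` decides when and how often to
// call it (once per element).
function double(n: number): number {
  return n * 2;
}

const doubled = [1, 2, 3].map(double);
console.log(doubled.join(",")); // 2,4,6
```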

So since serve is calling our handler, we cannot change its signature. That's because in order to change its signature, we have to change where it's called, which is inside the Deno standard library.

Fortunately, functions capture variables (like config) from outside of their scope, and when we pass it to serve, it retains those captured variables.
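Here is a small sketch of that capturing behavior, with the network taken out of the picture. callHandler is a hypothetical stand-in for serve, and mkLookup and Routes are names invented for this example.

```typescript
type Routes = Record<string, string>;

// Stand-in for `serve`: it only knows the (host) => string signature and has
// no idea that a config exists.
function callHandler(handler: (host: string) => string): string {
  return handler("example.com");
}

function mkLookup(routes: Routes) {
  // The returned function "closes over" `routes`, so it can still read it
  // after mkLookup has returned.
  return function (host: string): string {
    const destination: string | undefined = routes[host];
    return destination === undefined ? "no route" : destination;
  };
}

const lookup = mkLookup({ "example.com": "http://127.0.0.1:3000" });
console.log(callHandler(lookup)); // http://127.0.0.1:3000
```

The signature that callHandler sees never changed, yet the handler still had access to the routes.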

For an implementation like this, you don't actually need to wrap it in another function like mkHandler2, but I'm doing it here to make it easier to split the code into pieces that fit the prose of the blog post. You could just as well define it like this:

const config = { ... };
const handler = async function(request: Request): Promise<Response> {
  // code goes here...
};
serve(handler, { port: PORT });

function mkHandler2(config: Config1) {
  // @ts-ignore
  return async function (request: Request): Promise<Response> {
    // code goes here...
  };
}

I'm going to write this in the most straightforward way possible, ignoring null cases. Obviously in a real implementation, you would want to do error checking and recovery (since the reverse proxy server must never crash, right?).

function mkHandler3(config: Config1) {
  return async function (request: Request): Promise<Response> {
    // Look for the host header
    const hostHeader = request.headers.get("Host") as string;

    // Look it up in our config
    const proxyDestinationPrefix = config[hostHeader] as string;

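    // Let's fetch the destination and return it! Keep the query string too,
    // otherwise /search?q=deno would be proxied as plain /search.
    const requestUrl = new URL(request.url);
    const fullUrl = proxyDestinationPrefix + requestUrl.pathname + requestUrl.search;
    return await fetch(fullUrl);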
  };
}
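For comparison, here is a rough sketch of what handling those null cases might look like. The names ProxyConfig, resolveTarget and mkSafeHandler are hypothetical, and the choice of 404 and 502 as status codes is my assumption about what a reasonable proxy would return, not something Nginx dictates. The fetchImpl parameter is only there so the handler can be exercised without touching the network.

```typescript
// Same shape as Config1: Host header value -> destination URL prefix.
type ProxyConfig = Record<string, string>;

// Pure helper: work out where a request should go, or null if we can't tell.
function resolveTarget(
  config: ProxyConfig,
  hostHeader: string | null,
  requestUrl: string,
): string | null {
  if (hostHeader === null) return null;
  // Only accept keys the admin actually wrote, not inherited ones such as
  // "constructor" that every plain object carries around.
  if (!Object.prototype.hasOwnProperty.call(config, hostHeader)) return null;
  const url = new URL(requestUrl);
  return config[hostHeader] + url.pathname + url.search;
}

// Hypothetical hardened variant of mkHandler3.
function mkSafeHandler(config: ProxyConfig, fetchImpl: typeof fetch = fetch) {
  return async function (request: Request): Promise<Response> {
    const target = resolveTarget(
      config,
      request.headers.get("Host"),
      request.url,
    );
    if (target === null) {
      // Missing Host header, or a host we were never told about.
      return new Response("Unknown host\n", { status: 404 });
    }
    try {
      return await fetchImpl(target);
    } catch {
      // Destination unreachable, DNS failure, bad URL in the config, ...
      return new Response("Bad gateway\n", { status: 502 });
    }
  };
}
```

It still only forwards the path and query string. The method, request headers and body are dropped, just like in mkHandler3.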

Time to run it!

const config = {
  "localhost:8314": "https://example.com",
  "not-example.com": "https://text.npr.org",
};
const handler = mkHandler3(config);
serve(handler, { port: PORT });

First, try just making a request to localhost:8314. This is as easy as:

curl http://localhost:8314

This should load example.com, like we defined in our config. We did a simple proxy, fetched the resource, and sent it back to the user. If there was a resource that was not previously available to the public, but the reverse-proxy could reach it, the public can now access it. Our reverse proxy feature is done.

Next, try making a request to not-example.com. We'll use a curl option, --connect-to, that makes curl open its connection to localhost:8314 instead of resolving not-example.com, while still sending Host: not-example.com in the request. This trick is just for demonstration purposes, but it emulates a real-world need to have multiple domains pointed to the same IP (for example, for serving the redirect from company.com to www.company.com on the same server).

curl --connect-to not-example.com:80:localhost:8314 http://not-example.com

This should produce the HTML for NPR's text-only page. This demonstrates that we can serve different content depending on the site that's requested.

Conclusion

This is a very bare-bones implementation that leaves out a lot of detail. To begin with, none of the errors are handled, so if a rogue request comes in (say, one with a Host that isn't in the config), it could take down the server.

Those improvements would be necessary for a production-ready implementation, but not interesting for a blog post. For a non-exhaustive list of bigger improvements, consider:

  • Can we have the web server remove a prefix from the requested URL's path name if we want to serve a website from a non-root path?
  • Can we allow the reverse-proxy to reject requests directly by IP?
  • Can we wrap non-HTTP content?
  • What are some performance improvements we could make?
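As a taste of the first question, prefix removal could start from something like the sketch below. stripPrefix is a hypothetical helper, and it assumes the prefix is written with a leading slash and no trailing slash (for example /blog).

```typescript
// Returns the path with `prefix` removed, or null when the path does not live
// under that prefix at all.
function stripPrefix(pathname: string, prefix: string): string | null {
  if (pathname === prefix) return "/"; // "/blog" is the root of the mounted site
  // Require a "/" right after the prefix so "/blogger" doesn't match "/blog".
  if (pathname.startsWith(prefix + "/")) return pathname.slice(prefix.length);
  return null;
}

console.log(stripPrefix("/blog/posts/1", "/blog")); // /posts/1
```

The result could then replace requestUrl.pathname when building the destination URL.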

In a potential future blog post, I'll explore some of these topics.