Automatic CloudSQL Load Balancing for Kubernetes with Terraform, PGPool-II and Too Much Bash Scripting

TL;DR: just give me the Chart

What follows is a lot of words, including several digressions of questionable amusement and/or informative value. If you just want to kick the tires, here's the précis: this is a Kubernetes deployment of PGPool-II that automatically discovers a Google CloudSQL primary instance and any read replicas, configures round-robin load balancing of SELECT queries across the read replicas, and auto-updates itself as read replicas are created or deleted.

Skip down to the "Play along with the home game" section for quick install instructions.

Once upon a timeseries...

Oden Technologies is primarily in the business of ingesting, processing, storing and querying timeseries data for our customers: vectors of point-in-time measurements from each of our customers' machines, aggregated by line and factory.

But in addition to the timeseries data, we also store a fair bit of metadata: data about the data. Things like mappings of timeseries data keys to human-friendly identifiers, user ACLs, login session keys and so forth. To store and query such metadata, we reach for one of the oldest and best-understood tools in the contemporary tech landscape: an ACID-compliant relational database, in our case the venerable and battle-hardened PostgreSQL.

And of course like many contemporary data services companies, we leverage the product line of our cloud provider to deploy and manage our Postgres instances for us; in this case Google's CloudSQL for PostgreSQL, which lets us use Google's Cloud APIs to create, scale, back up, restore and replicate database instances, and provide deep observability of our database usage, all without hiring a single DBA.

Old Reliable: PGbouncer gets us through the early years

For the first several years of Oden's existence, the most pressing problem we had to solve was how to let our long-running and batch tasks running in Kubernetes access Postgres without potentially overwhelming it with connections. For this, we turned to probably the most popular open source product in this space: PgBouncer, a lightweight connection-pooling proxy for Postgres.

And PgBouncer was (and is) great! It does exactly what it says on the tin: it pools client connections in front of your Postgres server, and provides deep metrics about usage on a per-client and per-database level, which we exported to Google Cloud Monitoring (née Stackdriver) using InfluxData's swiss-army knife of metric translation, Telegraf.

You knew there was a "but": horizontal scaling over read replicas, or rather the lack thereof

Like most cloud-provided SQL products, CloudSQL offers one-touch creation of read replicas -- additional instances of the database that use PostgreSQL's native log-based replication to mirror changes to the primary database and therefore allow horizontal scaling of reads and to allow operations which lock tables on the primary to not block reads on the replicas.

Unfortunately, once you've created a CloudSQL read replica, you are somewhat on your own in terms of routing traffic to it:

  • Google's internal load balancing products do not allow for using CloudSQL instances as backends, so you can't put all of the replicas behind an Internal Forwarding Rule.

  • You can use a tool like terraform to automatically add instance IP addresses to a DNS round-robin, but using DNS for load balancing is what people refer to as an "anti-pattern" when management is listening, and distinctly less polite things when in trusted company. (The Kubernetes service documentation has a concise summary of why it's a bad idea, and the problems it notes are in no way limited to users of Kubernetes.)

  • The "internal" address of a CloudSQL instance is assigned dynamically via Google's private services access tool and there is no way to statically provision addresses: if you delete and then re-create a replica for any reason (for example: to restore a backup to the primary instance, which cannot be done while replicas exist), the new replicas will likely have different IP addresses than their predecessors.

The upshot of all of these issues is that if you want to spread the read load over multiple replicas, never mind automatically keep up with the topology of your primary/replica(s), you're going to have to figure out how to do it yourself.

[image: a sad cat]

When we initially began thinking that this was a problem we needed to solve, our first instinct was to try to use the tool we had at hand: pgbouncer. Pgbouncer itself does not have any facility for load-balancing queries, but a quick Google search revealed that some of our friends at Amazon had published a patch to PGbouncer which not only added dynamic query routing, but which allowed you to define routing rules in Python. This looked like just the ticket!

There was just one problem: the pgbouncer-rr patch was written in order to load-balance between read-only Amazon RedShift data warehouse instances. So it made no distinction between SELECT and any other sort of query, and had no awareness of transaction state, which meant that while it could potentially balance traffic between read replicas, it had no ability to direct traffic appropriately to a primary instance, and in fact might send different statements inside a transaction to different instances!

We could have potentially used pgbouncer-rr to load-balance only for a pool of read replicas while manually sending transactions and updates/deletes to the primary, but it would have required rewriting a large amount of application code to pick a different endpoint based on the type of SQL query it was making, which is exactly the sort of difficult grunt work that we prefer computers do for us.

But then a bit of spelunking on Github revealed that an incredibly awesome person had made a patch to the patch which added transaction state awareness to the pgbouncer-rr code! We were overjoyed! The only thing left to do was to figure out a bulletproof way to distinguish between non-mutating (SELECT) and potentially mutating (basically everything else) SQL statements outside of transactions, which turned out to be pretty easy: since the pgbouncer-rr patch used embedded Python to allow for user-defined routing patterns, we could just use pglast to parse each SQL statement and determine what type it was. It was brilliant! It was amazing! It totally worked! We had a much shorter version of this blog post ready to go and we were really looking forward to sharing it with you!

There was just one problem: it was 100% illegal. 😱

More perspicacious readers may already have asked themselves the question, "if this patch to pgbouncer is so great, why hasn't the main pgbouncer project adopted it?" (If you were one of them, please pat yourself on the back and then send us a copy of your resume.) And it turns out that the answer is simple: the pgbouncer-rr patch is released under the terms of the "Amazon Software License", which inside of a pile of boilerplate includes the following restriction:

3.3 Use Limitation. The Work and any derivative works thereof only may be
used or intended for use with the web services, computing platforms or
applications provided by Amazon.com, Inc. or its affiliates, including Amazon
Web Services, Inc.

It is literally a violation of Amazon's copyright to use the pgbouncer-rr patch or any derivative thereof to communicate with any non-Amazon product. You are technically in violation of the license if you test the patch against a local copy of Postgres! Presumably Amazon's lawyers will be indulgent of such misbehavior if you are already a paying user of Amazon RDS, RedShift or Aurora, but it seems unlikely they would approve of attempting to use it to improve the experience of working with a different cloud provider.

So that was a bummer. Back to the drawing board.

An elegant weapon for a more civilized age: PGPool-II, the good and the bad

One tool that we'd noticed while originally doing the research for this project was one that is hosted by the PostgreSQL Association itself: PGPool. PGPool has been around for a very long time: the first public release of v1 was in 2003, only six years after the initial release of Postgres itself.

PGPool in fact predates contemporary Postgres' native replication features, and indeed its original use case was not load balancing but creating HA clusters of Postgres servers by proxying all mutating queries to multiple backends. Later releases added load balancing and the ability to script automatic failover between primary and replica backend nodes and automatically recover failed nodes when they rejoined the cluster.

...all of which were impressive and useful features, but a bit of an odd fit for your average cloud-provider-managed SQL database scenario, wherein replication, failover and recovery are handled automatically by the provider, and where many of the default assumptions of a non-cloud architecture (such as "we control the IP addressing of the nodes" or "we use VRRP or equivalent layer-2 network tools to direct traffic" or even "we have shell-level access to the DB nodes") emphatically do not apply.

But despite that, PGPool is still in active use and development, it absolutely understood the concept of transactions and routing statements correctly between primary and replica nodes, and the documentation even provides examples for deploying it on Kubernetes. Those examples assumed that you would be running both PGPool and Postgres itself as Kubernetes deployments, but that just meant we were already halfway there.

So here was the challenge: deploy PGPool to k8s in such a way that it can discover CloudSQL backends automatically, and keep itself up to date with their replication topology. This sounds like exactly the sort of task that can be accomplished with a dodgy shell script and some templating, so let's go commit some devops!

But first...

Malkovich, Malkovich, Malkovich: A brief digression about CloudSQL instance naming

One of the more interesting lacunae of the Google CloudSQL API is that CloudSQL instance names cannot be reused for two weeks after an instance is deleted. As a result, a common pattern when creating CloudSQL instances is that the name will be in the form <name prefix>-<unique ID>, where the name prefix portion is the human-friendly name, but the unique ID is some string (a UNIX timestamp or just a few bytes of random characters) that is unlikely to collide with another instance.

In fact this pattern is so common that Terraform's documentation for its google_sql_database_instance resource type specifically demonstrates using the Terraform random_id resource to build the db instance name:

resource "random_id" "db_name_suffix" {
  byte_length = 4
}

resource "google_sql_database_instance" "postgres" {
  name             = "oden-metadata-${random_id.db_name_suffix.hex}"
  database_version = "POSTGRES_11"

  settings {
    tier = "db-f1-micro"
  }
}

Not only do we use this pattern when naming our CloudSQL instances at Oden, but we actually regularly rebuild (and thus rename) our instances in the process of doing monthly disaster recovery tests (soon to be the subject of an equally verbose blog post), which means that our PGPool automation has to be configured to discover a database instance that matches the name prefix: oden-metadata- in the example above.

Why is this important? Well, consider it dramatic foreshadowing for...

Hold my beer: scripting automatic topology updates

So what we need out of PGPool in a cloud deployment is that it (a) discover our primary DB instance, (b) discover any read replicas, and lastly and in some ways most importantly: (c) add and remove read replicas from load balancing dynamically, so that (for example) we can cope with a traffic surge by temporarily adding more SQL read capacity.

The first step, discovery, is in some ways the simplest one: Google provides the gcloud CLI tool, which will happily exercise the instances.list API for you, and which uses GCP's general filter language to narrow down results. As noted in the previous section, we won't know the full instance ID in advance: it will be in the form <name prefix>-<unique id>. So our first step is to discover a primary instance with that name prefix:

# loop until we find at least one instance whose ID starts with '${PRIMARY_INSTANCE_PREFIX}' in our current region
# and print the answer in the form "<instance-name>,<internal ip>"
until mapfile -t primary_instances < <(
  gcloud \
    --project "${PROJECT_ID}" \
    sql instances list \
      --filter "region:${REGION} AND name:${PRIMARY_INSTANCE_PREFIX} AND state:RUNNABLE AND instanceType:CLOUD_SQL_INSTANCE" \
      --format 'csv[no-heading](name,ip_addresses.filter("type:PRIVATE").*extract(ip_address).flatten())'); do
  log error "Could not successfully look up primary instance matching ${PRIMARY_INSTANCE_PREFIX}, sleeping 5s and re-looping"
  sleep 5
  continue
done

# the number of primary instances should == 1 (this check sits inside the same retry loop)
if [[ "${#primary_instances[@]}" -ne 1 ]]; then
  log error "${#primary_instances[@]} entries returned by primary lookup?! '${primary_instances[*]}' sleeping 5s and retrying."
  sleep 5
  continue
fi

Then once we have the instance ID and address of the primary instance, we go hunting for read replicas:

mapfile -t current_replicas < <(
  gcloud \
    --project "${PROJECT_ID}" \
    sql instances list \
      --sort-by serverCaCert.createTime \
      --filter "region:${REGION} AND masterInstanceName:${PROJECT_ID}:${primary_name} AND state:RUNNABLE" \
      --format 'csv[no-heading](name,ip_addresses.filter("type:PRIVATE").*extract(ip_address).flatten())' \
    )

Then with the names and IP addresses of both the primary and a set of replicas in hand, the next step is to build a pgpool.conf file. Since the discovery script is written in bash, we use a simple tool called envtpl that uses the golang text/template package to read template values from environment variables and fill out a template:

# fill out the replicas (if found) with some gnarly template logic:
#
# - for every env var in the form "replica_ip_N" (where N is hopefully an integer)
# - the IP address of the node is the $value of the env var
# - split the var name ($key) by underscores -- the 3rd element is the $node_id
# - add a backend with that node ID
# - set the backend_weight to 1 / (total number of replicas)
#
# Note that the reported weight in SHOW POOL_NODES may not necessarily agree with
# what we set here: pgpool never prunes down nodes, and will recalculate weights
# based on its view of available servers.  (Which is why we heartbeat.)
{{ $replicas := environment "replica_ip_" -}}
{{ $num_replicas := len $replicas -}}
{{ range $key, $value := $replicas -}}
{{ $parts := split "_" $key -}}
{{ $node_id := $parts._2 -}}
# replica {{ $node_id }}
backend_hostname{{ $node_id }} = '{{ $value }}'
backend_port{{ $node_id }} = 5432
backend_weight{{ $node_id }} = {{ divf 1.0 $num_replicas }}
backend_flag{{ $node_id }} = 'ALLOW_TO_FAILOVER'
{{ end -}}

Tying it together, the discovery script exports a set of shell environment variables such as:

export replica_ip_1="10.0.0.7"
export replica_ip_2="10.0.1.8"

...and the template logic turns that into two pgpool backend definitions:

# replica 1
backend_hostname1 = '10.0.0.7'
backend_port1 = 5432
backend_weight1 = 0.5
backend_flag1 = 'ALLOW_TO_FAILOVER'

# replica 2
backend_hostname2 = '10.0.1.8'
backend_port2 = 5432
backend_weight2 = 0.5
backend_flag2 = 'ALLOW_TO_FAILOVER'

And then finally the discovery script tells pgpool to reload its configuration by using the pcp_reload_config command, which communicates with pgpool over a unix domain socket.
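That reload step is small but worth showing as a sketch. Everything here — the socket directory, PCP port, user name, and .pcppass path — is an assumption about the deployment rather than something the chart guarantees:

```shell
# Ask a running pgpool to re-read pgpool.conf via the PCP interface.
# The socket directory, port, user and password-file path below are
# assumptions -- check pcp.conf and your own deployment for real values.
reload_pgpool() {
  # -h names the directory containing the PCP unix domain socket;
  # -w tells pcp_reload_config to read the password from PCPPASSFILE
  # (lines of "host:port:user:password") instead of prompting.
  PCPPASSFILE=/etc/pgpool/.pcppass \
    pcp_reload_config -h /var/run/pgpool -p 9898 -U pcpadmin -w
}
```

Since containers in a pod share a network namespace, the PCP port is also reachable over TCP on localhost if sharing a socket directory between containers is inconvenient.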

Tying it all together, we run pgpool itself and the discovery/updater script in two separate containers in the pod, with the /etc/pgpool config directory shared as a volume between them. The entrypoint script for the pgpool container waits for the discovery script to run successfully at least once before starting the pgpool process:

until [ -f /etc/pgpool/pgpool.conf ]; do
  log info "Waiting 5s for our config to be generated"
  sleep 5
done

log info "Starting pgpool"
/usr/bin/pgpool -m fast -n 2>&1

Sidebar: eagle-eyed readers will have noticed the presence of a log command in the above shell scripts that appears to take levels! That's also something that we've found handy:

log(){
  local level msg highlight printcmd printflags emoji
  printcmd="${PRINTCMD:-echo}"
  printflags="${PRINTFLAGS:-"-e"}"
  level="$(tr '[:lower:]' '[:upper:]' <<< "${@:1:1}")"
  msg=("${@:2}")
  case "${level}" in
    FATAL) highlight="${RED}"; emoji="💀 "  ;;
    ERR*) highlight="${RED}"; emoji="⛔️ " ;;
    WARN*) highlight="${ORANGE}"; emoji="⚠️  " ;;
    DEBUG) if [[ "${DEBUG}" != "true" ]]; then return; fi; highlight=""; emoji="🔎 " ;;
    *) highlight="${CYAN}"; emoji="" ;;
  esac
  "${printcmd}" "${printflags}" "${highlight}$(date --iso=seconds --utc) ${emoji}${level}: ${msg[*]}${RST}" 1>&2
  if [[ "${level}" == "FATAL" ]]; then
    if [[ "${-}" =~ 'i' ]] ; then return 1; else exit 1; fi
  fi
}

But wait, it's not quite so simple.

We were very proud of ourselves for putting all of the above together, and immediately went about testing it. We deployed it in front of a cluster with a primary and one read replica: it worked! SELECT queries went to the replica; UPDATE/INSERT/DELETE queries went to the primary. Then we used the CloudSQL console to create a second read replica, and within 2 minutes of it becoming available, it started handling 50% of the SELECT traffic, which was exactly what we wanted to see. Then we deleted the new replica, and with only a handful of connection errors seen by clients, the original replica resumed taking 100% of all SELECT traffic.

Then we used the CloudSQL console to raise the number of replicas back to two, at which point everything caught on fire: the new replica never took any traffic and the pgpool logs started filling up with errors.

So remember how we said that pgpool was really designed for an era when you had absolute control of layer-2 networking and host naming/addressing? And remember how those are exactly what you don't have any control over in a managed SQL cloud product?

It turns out that pgpool really has no provision for fully removing backends from a running instance, nor for changing the IP of a backend. So if, when we start out, we have two backends defined:

backend_hostname1 = '10.0.0.7'
backend_hostname2 = '10.0.1.8'

...and then a few minutes later we update the configuration file to say something like:

backend_hostname1 = '10.0.0.7'
backend_hostname2 = '10.5.4.3'

...internally, "backend 2" is not removed, nor is its hostname re-parsed or re-resolved. The only way to force that is to fully restart the pgpool process, which will of course interrupt all in-flight connections, which in a high-traffic scenario is something you'd probably rather not do.

But wait! A dodgy bash script got us into this, maybe a dodgy bash script can get us out of it?

The trick for handling this scenario turned out to be to simply never "remove" backends from the point of view of the pgpool.conf file. To accomplish this, we needed to make backend ID numbers persistently mapped to IP addresses: even if there were two read replicas in play in both the "before" and "after" state, the fact that the newer replica would have a different IP address than the one it replaced would mean that we would have three backends defined in pgpool.conf; just one of them would be "down" forever.

...but since the discovery script itself is just a dumb shell script that might at any time be restarted by the pod, that means we're going to have to maintain some state somewhere. But as it happens, we already have a bit of shared state between the two containers: the volume mounted as /etc/pgpool. So we put on our thinking caps and came up with this:

get_repl_pool_node_id(){
  local ip dir node_id
  ip="${1}"
  dir="${2:-"/etc/pgpool/states"}"
  # start the count at 1 because pgpool.conf makes no distinction
  # between primaries and replicas at this level
  node_id=1
  if [ -z "${ip}" ]; then
    log error "You must provide an IP address"
    return 1
  fi
  mkdir -p "${dir}"
  if [ -f "${dir}"/"${ip}" ]; then
    node_id="$(cat "${dir}/${ip}")"
    log debug "Returning existing node_id for ${ip}: ${node_id}"
  else
    if ls "${dir}"/* >/dev/null 2>/dev/null; then
      log debug "state files exist already"
      node_id="$(cat "${dir}"/* | sort -nr | head -1)"
      log debug "highest: ${node_id}"
      ((node_id++))
    else
      log debug "no statefiles exist yet"
    fi
    log debug "Generated new node_id for ${ip}: ${node_id}"
    echo "${node_id}" > "${dir}/${ip}"
  fi
  echo "${node_id}"
}

Translated into English: given the IP address 10.0.0.7, look for a file called /etc/pgpool/states/10.0.0.7. If that file exists, its contents will be a number and that number is our node ID. If that file does not exist, cat all the files in that directory: each of them contains a number. Sort them to find the highest one, add one to that and that's our new node ID: save it as /etc/pgpool/states/10.0.0.7.
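The scheme is small enough to demo in miniature. Here's a trimmed-down toy version of the function, pointed at a throwaway directory rather than /etc/pgpool/states, showing the sticky ID assignment:

```shell
# Toy version of get_repl_pool_node_id: same state-file logic, but
# against a throwaway directory instead of /etc/pgpool/states.
states="$(mktemp -d)"

next_node_id() {  # usage: next_node_id <ip-address>
  local ip="${1}" id=1
  if [ -f "${states}/${ip}" ]; then
    # seen this IP before: return its sticky ID
    cat "${states}/${ip}"
    return
  fi
  if ls "${states}"/* >/dev/null 2>&1; then
    # new IP: take the highest ID on record and add one
    id="$(cat "${states}"/* | sort -nr | head -1)"
    id=$((id + 1))
  fi
  echo "${id}" > "${states}/${ip}"
  echo "${id}"
}

next_node_id 10.0.0.7   # -> 1 (first replica ever seen)
next_node_id 10.0.1.8   # -> 2
next_node_id 10.0.0.7   # -> 1 again (IDs are sticky)
next_node_id 10.5.4.3   # -> 3 (a replacement replica gets a fresh slot)
```

The key property is the last line: even when a replacement replica brings the replica count back to where it was, it gets a brand-new backend slot rather than inheriting the dead one's.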

With this in place, we can be reasonably sure that for the lifetime of the pod, we will always generate a new pgpool backend definition in any circumstance where we create a new read replica, even if the total number of read replicas stays effectively the same. Pgpool will simply see the no-longer-in-use IP address as "down" and will not attempt to send traffic to it.

Maybe the real friends were the yaks we shaved along the way: getting useful metrics into Google Cloud Monitoring

[image: a yak, shaving]

One of the many things we really appreciated about Pgbouncer was that it provided detailed metrics about what it was doing: per-user, per-db connection stats, number of connections waiting for a pool slot and so forth. Its metrics were well-documented enough that there was a telegraf input plugin for it, which meant that you could feed those metrics into any output system supported by telegraf: Prometheus, Datadog, NewRelic or (in our case), Google Cloud Monitoring, formerly known as Stackdriver.

Obviously we wanted similarly detailed monitoring for something as important as a replacement for pgbouncer, so we were happy to note that pgpool provides fairly detailed statistics by implementing a number of SQL-like commands that you can run in-band over a normal Postgres connection.

Unfortunately, those commands are only SQL-"like": they return what looks like a table in psql, but it's not actually a table. You can't feed it into an actual SQL query, even just to pick out certain rows or columns, which severely limits their usefulness.
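To make the "SQL-like" point concrete: the stats commands run over an ordinary client connection, but only as bare statements. The hostname, user, and database below are placeholders:

```shell
# pgpool intercepts pseudo-SQL commands like SHOW POOL_NODES itself;
# they never reach Postgres. Hostname/user/database are placeholders.
show_pool_nodes() {
  psql -h my-pgpool.internal -U app -d postgres -c 'SHOW POOL_NODES;'
}

# By contrast, anything treating the output as a real relation fails,
# because there is no actual pool_nodes table on the backend:
#
#   psql -h my-pgpool.internal -U app -d postgres \
#     -c 'SELECT status FROM pool_nodes;'   # ERROR: relation does not exist
```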

This is obviously a profound limitation, and the pgpool maintainers have a straightforward solution to it: they provide a Postgres extension called pgpool-adm, and all you have to do is compile it and install it into all of your Postgres nodes, and oh bother we're using a managed cloud product and you absolutely can't do any of that.

Luckily, there's another way to skin the cat, also provided by the pgpool maintainers: the pgpool2_exporter is a tool written in Go which connects to a pgpool instance, runs the various "SQL-like" commands, parses their output and exposes them as an HTTP endpoint which can be scraped by any tool that understands the Prometheus metrics format, such as... our old friend telegraf.

So our 2-container pod (one for the discovery script, one for pgpool itself) became a 4-container pod:

  1. pgpool
  2. the discovery/updater script
  3. pgpool2_exporter, which scrapes pgpool metrics and exposes them as an http endpoint
  4. telegraf, which reads the exporter endpoint and forwards the metrics to Stackdriver

...or at least that was the plan. Getting there took a bit of doing: it turned out that both pgpool2_exporter and telegraf needed upstream fixes before the pipeline worked end to end.

At the time of this writing, our published Helm chart for pgpool-cloudsql uses a forked version of both pgpool2_exporter and telegraf, pending those upstream pull requests being merged. We hope very much to have that no longer be the case soon!
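With all four containers running, a quick way to sanity-check the pipeline is to scrape the exporter yourself. Port 9719 is pgpool2_exporter's conventional default, but treat that (and the metric prefix) as assumptions to verify against your own deployment:

```shell
# Dump whatever the exporter is currently publishing: plain Prometheus
# text format, one metric per line. Port 9719 is an assumption.
scrape_pgpool_metrics() {
  curl -s "http://localhost:${EXPORTER_PORT:-9719}/metrics" | grep '^pgpool2_'
}
```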

And all jokes about yak-shaving aside, getting to contribute code and fixes to the larger ecosystem is one of the nicer fringe benefits of working here, and we recommend the warm and fuzzy feeling it produces to everyone. (Especially to our friends at Amazon.)

Play along with the home game: trying this yourself!

If you're a Google Cloud Platform customer and you're familiar with Helm, you can install the whole shebang into your own Kubernetes cluster!

You can find the source code and documentation at https://github.com/odenio/pgpool-cloudsql, and you can launch it into your cluster by adding our Helm repository and launching the pgpool-cloudsql chart:

helm repo add odenio https://odenio.github.io/pgpool-cloudsql && \
  helm repo update

RELEASE_NAME="my-pgpool-service" # a name (you will need 1 installed chart for each primary DB)
NAMESPACE="my-k8s-namespace"     # a Kubernetes namespace
CHART_VERSION="1.0.4"            # a chart version: https://github.com/odenio/pgpool-cloudsql/releases
VALUES_FILE="./my_values.yaml"   # your values file

helm install \
  "${RELEASE_NAME}" \
  odenio/pgpool-cloudsql \
  --atomic \
  --timeout=5m0s \
  --namespace="${NAMESPACE}" \
  --version="${CHART_VERSION}" \
  --values="${VALUES_FILE}"

Please note that there are several Helm values that you are required to fill out; a bare-minimum local my_values.yaml would look like this:

discovery:
  primaryInstancePrefix: my-cloudsql-instance-name-prefix
pgpool:
  srCheckUsername: <postgres_username>
  srCheckPassword: <postgres_password>
  healthCheckUsername: <postgres_username>
  healthCheckPassword: <postgres_password>
exporter:
  postgresUsername: <postgres_username>
  postgresPassword: <postgres_password>

The full set of options can be seen in the chart values file.

The Future, Mister Gittes: where to next?

One thing that should be obvious is that while the specific implementation here is tied to Google Cloud Platform's APIs and tools, the general concept should work just fine on any cloud provider. Despite our snarking about Amazon's idiosyncratic approach to open source software licensing, it would probably be a great fit for people trying to reach Amazon RDS for Postgres instances from inside Amazon Elastic Kubernetes Service clusters -- likewise Azure Database for PostgreSQL and Azure Kubernetes Service. We would be delighted to entertain pull requests implementing support for these or any other cloud providers!

Another thing that's pretty clear at this point is that Stackdriver is not the most popular monitoring platform in the world, and since we already have telegraf in the mix, it should be reasonably straightforward to add support for Prometheus, NewRelic or whatever your favorite tool in this space is.

Other areas for improvement:

  • The scenario of a manual promotion of a CloudSQL Read Replica to primary status is not currently well tested.

  • Only a subset of PGPool's runtime configuration options can currently be set via Helm values.

  • Currently, pgpool-cloudsql uses passthrough cleartext authentication to the CloudSQL backend, which is obviously suboptimal; support for other auth methods would be great.

  • Likewise, support for configuring SSL/TLS encryption would be great.

We hope to implement many of these ourselves in the near future, but if you're feeling impatient, pull requests are welcome!