Here at Tomorrow.io, we lean on fantastic open source tools such as JupyterHub to boost scientist productivity and support science and data science applications. These days, our preferred way to host a JupyterHub instance for the whole R&D team is to run a custom-flavored “Pangeo cluster” on top of Google Kubernetes Engine (GKE), with bells and whistles that give the team whatever resources they need to tackle big weather and climate data problems.

But as an enterprise company, we take security very seriously. So just how do you secure a Pangeo GKE cluster on an untrusted network with no VPN client while allowing your users to authenticate seamlessly using their regular SSO credentials? With Identity Aware Proxy of course!

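Identity Aware Proxy (IAP) is a Google Cloud specific solution that allows for authentication via OAuth to HTTPS or SSH/TCP resources. Here I am using IAP in two contexts: to reach the bastion host of a hardened private GKE cluster over SSH, and to protect the Pangeo web application itself over HTTPS.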

The end result is an authentication screen that prompts users for credentials before they view the HTTPS-secured Pangeo landing page.

Hardened Private GKE Cluster

The “Safer Cluster” Terraform module follows Google’s recommendations for hardening GKE clusters, found here. Using this secured private cluster had three main consequences.

The first consequence of the secure cluster was that kubectl and helm had to go through a proxy, as documented on the Terraform module page. I had to be logged in through the gcloud utility (which provides the IAP authentication) and keep an SSH connection to the bastion host open in one of my terminals while I worked. I also had to set an environment variable in the terminal where I ran kubectl:

export HTTPS_PROXY=localhost:8888

Once I did that, I could use kubectl and helm as usual, although they would intermittently fail with a connection error. I resolved this by closing and reopening the SSH connection to the bastion host. One caveat: this environment variable interfered with other tools that make their own HTTPS connections, such as the gcloud authentication login command. I resolved this by unsetting the environment variable before running them.
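
One way to sidestep the caveat above is to set the proxy for a single child process instead of exporting it for the whole terminal session. The following is a minimal sketch of that idea, not the setup I actually used; the helper name run_via_proxy is hypothetical, and the default proxy address simply mirrors the export line shown earlier.

```python
import os
import subprocess
import sys


def run_via_proxy(cmd, proxy="localhost:8888"):
    """Run cmd with HTTPS_PROXY set for that child process only.

    The parent environment is copied, not modified, so later commands
    (for example a gcloud login) never see the proxy setting.
    """
    env = {**os.environ, "HTTPS_PROXY": proxy}
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=False)


if __name__ == "__main__":
    # Stand-in for something like ["kubectl", "get", "pods"]: a child Python
    # process that simply reports the proxy setting it was given.
    probe = [sys.executable, "-c", "import os; print(os.environ.get('HTTPS_PROXY'))"]
    print(run_via_proxy(probe).stdout.strip())
```

In real use the command list would be a kubectl or helm invocation, and the SSH connection to the bastion host still has to be open for the proxy to answer.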

The second consequence of using the secure cluster was the hardened Pod Security Policy. The advantage here is that no privileged pods can be created. However, when I naively used Helm to install Pangeo without understanding this, services could not launch pods at all, not even unprivileged ones. I had to create a Pod Security Policy that allowed the creation of unprivileged pods, then create a cluster role and cluster role binding that permitted all authenticated users to use this policy.

Podsecuritypolicy.yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false  # Don't allow privileged pods!
  # The rest fills in some required fields.
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
  - '*'
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: psp:restricted
rules:
- apiGroups:
  - extensions
  resources:
  - podsecuritypolicies
  resourceNames:
  - restricted # the psp we are giving access to
  verbs:
  - use
---
# This applies psp/restricted to all authenticated users
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: psp:restricted
subjects:
- kind: Group
  name: system:authenticated # All authenticated users
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: psp:restricted # A reference to the role above
  apiGroup: rbac.authorization.k8s.io

Then services were able to create pods as usual. 

The third main consequence of using the hardened cluster is that it does not work with a SaaS continuous deployment provider that connects over the public internet, such as Terraform Cloud. Configuring CD for this setup is outside the scope of this document.

Securing the Pangeo Application with IAP

For this, I relied in part on Google’s documentation, although much of what I needed was not covered there. There is also a blog post from DoiT about securing Vault that includes helpful information on IAP about halfway down.

Summary of the steps for enabling IAP for Pangeo:

 These first five steps are documented in Google’s documentation 

These next steps were not part of the Google documentation but were validated by an expert from DoiT:

These steps were specific to configuring Pangeo for working with IAP and https:

Create a global static IP address

Follow https://cloud.google.com/compute/docs/ip-addresses/reserve-static-external-ip-address and make a note of the name of the IP address so it can be associated with the Ingress.

Create a DNS record pointing to the IP address

Follow https://cloud.google.com/dns/docs/records/ to create a DNS record that points to the static IP address created above.

Configure the Pangeo proxy-public service

This should be done with Helm. Enable https with type set to offload and hosts set to an array containing your domain. Set the proxy service type to NodePort and add the annotation referencing the IAP Backend Config.

pangeo:
  jupyterhub:
    proxy:
      https:
        enabled: true
        type: offload
        hosts:
        - pangeo.example.com
      service:
        type: NodePort
        labels: {}
        annotations:
          beta.cloud.google.com/backend-config: '{"default": "iap-backend-config"}'

Create a Google Managed Certificate

I added this to our helm deployment:

apiVersion: networking.gke.io/v1beta1
kind: ManagedCertificate
metadata:
  name: {{ .Values.cert_name }}
spec:
  domains:
    - {{ .Values.domain }}

Be aware that both the certificate and the Ingress take some time to provision. They will not work immediately after they have been deployed.
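
Because provisioning is slow, it can help to poll the certificate rather than refresh the console by hand. Below is a rough sketch of such a wait loop. The function names are hypothetical, and the status field (.status.certificateStatus) and the "Active" value reflect my understanding of the ManagedCertificate resource, so check them against your own cluster with kubectl describe before relying on this.

```python
import subprocess
import time


def kubectl_cert_status(name, namespace="default"):
    """Read the certificate status of a ManagedCertificate via kubectl.

    Assumes kubectl is on PATH and already pointed at the right cluster.
    """
    result = subprocess.run(
        ["kubectl", "get", "managedcertificate", name, "-n", namespace,
         "-o", "jsonpath={.status.certificateStatus}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_until_active(get_status, timeout_s=3600, interval_s=30, sleep=time.sleep):
    """Poll get_status() until it returns "Active"; return False on timeout."""
    waited = 0
    while True:
        if get_status() == "Active":
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

A call might look like wait_until_active(lambda: kubectl_cert_status("my-cert", "my-namespace")), where both names are placeholders. Passing the status function in as an argument keeps the loop easy to exercise without a cluster.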

Create an Ingress

At first I used the Ingress YAML file that ships with JupyterHub to create the Ingress. However, the path is hard-coded in its Helm template file, which causes a 404 when the Ingress comes up. I therefore created my own Ingress and added it as a Helm template:

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: pangeo-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: {{ .Values.static_ip_name }}
    networking.gke.io/managed-certificates: {{ .Values.cert_name }}
spec:
  backend:
    serviceName: proxy-public
    servicePort: 80

Validate the domain ownership

Navigate to https://www.google.com/webmasters/verification/home?hl=en and click “Add a Property”. Enter the DNS name you created. Click “Alternate Methods”, then “Domain Name Provider”, and select Google Domains. It will give you a TXT record to add to your DNS. Add the record as documented in https://cloud.google.com/dns/docs/records/ . Wait a minute or two, then click “Verify”. If it doesn’t go through right away, wait another few minutes and try again.

Recreate the ingress and the certificate

I ran into an issue where the Certificate was stuck in the Provisioning state after I changed the IP address it pointed to and verified the domain. Deleting the Ingress and the Certificate and recreating them cleared this up.

Add Users as IAP Tunnel Users

In the Identity Aware Proxy screen under Security you should see your Ingress with IAP turned on. You need to select this Ingress and in the side panel click “Add Member”. This could be any user or group you choose. If you forget this step your users will be unable to log in and will get a denial screen.

Configure the Dask Gateway Address

The Dask Gateway address should not point to the external domain (in our case pangeo-dev.research.www.tomorrow.io) but to the internal service (in our case traefik-www.tomorrow.io-prod-dask-gateway.www.tomorrow.io-prod, where Tomorrow.io-prod is the namespace) so that it bypasses the IAP challenge.

    singleuser:
      extraEnv:
        # DASK_GATEWAY__ADDRESS: "https://pangeo-dev.research.www.tomorrow.io/services/dask-gateway/"  # Wrong
        DASK_GATEWAY__ADDRESS: "http://traefik-www.tomorrow.io-prod-dask-gateway.www.tomorrow.io-prod/services/dask-gateway"  # Right

To determine your service’s URL, take the service name of your Dask Gateway and combine it with the namespace it is running in, in the form http://<service-name>.<namespace>.
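
As a small illustration of that naming rule, here is a sketch of a helper that assembles the in-cluster address from its parts. The function name and the example service and namespace names are hypothetical; the default path matches the one used in the configuration above.

```python
def dask_gateway_address(service, namespace, path="/services/dask-gateway"):
    """Build the in-cluster Dask Gateway URL as http://<service>.<namespace><path>.

    Plain http is used because the request stays inside the cluster and
    never passes through the IAP-protected Ingress.
    """
    if not service or not namespace:
        raise ValueError("both service and namespace are required")
    return f"http://{service}.{namespace}{path}"


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(dask_gateway_address("traefik-dask-gateway", "pangeo"))
```

The result is the value to pass as DASK_GATEWAY__ADDRESS. The service name and namespace can be read from the output of kubectl get svc --all-namespaces.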

Ensure that both Dask and Bokeh are up to date

While building the Docker images for Kubernetes, we ran into two issues that were resolved by upgrading versions. The first was a breaking change in Dask 2.19.0, documented here. The second was a bug introduced in Bokeh 2.1.0 that caused the dashboard to return intermittent 404s; it was resolved in 2.1.1 and is documented here.
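
A quick guard in an image build or startup script can catch the affected Bokeh release before users hit it. This is only a sketch based on the version numbers mentioned above; the function names are hypothetical, and the parser deliberately handles only plain numeric versions such as "2.1.0".

```python
def parse_version(text):
    """Turn "2.1.0" into (2, 1, 0); only plain numeric versions are handled."""
    return tuple(int(part) for part in text.split("."))


def bokeh_has_dashboard_404_bug(version):
    """Return True for Bokeh 2.1.0, the release described above as affected."""
    return parse_version(version) == (2, 1, 0)
```

Inside the image, the installed version string can be obtained with importlib.metadata.version("bokeh") (Python 3.8 or later) and passed to the check.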

Now you know how IAP can be leveraged from the ground up to create a secure Kubernetes environment for Pangeo, while preserving easy access for your users.

How do you secure your Pangeo cluster?