The RMBL Compute Hub: A Deep Dive

The problem we’re solving

Most of us have lived some version of this story. A collaborator shares a promising dataset — a stack of climate rasters, a season of drone imagery, a multi-terabyte hyperspectral mosaic — and the first hour of “doing science” evaporates into something else entirely. You install GDAL and it fights with your existing PROJ. You discover the file is 40 GB and your laptop has 16. You download a subset, then realize you need a different subset. A teammate runs the same analysis and gets a different answer because their package versions drifted. By the time the environment works, the afternoon is gone and you haven’t looked at a single number.

Environmental science at RMBL has become a data-intensive discipline. The upper East River watershed is now one of the most thoroughly instrumented and remotely sensed mountain landscapes in the world. Between the Spatial Data Platform (SDP) catalog of cloud-optimized climate, snow, vegetation, and terrain products and the CHESS project’s NEON Airborne Observation Platform (AOP) imaging-spectrometer mosaics, a single researcher now has casual access to data volumes that would have been a major facility’s holdings a decade ago. The bottleneck has shifted. It is no longer getting the data — it is having somewhere to compute on it without spending days on environment plumbing or waiting on multi-hundred-gigabyte downloads that your machine can’t hold anyway.

The RMBL Compute Hub exists to remove that bottleneck. It is a shared, authenticated, browser-based analysis environment that runs in the cloud, next to the data, with the entire RMBL geospatial software stack already installed and working. You open a browser tab, sign in with GitHub, pick how much horsepower you need, and you are in a fully configured JupyterLab or RStudio session in a couple of minutes. No installs, no downloads, no version drift — just analysis.

What it actually is

Under the hood, the Compute Hub is a multi-user JupyterHub running on Amazon’s managed Kubernetes service (EKS). You don’t need to know or care about any of that to use it — but a few of its properties are worth understanding because they shape how you should work.

It runs in the cloud, in the same AWS region as the data. All of RMBL’s spatial data lives in S3 in AWS’s us-east-2 region. The Compute Hub runs there too. That co-location is the whole game: when your notebook reads a slice of a cloud-optimized GeoTIFF or a chunk of a hyperspectral mosaic, the bytes travel across Amazon’s internal network at high bandwidth rather than down to your office over the internet. A read that would take minutes on your laptop takes seconds here.
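
To put rough numbers on the co-location argument, here is a back-of-the-envelope sketch. The 40 GB file size echoes the example in the introduction; both bandwidth figures are illustrative assumptions, not measurements of the Hub or of any particular office network, and real reads are also affected by latency, compression, and request overhead.

```python
def transfer_seconds(size_bytes: float, bandwidth_bits_per_s: float) -> float:
    """Idealized transfer time: payload size divided by sustained bandwidth."""
    return size_bytes * 8 / bandwidth_bits_per_s


FILE_BYTES = 40e9        # the 40 GB file from the introduction
OFFICE_LINK = 100e6      # ASSUMED: 100 Mbit/s office connection
IN_REGION_LINK = 10e9    # ASSUMED: 10 Gbit/s inside the AWS region

office_s = transfer_seconds(FILE_BYTES, OFFICE_LINK)
in_region_s = transfer_seconds(FILE_BYTES, IN_REGION_LINK)

print(f"office link: {office_s / 60:.0f} min")
print(f"in-region:   {in_region_s:.0f} s")
```

Under these assumed link speeds the same payload moves in under a minute inside the region and takes most of an hour over the office link. That is the gap the Hub is designed to close.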

Everything is preinstalled and consistent. Every session launches from the same versioned container image, which bakes in GDAL, PROJ, GEOS, the full PyData geospatial stack (xarray, dask, rioxarray, geopandas, zarr, netCDF4, s3fs), the R geospatial stack (terra, sf, stars, the tidyverse), and — most importantly — RMBL’s own pysdp and rSDP client packages, ready to import. When a colleague says “run my notebook,” it runs, because you are both standing on the identical software foundation.

You choose your hardware at login. After signing in you pick a profile:

Profile	CPUs	RAM	GPU	Best for
Small	2	8 GB	—	Exploratory work, small files, writing code
Medium	4	16 GB	—	Moderate analysis
Large	8	32 GB	—	Large in-memory datasets
GPU	4	16 GB	1	GPU-accelerated computation

The cluster scales the underlying machines up when someone needs them and back down to zero when nobody is working — which is what keeps a facility of this capability affordable for an organization RMBL’s size.

Your work persists. Each user gets a private 50 GB home directory that survives between sessions. Your notebooks, scripts, intermediate outputs, and any extra packages you install into your home directory (for Python, that means pip install --user) are all still there next time you log in. Other users cannot see your files.

Access is gated, not public. Sign-in is through GitHub OAuth, restricted to an allowlist of RMBL collaborators. There is no public sign-up. This is a working environment for the research community, not a public service.

The two worlds of data

The single most important thing to understand about working on the Hub is that it gives you two complementary ways to reach data, and choosing the right one for the job is most of what separates a smooth session from a frustrating one.

The SDP catalog — the recommended default

For the overwhelming majority of workflows, the right entry point is the Spatial Data Platform catalog. The SDP is RMBL’s curated library of cloud-optimized GeoTIFFs covering western Colorado — daily and modeled climate (temperature, precipitation), snow, vegetation indices, hydrology, and terrain — organized into a clean catalog with stable six-character product IDs.

You reach it through the preinstalled client packages, which are the primary, intended data API for the Hub: pysdp in Python and rSDP in R. The catalog ships baked into both packages, so discovery is instantaneous and works even offline. The underlying rasters are read lazily — nothing is downloaded until you actually slice or compute, and then only the bytes you asked for move.

In Python:

import pysdp

# Discover datasets — returns a pandas DataFrame
cat = pysdp.get_catalog(domains=["UG"], types=["Climate"], timeseries_types=["Daily"])
cat[["CatalogID", "Product", "MinDate", "MaxDate"]].head()

# Open a two-week slice of daily max temperature — lazy, nothing downloaded yet
tmax = pysdp.open_raster("R4D004", date_start="2022-07-01", date_end="2022-07-14")

The exact same operations in R, with the same vocabulary, returning native terra objects:

library(rSDP)

cat <- sdp_get_catalog(domains = "UG", types = "Climate", timeseries_types = "Daily")
head(cat[, c("CatalogID", "Product", "MinDate", "MaxDate")])

tmax <- sdp_get_raster("R4D004",
                       date_start = as.Date("2022-07-01"),
                       date_end   = as.Date("2022-07-14"))

This R/Python parity is deliberate and runs deep. pysdp is a feature-for-feature port of rSDP — same catalog, same vocabulary, same operations, the same uniform time handling across daily/monthly/yearly products. The intent is that a lab can have R people and Python people working side by side on the same data without ever having to translate one mental model into the other. Pick the language you think in; the data behaves identically either way.

The CHESS-specific datasets

The second world is the in-house data that doesn’t (yet) live in the public SDP catalog — most prominently the NEON AOP imaging-spectrometer mosaics: 426-band hyperspectral imagery over several research domains and years, totaling several terabytes. These live on the rmbl-chess-data S3 bucket, and Hub users have read-only access to them.

These mosaics are physically stored as hundreds of individual NetCDF tiles per domain-year, which would be miserable to work with one file at a time. To fix that, we expose each domain-year as a virtual Zarr store (built with Icechunk) so the whole thing opens as a single xarray.Dataset spanning all the underlying tiles — with no data duplication. You open one path and get a coherent, chunked, lazily loaded cube of imagery.

For ad-hoc access — listing files, grabbing a specific tile — s3fs/boto3 in Python or aws.s3 in R work directly against the bucket. The bucket is also FUSE-mounted at /data/shared/ for casual browsing, but for any serious read, direct SDK or Zarr access is substantially faster than going through the mount.
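
As a minimal sketch of that ad-hoc access pattern, the helper below lists the NetCDF tiles under a bucket prefix with s3fs. It assumes that credentials are picked up from the Hub session automatically and that tiles end in ".nc"; the folder layout inside rmbl-chess-data is not described here, so the prefix in the usage comment is a placeholder, not a real path.

```python
def open_bucket_fs():
    """Return an s3fs filesystem (ASSUMES credentials come from the session)."""
    import s3fs  # imported lazily so list_tiles() can be exercised without it
    return s3fs.S3FileSystem()


def list_tiles(fs, prefix, suffix=".nc"):
    """Return the sorted object paths under `prefix` that end with `suffix`."""
    return sorted(p for p in fs.ls(prefix, detail=False) if p.endswith(suffix))


# On the Hub (placeholder prefix; substitute a real folder in the bucket):
# fs = open_bucket_fs()
# tiles = list_tiles(fs, "rmbl-chess-data/<some-folder>/")
# print(len(tiles), "tiles found")
```

Keeping the filesystem object as an argument makes the listing logic easy to reuse with any fsspec-style filesystem, including a local one when testing.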

The mental model to carry: reach for the SDP catalog and pysdp/rSDP first. Drop down to direct rmbl-chess-data access when, and only when, you need the CHESS-specific products that aren’t in the catalog.

How to get the most out of it

The Hub rewards a few habits. None are difficult, and together they’re the difference between fighting the environment and forgetting it’s there.

Start small and size up deliberately

Begin every project on the Small profile. Writing code, debugging, and exploring a catalog need almost no resources, and a Small server starts fastest and costs least. When you hit a wall (a computation that’s genuinely memory-bound, or one you want to parallelize across more cores), stop your server and restart on a larger profile. Resizing usually takes a minute or two, longer if the cluster has to scale up a new machine, and your home directory comes with you. Reflexively grabbing Large “to be safe” is the single most common way the shared budget gets burned for no benefit.
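
One way to make "size up deliberately" concrete is to estimate the in-memory footprint of the array you intend to load and compare it with the profile table. The sketch below does that arithmetic. The RAM figures come from the table above; the 50% usable-memory margin, the float32 assumption, and the example grid sizes are hypothetical choices for illustration.

```python
PROFILES_GB = {"Small": 8, "Medium": 16, "Large": 32}  # RAM per profile


def array_gb(n_time: int, n_y: int, n_x: int, bytes_per_value: int = 4) -> float:
    """Uncompressed size of a (time, y, x) array in GB (float32 by default)."""
    return n_time * n_y * n_x * bytes_per_value / 1e9


def smallest_profile(need_gb: float, usable_fraction: float = 0.5):
    """Smallest profile whose usable RAM covers `need_gb`, else None.

    `usable_fraction` is an ASSUMED safety margin that leaves room for
    intermediate copies and the rest of the session.
    """
    for name, ram_gb in PROFILES_GB.items():  # ordered smallest to largest
        if need_gb <= ram_gb * usable_fraction:
            return name
    return None  # does not fit in memory: subset further or use Dask


# Hypothetical example: one year of daily float32 layers.
full_year = array_gb(365, 2000, 2000)   # whole grid
cropped = array_gb(365, 500, 500)       # cropped to a study area

print(f"{full_year:.2f} GB ->", smallest_profile(full_year))
print(f"{cropped:.3f} GB ->", smallest_profile(cropped))
```

A result of None is a signal to subset harder or switch to out-of-core computation, not to ask for more hardware.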

Let the data stay lazy, and filter before you compute

The cardinal rule of cloud-native geospatial work: subset before you materialize. Both pysdp and xarray give you lazy handles to enormous datasets. The skill is in narrowing — by time, by spatial window, by variable — before you trigger an actual read or computation. Open a multi-year raster, slice it to your study polygon and your season of interest, then compute. Done right, you can do real science against a multi-terabyte mosaic on a Small server, because you only ever touch the kilobytes you need.
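
The payoff of subsetting first is easy to see with a little arithmetic. The sketch below compares the uncompressed size of a whole data cube with a two-week, 10 km by 10 km window. The bounding box matches the one used in the Dask example in the next section; the 25 m pixel size and the 4000 x 4000 full grid are hypothetical values, and the actual bytes read from a cloud-optimized file also depend on its internal tiling and compression.

```python
import math

BYTES_PER_VALUE = 4  # ASSUMED float32 values


def window_shape(x_min, x_max, y_min, y_max, pixel_size):
    """Rows and columns needed to cover a bounding box at a given pixel size."""
    n_cols = math.ceil((x_max - x_min) / pixel_size)
    n_rows = math.ceil((y_max - y_min) / pixel_size)
    return n_rows, n_cols


def cube_bytes(n_time, n_rows, n_cols):
    """Uncompressed size in bytes of a (time, y, x) cube."""
    return n_time * n_rows * n_cols * BYTES_PER_VALUE


# HYPOTHETICAL product: three years of daily layers on a 4000 x 4000 grid.
full_bytes = cube_bytes(1096, 4000, 4000)

# Two weeks over a 10 km x 10 km box at an assumed 25 m pixel size.
rows, cols = window_shape(325000, 335000, 4310000, 4320000, pixel_size=25)
subset_bytes = cube_bytes(14, rows, cols)

print(f"full cube: {full_bytes / 1e9:.1f} GB")
print(f"subset:    {subset_bytes / 1e6:.1f} MB")
print(f"reduction: about {full_bytes / subset_bytes:.0f}x")
```

With these assumed dimensions the full cube is far beyond any profile's RAM, while the subset fits comfortably on a Small server.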

Use Dask when data outgrows memory

When a dataset genuinely won’t fit in RAM, Dask gives you out-of-core, parallel computation. pysdp.open_raster() already returns Dask-backed arrays; you just attach a client to light up parallel execution and a live progress dashboard:

import pysdp
from dask.distributed import Client

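# 4 workers x 3 GB = 12 GB in total, so this needs a Medium (16 GB) or larger profile;
# on Small, reduce n_workers or memory_limit to stay under 8 GB
client = Client(n_workers=4, threads_per_worker=2, memory_limit="3GB")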

tmax = pysdp.open_raster("R4D004", date_start="2019-10-01", date_end="2022-09-30")
var  = tmax["bayes_tmax_est"]

# Spatially subset and reduce *before* computing
box    = var.sel(x=slice(325000, 335000), y=slice(4320000, 4310000))
annual = box.groupby("time.year").mean("time").compute()

The bundled tutorial 02_sdp_with_dask.ipynb walks through this in full, including multi-product math.
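
The worker settings in the example above have to fit inside whichever profile you launched. The helper below is one possible way to derive them from the profile table rather than hard-coding them. The CPU and RAM figures come from the table; the 75% memory budget and the one-thread-per-worker default are assumptions for illustration, not Hub recommendations. The returned keys (n_workers, threads_per_worker, memory_limit) are the keyword arguments dask.distributed.Client accepts for a local cluster.

```python
PROFILES = {
    "Small": {"cpus": 2, "ram_gb": 8},
    "Medium": {"cpus": 4, "ram_gb": 16},
    "Large": {"cpus": 8, "ram_gb": 32},
}


def dask_client_kwargs(profile: str, ram_fraction: float = 0.75,
                       threads_per_worker: int = 1) -> dict:
    """Suggest local Dask cluster settings that fit inside a Hub profile.

    `ram_fraction` is an ASSUMED budget: the share of the profile's RAM
    handed to Dask workers, leaving the rest for the notebook itself.
    """
    spec = PROFILES[profile]
    n_workers = max(1, spec["cpus"] // threads_per_worker)
    per_worker_gb = spec["ram_gb"] * ram_fraction / n_workers
    return {
        "n_workers": n_workers,
        "threads_per_worker": threads_per_worker,
        "memory_limit": f"{per_worker_gb:g}GB",
    }


print(dask_client_kwargs("Small"))
print(dask_client_kwargs("Medium"))

# On the Hub:
# from dask.distributed import Client
# client = Client(**dask_client_kwargs("Medium"))
```

Deriving the settings this way means a notebook moved between profiles keeps its workers within the memory actually available.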

Pick the right interface for the job

Once your server starts you land in JupyterLab (in the RMBL Dark theme — switch to the default under Settings → Theme if you prefer). From the launcher you can also open:

- RStudio: the full RStudio Server IDE, for those who live in R. It runs inside the same session with the same data access; library(rSDP) just works.
- A terminal: for git, file wrangling, and shell tools.

You’re not locked into one. A common pattern is exploratory analysis in JupyterLab and final figures or modeling in RStudio, against the same files in the same home directory.

Make your home directory work for you

Your 50 GB home directory persists, so use it as your project workspace. Clone your team’s GitHub repositories into it, keep your notebooks under version control, and commit and push regularly — that’s both your backup and how you share code with collaborators. You can install extra packages that survive restarts, with one important nuance:

pip install --user some-package      # persists across sessions

install.packages("some-package")     # installs into your home library, persists

In Python the --user flag matters: a plain pip install lands in the system environment and is wiped on restart, whereas --user installs into your home directory and sticks.
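
If you are unsure whether a package will survive a restart, you can check where Python actually imports it from. The sketch below uses only the standard library. It treats "located under your home directory" as a proxy for "persists", which follows from the description above but is an assumption about how the Hub is laid out; the function names are made up for this example.

```python
import importlib.util
import site
from pathlib import Path


def is_inside(path, root) -> bool:
    """True if `path` is located under `root` (after resolving both)."""
    try:
        Path(path).resolve().relative_to(Path(root).resolve())
        return True
    except ValueError:
        return False


def lives_in_home(module_name: str, home=None) -> bool:
    """Guess whether an importable module is installed under the home directory."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or not spec.origin or spec.origin in ("built-in", "frozen"):
        return False  # not installed, or compiled into the interpreter
    return is_inside(spec.origin, home or Path.home())


# Where `pip install --user` puts packages, and whether Python looks there.
print("user site-packages:", site.getusersitepackages())
print("user site enabled: ", site.ENABLE_USER_SITE)
print("json in home dir:  ", lives_in_home("json"))
```

A package that reports False here was most likely installed into the system environment and would need to be reinstalled with --user to stick.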

Be a good citizen of a shared, metered resource

The Hub is shared infrastructure on a real budget — roughly a few hundred dollars a month, with spot pricing and scale-to-zero keeping it affordable. Two habits keep it that way for everyone:

- Stop your server when you’re done (File → Hub Control Panel → Stop My Server). This is especially important on Large and GPU profiles, which hold more expensive machines. Idle servers are culled automatically after 30–60 minutes and capped at a 4–8 hour hard lifetime, but stopping manually frees resources immediately.
- Save often. When a server is culled, files on disk are safe but anything unsaved in a running kernel’s memory is lost. Treat culling as expected, not exceptional: it’s the mechanism that keeps the lights affordable.
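
For long-running loops, "save often" can be automated with a small checkpoint file in your home directory, so a culled server costs you one iteration rather than the whole run. The sketch below is one generic way to do it with the standard library. The file name, the JSON format, and the per-year loop are illustrative choices, and the temporary directory stands in for a folder in your home directory.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(state: dict, path) -> None:
    """Write `state` as JSON via a temp file, then rename it into place."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
        os.replace(tmp_name, path)  # a reader never sees a half-written file
    except BaseException:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


def load_checkpoint(path, default=None):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    path = Path(path)
    if not path.exists():
        return default
    with path.open() as handle:
        return json.load(handle)


# Demo in a temporary directory (on the Hub, use a path in your home directory).
with tempfile.TemporaryDirectory() as workdir:
    ckpt = Path(workdir) / "progress.json"
    state = load_checkpoint(ckpt, default={"done": []})
    for year in (2020, 2021, 2022):
        if year in state["done"]:
            continue  # finished before the last restart
        # ... expensive per-year computation would go here ...
        state["done"].append(year)
        save_checkpoint(state, ckpt)
    resumed = load_checkpoint(ckpt)

print(resumed)
```

Rerunning the same cell after a restart picks up from the last completed year instead of starting over.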

Where it fits in the RMBL data ecosystem

The Compute Hub is not a standalone product. It’s the heavy-compute end of a deliberately designed stack of RMBL data tools, all of which read from the same backend and speak the same vocabulary. Knowing the whole stack lets you move fluidly from “what data exists?” to “let me run a real analysis on it.”

The Spatial Data Platform is the shared foundation. It’s a static catalog of cloud-optimized GeoTIFFs in S3 — the single source of truth that several different tools read from in different ways. Everything below points back to it.

The Spatial Data Platform Browser is the no-code front door. Pan and zoom around the catalog in your browser, overlay RMBL research-site polygons and jurisdictional boundaries, click to pull values from a layer, and — crucially — copy out a generated rSDP code snippet for any view you’ve built. That snippet is the bridge: it drops straight into a Compute Hub session and turns a thing you looked at in the Browser into a reproducible analysis you can build on. The intended motion is Browser to discover and frame, Hub to compute.

pysdp and rSDP are the catalog’s programmatic clients, and they’re the connective tissue between the Browser and the Hub. They’re preinstalled here, but they’re also ordinary packages you can install on your own laptop (with pip install for pysdp and install_github for rSDP). The Hub simply gives them a fast, pre-configured, data-adjacent home. Anything you prototype locally runs unchanged on the Hub, and vice versa.

The Knowledge Commons sits alongside the geospatial stack rather than inside it, and it answers a different question: what’s already known? It unifies ~5,200 publications, ~1,200 datasets, and ~1,400 community documents from the upper East River watershed into a single searchable citation- and concept-graph, with an AI-assistant interface on top. A natural workflow: find the relevant literature and prior datasets in Knowledge Commons, then come to the Hub to run your own analysis informed by them.

Bloom Forecast is an example of what gets built on this foundation and delivered back out to a wider audience — a weekly map of wildflower bloom probability for 14 species, combining long-term phenology records with current-year climate. It’s the kind of applied, derivative product that starts as an analysis (the sort of work the Hub is for) and graduates into a public-facing tool.

The ArcGIS Online portal and research-sites map supply the authoritative vector reference layers — research-site polygons, wilderness and district boundaries, roads and trails — that show up as overlays in the Browser and anchor field planning. When your Hub analysis needs to be clipped to official site boundaries, those layers are the canonical source.

The through-line: the SDP catalog is the shared backend; the Browser is for discovery; pysdp/rSDP are the shared language; Knowledge Commons is the literature context; and the Compute Hub is where the serious, reproducible, heavy-data analysis actually happens, with the whole stack already installed and pointed at the data.

Getting started today

Here is the fastest path from reading this to doing real work:

- Sign in. Go to rmblcomputehub.org and click Sign in with GitHub. If your GitHub username isn’t yet on the allowlist (a “403 Forbidden” is the tell), email the admin to be added.
- Launch a Small server. Pick the Small profile and click Start server. The first launch can take 2–5 minutes while the cluster scales up a machine. That’s normal, and only happens on a cold start.
- Open the Welcome notebook. A Welcome.ipynb is placed in your home directory on every spawn. It’s a guided tour with runnable examples for both data worlds.
- Work through the tutorials. The bundled tutorials/ folder is the fastest way to internalize the patterns:
  - 01_sdp_catalog_basics.ipynb: discovering and opening SDP data (Python)
  - 02_sdp_with_dask.ipynb: SDP at scale with Dask (Python)
  - 03_distributed_analysis_with_dask.ipynb: Dask patterns on AOP mosaics
  - 04_spectral_indices.ipynb: hyperspectral analysis on the AOP data
  - 05_sdp_basics_R.ipynb: the same SDP basics, in R via rSDP
- Bring your own project. Clone a repo into your home directory, install whatever extra packages you need (pip install --user … / install.packages(…)), and go.

If something breaks, the user guide has a troubleshooting section covering the usual suspects — servers slow to start (the cluster is scaling), kernels dying (you ran out of memory; use Dask or a bigger profile), and packages vanishing after a restart (use --user). For anything else, contact the admin or open an issue on the project repository.

What’s next

The Hub is launching as a working tool, and it will keep growing with the community’s needs. On the near horizon: pinning the pysdp/rSDP versions per release for groups that value reproducibility over freshness; finishing and verifying the virtual Zarr stores across all AOP domain-years; a documented EFS backup policy for home directories; and evaluating in-notebook AI coding assistants. Longer term we’re watching whether a shared distributed-compute layer (Dask Gateway across multiple pods) and lighter-weight GitOps-style image updates earn their keep.

If you have a workflow that doesn’t fit, a dataset you wish were mounted, or a profile size that’s wrong for your work, that feedback directly shapes where this goes. The Hub is infrastructure for RMBL science — the more we know about the science you’re trying to do on it, the better we can make it.

Welcome aboard. See you at rmblcomputehub.org.