I needed to investigate some data quickly and used the following pattern to do so. Using an e-commerce store as the example data model, here's what I did.
The Setup
First, set up a `cube.js` configuration file defining the `dbType`:
```js
module.exports = {
  dbType: 'duckdb',
};
```
Then, define the `driverFactory` and configure access to the GCS bucket:
```js
const driverFactory = async () => {
  const { DuckDBDriver } = require('@cubejs-backend/duckdb-driver');
  return new DuckDBDriver({
    database: ':memory:',
    initSql: `
      INSTALL httpfs;
      LOAD httpfs;
      CREATE SECRET (
        TYPE gcs,
        KEY_ID 'YOUR_KEY_ID',
        SECRET 'YOUR_SECRET_ID'
      );
    `,
  });
};
```
Finally, add to `initSql` the commands that set up our views and expose our data:
```sql
CREATE VIEW orders AS
SELECT * FROM read_parquet('gs://tmp-bucket/orders/*.parquet');

CREATE VIEW products AS
SELECT * FROM read_parquet('gs://tmp-bucket/products/*.parquet');

CREATE VIEW line_items AS
SELECT * FROM read_parquet('gs://tmp-bucket/line_items/*.parquet');

CREATE VIEW users AS
SELECT * FROM read_parquet('gs://tmp-bucket/users/*.parquet');

CREATE VIEW inventory_transactions AS
SELECT * FROM read_parquet('gs://tmp-bucket/inventory/*.parquet');
```
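With the views in place, a typical ad-hoc query might look like the sketch below. The column names (`id`, `product_id`, `quantity`, `price`, `name`) are assumptions about the example schema, not something defined above:

```sql
-- Revenue per product, streamed straight from the Parquet files on GCS.
-- Column names below are assumed for illustration; adjust to your schema.
SELECT
    p.name                      AS product,
    SUM(li.quantity * li.price) AS revenue
FROM line_items li
JOIN products p ON p.id = li.product_id
GROUP BY p.name
ORDER BY revenue DESC
LIMIT 10;
```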
Key Implementation Details
Views, Not Tables: I specifically used `CREATE VIEW` instead of `CREATE TABLE AS` to avoid loading all the data into memory. Views store only the query definition; DuckDB streams the Parquet data chunk by chunk when you actually query it. With tables, DuckDB would copy everything into its internal format, eating up memory or disk space.
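The difference is easy to see side by side; only the second statement materializes data (reusing the bucket path from the config above):

```sql
-- View: stores only the query text; Parquet files are scanned at query time.
CREATE VIEW orders AS
SELECT * FROM read_parquet('gs://tmp-bucket/orders/*.parquet');

-- Table: copies every row into DuckDB's own storage up front,
-- costing memory (or disk, with a file-backed database).
CREATE TABLE orders_materialized AS
SELECT * FROM read_parquet('gs://tmp-bucket/orders/*.parquet');
```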
Zero Infrastructure: The `:memory:` database means no provisioning. Each Cube process gets a fresh DuckDB instance that reads directly from existing Parquet files.
Direct Access: No ETL pipeline, no data loading. The `read_parquet()` function streams data directly from GCS, and DuckDB's columnar engine fetches only the columns needed for each query.
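For instance, a query that references only two columns reads only those column chunks from GCS (the `status` and `total` column names are assumptions about the example schema):

```sql
-- DuckDB's Parquet reader prunes the scan to the referenced columns,
-- so only the 'status' and 'total' column chunks are fetched from GCS.
-- (Both column names are assumed for illustration.)
SELECT status, SUM(total)
FROM orders
GROUP BY status;
```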
Wildcard Patterns: The `/*.parquet` pattern lets you treat multiple files as a single table without any preprocessing. Perfect for data that's already partitioned by date or other dimensions.
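If the files happen to be laid out in Hive-style partitions, DuckDB can also expose the partition values as columns and prune files before reading any data. A sketch, assuming a hypothetical `dt=` directory layout that is not part of the setup above:

```sql
-- Glob across all partitions; hive_partitioning exposes 'dt' as a column,
-- and the WHERE clause prunes to matching files before any data is read.
-- (The dt= directory layout is a hypothetical example.)
SELECT *
FROM read_parquet('gs://tmp-bucket/orders/dt=*/*.parquet',
                  hive_partitioning = true)
WHERE dt = '2024-01-15';
```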
Performance Characteristics
In this example, you can run complex queries across orders, products, line items, and inventory, joining millions of rows within seconds. What would normally require hours of data pipeline setup works immediately.
DuckDB's aggressive caching means the first query might take a few seconds to fetch Parquet metadata, but subsequent queries run nearly instantly even on larger datasets.
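If you want to see where the time goes, DuckDB's built-in profiling works on these remote scans too:

```sql
-- EXPLAIN ANALYZE executes the query and reports per-operator timings,
-- which helps separate the one-off metadata fetch from the actual scan cost.
EXPLAIN ANALYZE
SELECT COUNT(*) FROM orders;
```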
Trade-offs
Benefits:
- Zero data movement: queries run directly against GCS files
- No infrastructure to manage
- Instant setup for ad-hoc analysis
- Reuses existing Parquet files from other processes
Drawbacks:
- `:memory:` means state vanishes on restart
- Not suitable for production workloads
- Performance degrades on datasets larger than available RAM
Notes
This pattern works for any quick data investigation. Got Parquet files? Throw them in GCS, point Cube at them with this config, and you're analyzing data in minutes. For more permanent setups, swap `:memory:` for a file path to persist the DuckDB catalog between restarts.
Your data lake is already an OLAP database. You just need DuckDB to unlock it.