I needed to investigate some data quickly and used the following pattern to do so. Using an e-commerce store as the example data model, here's what I did.
The Setup
First, set up a `cube.js` configuration file defining the `dbType`:
```js
module.exports = {
  dbType: 'duckdb',
};
```
Then, define the `driverFactory` and configure access to the GCS bucket:
```js
const driverFactory = async () => {
  const { DuckDBDriver } = require('@cubejs-backend/duckdb-driver');
  return new DuckDBDriver({
    database: ':memory:',
    initSql: `
      INSTALL httpfs;
      LOAD httpfs;
      CREATE SECRET (
        TYPE gcs,
        KEY_ID 'YOUR_KEY_ID',
        SECRET 'YOUR_SECRET_ID'
      );
    `,
  });
};
```
Finally, add to `initSql` the commands that set up our views and expose our data:
```sql
CREATE VIEW orders AS
SELECT * FROM read_parquet('gs://tmp-bucket/orders/*.parquet');

CREATE VIEW products AS
SELECT * FROM read_parquet('gs://tmp-bucket/products/*.parquet');

CREATE VIEW line_items AS
SELECT * FROM read_parquet('gs://tmp-bucket/line_items/*.parquet');

CREATE VIEW users AS
SELECT * FROM read_parquet('gs://tmp-bucket/users/*.parquet');

CREATE VIEW inventory_transactions AS
SELECT * FROM read_parquet('gs://tmp-bucket/inventory/*.parquet');
```
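With the views in place, a typical ad-hoc query might look like the sketch below. The column names (`id`, `product_id`, `quantity`, `price`, `name`) are assumptions about the example schema, not something defined above:

```sql
-- Revenue per product, streamed straight from the Parquet files on GCS.
-- Column names below are assumed for illustration; adjust to your schema.
SELECT
    p.name                      AS product,
    SUM(li.quantity * li.price) AS revenue
FROM line_items li
JOIN products p ON p.id = li.product_id
GROUP BY p.name
ORDER BY revenue DESC
LIMIT 10;
```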
Key Implementation Details
Views, Not Tables: I specifically used `CREATE VIEW` instead of `CREATE TABLE AS` to avoid loading all the data into memory. Views store only the query definition; DuckDB streams the Parquet data chunk by chunk when you actually query it. With tables, DuckDB would copy everything into its internal format, eating up memory or disk space.
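The difference is easy to see side by side; only the second statement materializes data (reusing the bucket path from the config above):

```sql
-- View: stores only the query text; Parquet files are scanned at query time.
CREATE VIEW orders AS
SELECT * FROM read_parquet('gs://tmp-bucket/orders/*.parquet');

-- Table: copies every row into DuckDB's own storage up front,
-- costing memory (or disk, with a file-backed database).
CREATE TABLE orders_materialized AS
SELECT * FROM read_parquet('gs://tmp-bucket/orders/*.parquet');
```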
Zero Infrastructure: The `:memory:` database means no provisioning. Each Cube process gets a fresh DuckDB instance that reads directly from existing Parquet files.
Direct Access: No ETL pipeline, no data loading. The `read_parquet()` function streams data directly from GCS, and DuckDB's columnar engine fetches only the columns needed for each query.
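For instance, a query that references only two columns reads only those column chunks from GCS (the `status` and `total` column names are assumptions about the example schema):

```sql
-- DuckDB's Parquet reader prunes the scan to the referenced columns,
-- so only the 'status' and 'total' column chunks are fetched from GCS.
-- (Both column names are assumed for illustration.)
SELECT status, SUM(total)
FROM orders
GROUP BY status;
```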
Wildcard Patterns: The `/*.parquet` pattern lets you treat multiple files as a single table without any preprocessing. Perfect for data that's already partitioned by date or other dimensions.
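If the files happen to be laid out in Hive-style partitions, DuckDB can also expose the partition values as columns and prune files before reading any data. A sketch, assuming a hypothetical `dt=` directory layout that is not part of the setup above:

```sql
-- Glob across all partitions; hive_partitioning exposes 'dt' as a column,
-- and the WHERE clause prunes to matching files before any data is read.
-- (The dt= directory layout is a hypothetical example.)
SELECT *
FROM read_parquet('gs://tmp-bucket/orders/dt=*/*.parquet',
                  hive_partitioning = true)
WHERE dt = '2024-01-15';
```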
Performance Characteristics
In this example, you can run complex queries across orders, products, line items, and inventory, joining millions of rows within seconds. What would normally require hours of data pipeline setup works immediately.
DuckDB's aggressive caching means the first query might take a few seconds to fetch Parquet metadata, but subsequent queries run nearly instantly even on larger datasets.
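If you want to see where the time goes, DuckDB's built-in profiling works on these remote scans too:

```sql
-- EXPLAIN ANALYZE executes the query and reports per-operator timings,
-- which helps separate the one-off metadata fetch from the actual scan cost.
EXPLAIN ANALYZE
SELECT COUNT(*) FROM orders;
```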
Trade-offs
Benefits:
- Zero data movement: queries run directly against GCS files
- No infrastructure to manage
- Instant setup for ad-hoc analysis
- Reuses existing Parquet files from other processes
Drawbacks:
- `:memory:` means state vanishes on restart
- Not suitable for production workloads
- Performance degrades on datasets larger than available RAM
Notes
This pattern works for any quick data investigation. Got Parquet files? Throw them in GCS, point Cube at them with this config, and you're analyzing data in minutes. For more permanent setups, swap `:memory:` for a file path to persist the DuckDB catalog between restarts.
Your data lake is already an OLAP database. You just need DuckDB to unlock it.