We're going to create a cluster with 2 worker nodes and see the distribution one query on multiple tasks
Open 4 terminals
in the zeroth
docker network create citus-net
in the first
docker run --rm --name citus-1 --network citus-net --hostname citus-1 -p 5501:5432 -e POSTGRES_PASSWORD=mypassword citusdata/citus:12.1.6-alpine
in the second
docker run --rm --name citus-2 --network citus-net --hostname citus-2 -p 5502:5432 -e POSTGRES_PASSWORD=mypassword citusdata/citus:12.1.6-alpine
in the third
docker run --rm --name citus-3 --network citus-net --hostname citus-3 -p 5503:5432 -e POSTGRES_PASSWORD=mypassword citusdata/citus:12.1.6-alpine
in the zeroth
docker exec -it citus-1 psql -U postgres
select * from citus_add_node('citus-2', 5432);
select * from citus_add_node('citus-3', 5432);
select citus_set_coordinator_host('citus-1', 5432);
# https://github.com/citusdata/citus?tab=readme-ov-file#creating-distributed-tables
CREATE TABLE events (
device_id bigint,
event_id bigserial,
event_time timestamptz default now(),
data jsonb not null,
PRIMARY KEY (device_id, event_id)
);
SELECT create_distributed_table('events', 'device_id');
SELECT * FROM citus_shards;
SELECT * from pg_dist_shard;
SELECT * from citus_get_active_worker_nodes();
SELECT * FROM citus_check_cluster_node_health();
INSERT INTO events (device_id, data)
SELECT s % 100, ('{"measurement":'||random()||'}')::jsonb FROM generate_series(1,2000000) s;
# https://docs.citusdata.com/en/stable/performance/performance_tuning.html
SET citus.explain_all_tasks = 1;
EXPLAIN (VERBOSE ON) SELECT count(*) FROM events;
you'll get
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Aggregate (cost=250.00..250.02 rows=1 width=8)
Output: COALESCE((pg_catalog.sum(remote_scan.count))::bigint, '0'::bigint)
-> Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=100000 width=8)
Output: remote_scan.count
Task Count: 32
Tasks Shown: All
-> Task
Query: SELECT count(*) AS count FROM public.events_102008 events WHERE true
Node: host=citus-2 port=5432 dbname=postgres
-> Aggregate (cost=1449.00..1449.01 rows=1 width=8)
Output: count(*)
-> Seq Scan on public.events_102008 events (cost=0.00..1299.00 rows=60000 width=0)
Output: device_id, event_id, event_time, data
-> Task
Query: SELECT count(*) AS count FROM public.events_102009 events WHERE true
Node: host=citus-3 port=5432 dbname=postgres
-> Aggregate (cost=2454.00..2454.01 rows=1 width=8)
Output: count(*)
-> Seq Scan on public.events_102009 events (cost=0.00..2204.00 rows=100000 width=0)
Output: device_id, event_id, event_time, data
...
(230 rows)
Links
https://docs.citusdata.com/en/stable/get_started/concepts.html
https://github.com/citusdata/citus?tab=readme-ov-file#creating-distributed-tables
https://docs.citusdata.com/en/stable/admin_guide/cluster_management.html
https://docs.citusdata.com/en/stable/develop/api_udf.html#citus-add-node
https://docs.citusdata.com/en/stable/develop/api_udf.html#create-distributed-table
https://docs.citusdata.com/en/stable/performance/performance_tuning.html
https://www.cybertec-postgresql.com/en/citus-sharding-your-first-table/
https://docs.citusdata.com/en/v12.0/develop/reference_workarounds.html#limits