Unleash analytics on operational data with Hyperscale (Citus) on Azure Database for PostgreSQL

Hello everyone, I’m Marco Slot. I’m a principal engineer on the Citus team at Microsoft, and I’m going to show you some of the cool things you can do with Hyperscale (Citus).

Imagine you have an application that generates a very large number of events. For example, GitHub publishes every event that happens on a public repo as a large JSON object via its API. Now, what if you want to take that huge stream of data and create an analytics dashboard that gets updated in real time, responds very quickly to queries, and can handle many concurrent users? Finding a database that can do all those things at scale is challenging.

If I were building an application like this, I would probably use Postgres, because it’s so versatile. But once I get to more than a few hundred gigabytes of data (and my GitHub data is actually several terabytes), my dashboard will start getting slow and falling far behind, even if I pick the biggest Postgres node available.

If I build my app using Azure Database for PostgreSQL, then starting today I’m no longer limited to a single node, because I can use the new Hyperscale (Citus) option to scale out to as many as hundreds of nodes. Here I created a database cluster with 10 nodes, with a total of 320 cores and 2.5 terabytes of memory.

To set up my schema, I can use any Postgres tool. Here I’m using Azure Data Studio, and I created an events table with a JSON column for my raw JSON data. To distribute my Postgres table across all 10 nodes, I use the Citus function called create_distributed_table.

Now, I don’t want my application to interact with these bulky JSON objects directly. Instead, I want my application to have a table like GitHub commits, with a repo_name, an author_name, and a message for every commit. I’m also going to extract the number of commits per day for every repository, to render charts in my dashboard.

To go from the raw event data to these more structured tables, I’m going to do a transformation inside my database. I created a function called extract_commits, which takes the newest push events from the GitHub events table, extracts the commits from the push events, and inserts them into my GitHub commits table; it then also updates the daily_github_commits table. And because I’m using Hyperscale (Citus), this transformation actually runs in parallel across all of my hundreds of cores, and it runs as one big distributed transaction.

I’m actually ingesting data and running this transformation right now in the background. So far, I have over 2 TB of data in my database, and I can run queries against it, for example the number of commits over the last 30 days for one of the repositories. In the background, I’m replaying the entire GitHub history, which is several terabytes of data.

Now let’s have a look at all the things going on concurrently on my database cluster. In the top left, you see my data ingestion, streaming about 1 gigabyte of data into the database every few seconds. That’s around 7 million events per minute, with each event about 3 kilobytes in size. In the top right, you can see the extract_commits() function running in a loop, transforming and processing all the incoming events. And in the bottom right, I’m running a benchmarking tool that generates over 50,000 dashboard queries per second, as if there were thousands of users querying the dashboard at the same time, and they see an average response time of 2 milliseconds.

In addition to all these concurrent operations, I can also run low-latency updates, even on a table with a billion rows, and I can scan the full table in under 2 seconds.

No matter how big your data or how fast your data stream, with Hyperscale (Citus) on Azure Database for PostgreSQL you can always provision enough CPU, memory, and disk to parallelize your workload across a scaled-out Postgres cluster, and give your customers a level of performance you just can’t imagine on a single Postgres node.
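
The events table and distribution step described in the demo could look roughly like this. The actual schema from the video isn’t shown, so the table name, column names, and the choice of repo_id as the distribution column are illustrative assumptions:

```sql
-- Hypothetical sketch of the raw events table; names are assumptions,
-- not taken from the demo.
CREATE TABLE github_events (
    event_id   bigint,
    repo_id    bigint,
    created_at timestamptz,
    payload    jsonb          -- the raw JSON object from the GitHub API
);

-- Spread the table's shards across all 10 worker nodes. Sharding by
-- repo_id keeps all events for one repository on the same shard.
SELECT create_distributed_table('github_events', 'repo_id');
```

create_distributed_table is the real Citus function named in the demo; after it runs, inserts and queries against github_events are routed to the appropriate shards automatically.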
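
A transformation like the extract_commits function described above might be sketched as follows. The demo’s actual implementation isn’t shown, so the table names, JSON paths, and the simple time-window batching here are assumptions (a real version would more likely track a cursor of already-processed event IDs):

```sql
-- Hypothetical structured tables fed by the transformation.
CREATE TABLE github_commits (
    repo_name   text,
    author_name text,
    message     text,
    created_at  timestamptz
);
SELECT create_distributed_table('github_commits', 'repo_name');

CREATE TABLE daily_github_commits (
    repo_name    text,
    day          date,
    commit_count bigint,
    PRIMARY KEY (repo_name, day)
);
SELECT create_distributed_table('daily_github_commits', 'repo_name');

-- Illustrative sketch of extract_commits(); assumes the github_events
-- table has a jsonb payload column holding the raw GitHub event.
CREATE OR REPLACE FUNCTION extract_commits() RETURNS void
LANGUAGE plpgsql AS $$
BEGIN
    -- Unnest the commits array inside each recent push event and
    -- insert one row per commit into the structured table.
    INSERT INTO github_commits (repo_name, author_name, message, created_at)
    SELECT e.payload->'repo'->>'name',
           c->'author'->>'name',
           c->>'message',
           e.created_at
    FROM github_events e,
         jsonb_array_elements(e.payload->'payload'->'commits') AS c
    WHERE e.payload->>'type' = 'PushEvent'
      AND e.created_at >= now() - interval '1 minute';

    -- Maintain the per-repository daily rollup behind the dashboard
    -- charts, upserting so repeated runs accumulate counts.
    INSERT INTO daily_github_commits (repo_name, day, commit_count)
    SELECT e.payload->'repo'->>'name',
           e.created_at::date,
           count(*)
    FROM github_events e,
         jsonb_array_elements(e.payload->'payload'->'commits') AS c
    WHERE e.payload->>'type' = 'PushEvent'
      AND e.created_at >= now() - interval '1 minute'
    GROUP BY 1, 2
    ON CONFLICT (repo_name, day)
    DO UPDATE SET commit_count = daily_github_commits.commit_count
                               + EXCLUDED.commit_count;
END;
$$;
```

Because both INSERT ... SELECT statements run inside one function call, Citus can execute them in parallel across the shards while keeping the whole operation a single distributed transaction, which is the behavior the demo describes.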
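
A dashboard query like the one mentioned in the demo (commits over the last 30 days for a single repository) would then be a cheap single-shard lookup against the rollup table. The table and repository names below are illustrative assumptions:

```sql
-- Commits per day for one repository over the last 30 days, reading
-- from the hypothetical daily_github_commits rollup table.
SELECT day, commit_count
FROM daily_github_commits
WHERE repo_name = 'citusdata/citus'   -- placeholder repository
  AND day > now()::date - 30
ORDER BY day;
```

Since repo_name is the distribution column, this query touches exactly one shard, which is what makes the millisecond-level response times under tens of thousands of concurrent queries plausible.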
