Intro to Graph Databases Episode #4 – (RDBMS+SQL) to (Graphs+Cypher)

Hello. My name is Ryan Boyd, and this is your Intro to Graph Databases series. This is episode four, covering the steps to success with a graph database, plus how to move your most connected data from a relational database to a graph database.

Now, there are three basic steps to using Neo4j, and the first step is creating your model. You create your model in advance: your labels, your nodes, what your properties are going to be, what your relationships are going to be – you create all of that in advance.
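As a concrete sketch, a tiny model of the kind described – labels, properties, and a relationship – might be created in Cypher like this. The Person/University labels and their properties are illustrative assumptions, not taken from the video:

```cypher
// A minimal model sketch: two labeled nodes with properties,
// connected by a relationship. All names here are assumptions.
CREATE (p:Person {name: 'Anna', hairColor: 'brown'})
CREATE (u:University {name: 'Princeton'})
CREATE (p)-[:ATTENDED]->(u)
```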
Even though Neo4j is schema-optional, and you can add and remove these things on the fly later, you want to create the model in advance for your next step: loading your data.

With Neo4j, you load your data using the world's most common file format – a CSV file. This is super easy to do, and it's very quick, because if you have your data in a relational database, it's usually easy to export that data as a CSV file. The command-line tools for Neo4j then allow you to load data into Neo4j at millions of nodes and relationships per second. So this is a really easy process, even though later – once your application is live – you're going to be loading your data through your application directly.

Now, the third step to using Neo4j is querying it. That's why you're using a graph database, after all: to be able to run queries in an efficient way – in an intuitive way – across your data. So how do you query your data with Neo4j? Well,
Neo4j comes with a built-in web application that allows you to query your data and see
the results. It looks like this. This web application allows you to see your results
either in a graph form or in a tabular form, allows you to store your favorite queries
and otherwise explore your data. Now this is a fantastic tool, but it is meant for developers
who are working on your application and trying to decide what queries to encode in your application.
That is because Neo4j isn't really an analytical database. Neo4j is intended to be your ACID-compliant transactional database, storing your most critical business data. So how are you going to query it in most cases? In most cases, your developers are going to write code, and that code is going to query Neo4j servers using Neo4j's APIs.

Now, how can you write that code? Well, there are a variety of language drivers – in JavaScript, in Java, in Ruby, in .NET, in Python, in PHP. All of these language drivers allow you to write code that queries Neo4j. Many of these drivers are written by Neo4j, and some of the others are written and contributed to by the Neo4j community. The community has contributed a wide variety of other drivers as well, for up-and-coming languages like Haskell and Go.

But what if you need to squeeze every last ounce of performance out of your Neo4j queries – if you want to shave off those last couple of milliseconds? Well, you can do that by writing native server-side extensions for Neo4j in Java. With your native server-side extensions, you are hard-coding how to traverse the graph for your queries, rather than letting the Cypher query optimizer decide how to traverse the graph for you. Oftentimes – because of your intimate knowledge of how your graph works – this allows you to shave a few extra milliseconds off your response times.

Now, we believe in the concept of polyglot
persistence. This means that your application will access the database that best suits the type of data you need to access. If you have tabular data of columns and rows, it might be best stored in a relational database. But if you have data that's truly relational – truly connected – you're going to store that data in your graph database. There are other types of data storage as well: key-value stores, document stores, and the other options we discussed in previous episodes of this series. And with Neo4j, we believe that you should have those other databases sitting alongside Neo4j as part of your application's architecture. So your application can access any of these types of databases, and can also access Neo4j. Then, separately, you have your data scientists, who also access these databases and your bulk analytic infrastructure in order to determine what queries should be coded into your application to make real-time decisions.

Now, moving from a relational database
to graph, you have a couple of different options. We have some customers whose data is all highly connected – they're performing tons of joins in their SQL queries today, which shows that their data is highly connected. So they move all of that data over into a graph database, to take advantage of the fast performance and the intuitive querying of that connected data. We have other customers who migrate just a subset of their data. They look at the data that's highly connected and move that over to Neo4j, while keeping the tabular or key-value-type data in other databases. Then their application decides, on a query-by-query basis, where the data is located, and queries it in the appropriate spot. Of course, because your application has the ability to decide where to query your data, you can also keep all of your data in sync between your relational database and your graph database, and let your application decide, on a query-by-query basis, the best place to query your data, given that it exists in both places.

So, now let's talk about how to move from relational to graph.
And first of all, why might you want to move from relational to graph? This is a quote from David Meza – and I love this quote, because David happens to be the Chief Knowledge Architect at NASA, which we like to think is full of tons of smart people. And he says, “I love Neo4j because I can explore relationships faster than you can say SQL join.”

Before we get into this, I just want to say that you're going to see some handwritten drawings, some tables, some handwritten text. And I hired this young gentleman to write that text. So if you can't read it – if it looks like a young gentleman wrote that text – well, he did. So don't blame me, please.

Now, a day in the life of a relational database developer.
A relational database developer's ideal table looks something like this. We have people, the country they're from, their hair color, and the university they attended – and all of this data is in a table that's really easy to read. It'd be easy to onboard new people onto the team, because they would easily understand what's going on here. But it's hard to maintain consistency when you have your data in a form like this. You can see that we use UK multiple times, and that we reference Princeton multiple times. What happens if the name of one of these countries – or, more likely, one of the universities – changes? Well, we want to maintain consistency. So we create another table, with the names of each of the countries. This allows us to populate some additional information about those countries, such as the leader, all while maintaining consistency. But you can see that we're using these auto-generated numerical primary keys, which we then reference as foreign keys in our other table. And this table becomes a little bit harder to read. When you do this with multiple tables – you do this with university as well – the tables become very difficult to read.

Now, this is just a developer trying to read one table, looking at the raw table in the database. Of course, if you're doing this for real, you're also going to have views, and you're going to have SQL queries you can execute – all just so you can read and understand your data. And then you're going to have to produce an ER diagram that shows how everything interrelates, and you're going to use this ER diagram to onboard new members of your team. And then you're going to write SQL queries, and those SQL queries are going to have your join statements.
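A join query of the kind described might look like this in SQL – a sketch over the example person/country/university tables, where all table and column names are assumptions:

```sql
-- Hypothetical schema based on the example tables above;
-- table and column names are assumptions.
SELECT p.name, c.name AS country, u.name AS university
FROM person         AS p
JOIN country        AS c ON p.country_id    = c.id
JOIN university     AS u ON p.university_id = u.id
WHERE u.name = 'Princeton';
```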
Now, this isn't too bad when we just have these three tables. But a lot of our customers – and a lot of people in the community I've spoken to – have 20, 30, 40 different tables that they're joining together, and their ER diagram includes just as many tables. Then it's very difficult to communicate how the database works when you onboard someone new, either as a DBA or as an application developer. And most importantly, because these queries are so complex and hard to write, you have your relational database developer. She is thinking about these joins all day long, all night long. They're even keeping her up at night, making her less effective at her job.

Now, not only is she less effective, but with this many joins, your database is also less performant, because your database relies upon searching all of the data. You can make this a little bit better by creating indexes – which you certainly should, and this prevents full table scans – but you create an index on every single primary key.
And that means every time you're doing a join operation, you are looking at a different index. Now imagine if you didn't just have to look through one drawer in a card catalog, but had to look through 20, 30, or 40 different drawers to perform a query. That would be painful. It's painful for the database as well, and it results in reduced performance as your data size increases.

So to deal with the reduced performance and enable faster querying, oftentimes what happens next is you denormalize your data. You take your data back out of its normalized form and create a denormalized form that allows you to query your data more quickly. But again, this has the problem of inconsistencies. It also has a problem of write-time performance.
You have to update every single row if you update the name of that country or the name of that university, so this becomes difficult to maintain.

So with SQL, developers often find it hard to create their queries and model their relationships. Performance degrades as the data size increases, because of all of those joins and all of those index lookups we talked about, and the queries get really long and complex for the same reason. Maintenance becomes very painful, because you don't know how all of your data fits together. And if you have to add new tables to join statements that are already 20 or 30 or 40 joins long, that is very challenging.

But what if we converted our data into something like this? What if
we converted our data into a graph – a graph that can be easily read using the English language? In this case, you can see that a person went to a university, that the university is located in a state, and that the university is led by a president or leader. All very easy to read. And the queries are actually very easy to read in Cypher as well. If you remember, Cypher is our declarative query language – similar to SQL, but optimized for graphs, and optimized for readability. You can see here that we're matching a person – now, that P is just an alias. So we're saying: match a person – call it P – who went to a university – call it U. That person lives in a country, that university is led by a leader, and that university is located in a state whose abbreviation is Connecticut's. Then return the various columns, similar to how you would return your data in a relational database. This is pretty awesome, and much easier to read and understand than a bunch of join statements.

Now let's talk about how you convert your data from
a relational model into a graph model. And this is pretty simple, and I’m going to show
you the common steps that you use to convert your data. Well first of all, we have our
ER diagram. And if you’re like me and many other database developers out there that work
with relational databases, just seeing this ER diagram kind of makes you sad. Because
you have to look throughout it and see where all the primary keys and foreign keys are,
and understand that before building your application. Now this ER diagram has things like customers
and orders, and what products are in those orders, and salespeople and what territories
they cover. Now, you may recognize this dataset as the Northwind dataset. Northwind is the
canonical relational database example. It was originally created by Microsoft and distributed
as demo data with SQL server. But we’re going to show you how to convert Northwind to a
graph database, and hopefully make some of your people – who were sad when looking at
an ER diagram – happy when looking at the graph model. And hopefully, also show you
that Northwind is much better as a graph than it is in a relational database. So we start
off with our ER diagram. First, we locate our foreign keys, eliminate them, and replace them with relationships. Next, we locate our JOIN tables, and you can see here that we have two. One JOIN table joins our orders with the items in each order, and the other joins our employees with the territories they represent. The one representing the relationship between employees and territories is a really simple JOIN table, and a simple JOIN table just becomes a relationship. But the other JOIN table – the one showing which items are in an order, what quantity was ordered, what the price was, and things like that – is a little more complex. We convert that attributed JOIN table to a relationship that has a set of properties as well, because, as you remember, with Neo4j not only can your nodes have properties, but your relationships can have properties too.

Today we're going to talk about a subset of your data here, and we're converting that subset into a graph. You can see that this graph is already much easier to read than the ER diagram we were looking at before. It's easy to understand what the data is and how the data works, and thus easy to onboard new members of your team.
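As a sketch, converting that attributed JOIN table might look like this in Cypher, with the former join-table columns becoming relationship properties. The labels, relationship type, and property names here are assumptions based on typical Northwind naming:

```cypher
// The order-items JOIN table becomes a CONTAINS relationship
// whose properties hold the former join-table columns.
// All names and values are illustrative assumptions.
MATCH (o:Order {orderID: 10248}), (p:Product {productName: 'Chai'})
CREATE (o)-[:CONTAINS {quantity: 12, unitPrice: 14.00}]->(p)
```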
So I think I’ve shown you that graphs make it much easier to model and query your relationships.
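Reconstructed from the narration above, the Cypher query reads roughly as follows; the label, relationship-type, and property names are assumptions:

```cypher
// Roughly the query narrated earlier; all names are assumptions.
MATCH (p:Person)-[:WENT_TO]->(u:University),
      (p)-[:LIVES_IN]->(c:Country),
      (u)-[:LED_BY]->(l:Leader),
      (u)-[:LOCATED_IN]->(s:State {abbreviation: 'CT'})
RETURN p.name, u.name, c.name, l.name, s.name
```

Note that the `LED_BY` pattern is written here as narrated; a commenter below argues it should start from the country node rather than the university.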
Your performance overall increases versus a relational database, because as the size of your data grows in a graph database – in particular in Neo4j – the query times for querying subsets of your graph do not degrade. This is because Neo4j uses what's called index-free adjacency. What that means is that as you're traversing the graph – as you're moving from one node to another node over a relationship – it's super, super quick. It's essentially just doing pointer arithmetic, either in memory or on disk, instead of doing index lookups. And this gives you an important performance advantage, in addition to it being more intuitive to model your data and query your data. Your queries are much smaller and much more understandable. And being able to add additional properties and relationships on the fly makes it much more agile: no more schema migrations, which all of you relational database developers and DBAs are probably sick of having keep you up at night as well.

So next up, we're going to show you how to query your graph in Neo4j, using the Cypher declarative query language. Thank you very much, and I hope you have a fantastic day.

Danny Hutson

29 thoughts on “Intro to Graph Databases Episode #4 – (RDBMS+SQL) to (Graphs+Cypher)”

  1. Looking at multiple indexes is not painful. The computer does it. They don't feel pain. Personification not good idea.

  2. Hey Ryan,
    This tutorial was simple, awesome and comprehensive. At the end though, it says "Next Up: Querying the Graph".
    The usual interval between the videos in the series has been 3-4 days, but the next one seems to be taking a lot of time.
    Although the documentation and other resources on the official site are great, I am eagerly awaiting the rest of the series.

  3. "This person is sad (no specific reason why)"

    "This person is now happy (no specific reason why)"

    It's so obvious why you should use a graph database with such sound logic! No downsides at all, I'm sure!

  4. "better" than RDBMS seems to reference facts not in evidence. When comparing the SQL and GRAPH queries they were very similar. The joins were present in the GRAPH example but were essentially implied from the schema. Something an RDBMS could implement given proper/complete relationships. I'm sure GRAPH QL is just fine for some problems but there are so many more concerns when implementing OLAP vs OLTP systems.

  5. I hope you are not really as dumb as you pretend to be and do not have problems with joining, because it is soo complex and not readable and you have to make queries over 30 tables at once.

  6. Don't trust anything officially from neo4j. They're just trying to sell you.
    watch videos from actual developers who mention pros AND cons

  7. He looks like a genius talking to a child with simple and small words. He is so holding himself not to go faster 🙂
    If you recognize this at the beginning of the video, you can never watch it as the same way 😛

  8. Hey folks. Just published episode 5, introducing Cypher. Sorry for the long delay. The last episode was published 3 days before my daughter was born. Lots of sleepless nights doesn’t make for high quality single-handed video recording! Cheers! -Ryan

  9. Intro to Neo4J – here's 17 minutes telling you what's wrong with RDBMS. I'm nearly an hour into trying to find out about Neo4J *and I still hardly know anything about it*, "intuitive and fast" yeah, everyone says that.
    I'm not surprised you struggle to read diagrams like the one at 13:00, it's not a proper ER diagram, the keys aren't marked and there's no cardinality annotations. That's an F on any course.
    Northwind was first distributed with Access 1.0 btw.

  10. 12:13 – the query has an error; maybe someone else mentioned it, but:

    (u)-[:LED_BY]->(l:Leader)
    …should have been…
    (c)-[:LED_BY]->(l:Leader)

    since u doesn't have that kind of relationship (LED_BY).

  11. These six simple tutorials are much better for the core understanding of neo4j than hundreds of technical manuals. Thanks for the job well done.
