Today the Syndio dev team took a field trip to the Chicago stop of the Neo4j Graph Days tour. Since we pride ourselves in our ability to elicit information from networks, using a graph database such as
Neo4j for our future data storage & modeling needs seems to be an inevitability. Here are a few takeaways from the conference…
Simplification of the Data Model
Currently we store our data in
PostgreSQL (a RDB). Naturally, our data model involves a bunch of join tables and intermediaries to keep track of network relationships. Join tables and foreign keys are a necessary evil when attempting to store data that is inherently graph-like within the constraints of an RDB.
Neo4j simplifies the modeling of graph-like data significantly by allowing the DBA to define node types and edge types between those nodes. Schemas are much more flexible, and changes do not need to be formally added via explicit migrations like in RDB land. This is actually a pretty awesome feature, and it compliments the quickly changing requirements of a startup landscape nicely.
Based on reports from customers who made the switch, complex queries that previously took minutes, once modeled and executed in
Neo4j, took milliseconds. The
Neo4j marketing team really hooked in to this sentiment, coining the phrase minutes to milliseconds and peppering it throughout their presentation material.
Network Metrics in Real Time
We pre-calculate most of our network metrics via an offline analytics package, which results in a fairly static experience for the end user. Users want to see what adding a new connection to their network will do to overall scores! They want to create teams of people, and get estimations of their capabilities as a group! This type of real-time analysis is not possible under the current constraints of our system.
Fortunately, many of the network metrics that we pre-calculate are mere queries to the database once data has been modeled in a
Neo4j instance! The example we saw during the Graph Day training session was for shortest path:
Neo4j offers a built in function that can be used to return the shortest path between two nodes in a graph with minimal delay. Functions such as these will reduce the complexity of our analytics routine, and will allow users to see the impact of hypothetical changes to their network in a dynamic way.
The Cost of Speed
Sure minutes to milliseconds sounds great, but at what cost? With
Neo4j that cost is Memory, and apparently it can be quite a debilitating constraint for large datasets. Based on my limited understanding, the way
Neo4j is able to query so quickly is by holding a copy of the graph in memory. For complex graphs with lots of nodes, relationships, and properties, this can mean a lot of memory usage! A Senior Developer from the Neo team even went so far as to suggest that if query performance really starts to slow down… just fire the worst developer on your team and use his salary to buy more RAM!
Memory concerns aside, we work with network data, and using an RDB to store and model it is starting to show its cracks…I guarantee we have a few of those minutes to milliseconds type queries in our app that could benefit from some of Neo4j’s performance. Queries that return network metrics will allow us to simplify the calculations in our analytics routine, and will likely allow us to delete entire chunks of the analytics entirely.
Off the shelf graph databases like
Neo4j didn’t exist when Google, Facebook, and LinkedIn were building their products…so they each wound up rolling their own. Now, LinkedIn is using
Neo4j to rebuild their platform in China! If it’s good enough for enterprise tech companies, I’m sure it will be sufficient for Syndio’s purposes.