Virtualization promises to provide all the benefits of a knowledge graph without requiring data engineers to move data permanently into graph storage. It’s a popular feature of modern knowledge graph platforms, and new companies seem to be popping up promising a “semantic layer” with a virtualization strategy.
Virtualization is essentially query rewriting: the original query in, say, SPARQL is rewritten into the query language of a connected source database so that data can be temporarily acquired and processed in RDF format. Some platforms do fancy work with caching and other tricks to make this more efficient, but underneath there is still a query retrieving data from wherever it’s stored.
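To make the idea concrete, here is a rough sketch of what that rewriting can look like. The query, the ex: vocabulary, and the customers/orders tables are all hypothetical, and the actual SQL a platform generates depends on its mapping configuration (for example, an R2RML mapping); this only shows the shape of the translation.

```sparql
PREFIX ex:  <http://example.org/schema/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Hypothetical question asked of the knowledge graph:
# "customers and the totals of their orders placed since the start of 2024"
SELECT ?customerName ?orderTotal
WHERE {
  ?customer a ex:Customer ;
            ex:name ?customerName ;
            ex:placedOrder ?order .
  ?order ex:orderDate ?date ;
         ex:total ?orderTotal .
  FILTER (?date >= "2024-01-01"^^xsd:date)
}

# A virtualization layer might rewrite this into SQL roughly like:
#
#   SELECT c.name, o.total
#   FROM customers c
#   JOIN orders o ON o.customer_id = c.id
#   WHERE o.order_date >= '2024-01-01';
#
# The rows that come back are converted to RDF terms on the fly and
# matched against the graph pattern above.
```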
With this feature you can side-step building a new data pipeline to ETL data into yet-another-format, as well as avoid additional storage costs. (Almost no one seems to consider simply replacing a relational database with a graph database, which is a pity, but that’s a different story.) Of course—like any technology solution—virtualization is not a silver bullet. There are limitations and downsides that must be considered.
Enterprises frequently ask about whether and when to virtualize or to materialize data in a graph format. The answer to this question is always unattractive: “It depends.” Here are some considerations to keep in mind:
- How much data?
Graph databases work on pattern matching and will attempt to return all data that matches the requested pattern. Therefore, they need to evaluate all possible data against the pattern, which means loading lots of data into memory and processing it in one big block. The more data that could possibly match the pattern, the longer the processing time. There are several strategies to mitigate this, such as constraining the pattern itself (sketched after this list), but the underlying fundamentals always apply.
- What is the nature of the data?
Relational databases are famous for extremely quick processing of transactional data; graph databases are infamously less efficient at these sorts of calculations. If you’re processing lots of transactional data that updates frequently – point-of-sale data, sensor data, inventory data, etc. – you probably want to store it in a relational database and let that database perform the running calculations. The graph can then look for patterns at a point in time, rather than continually adding new statements and deleting existing statements that are no longer true.
- What are the underlying databases where data is stored?
Knowledge graph platforms can generally rewrite SPARQL queries into several other query languages, most notably SQL, but also XQuery and various NoSQL query languages. Think about what you are going to ask of those underlying databases and whether they will be up to the challenge. Does your knowledge graph platform have an efficient way of querying the data in its underlying format, or will it need to be extended with custom programming?
- What do you want to do with the data?
This is the big one. Complex queries that process lots of data with many joins are going to be much more efficient if the data is already in graph format (RDF). Frequent calls back to the storage layer for data that is constantly updating are a poor use case for virtualization. Then again, there are many scenarios where accuracy or completeness is more important than speed, so if processing time is not a factor (e.g., a report that runs overnight), virtualization may be an effective option.
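As a sketch of the mitigation mentioned under “How much data?”, the query below constrains the graph pattern up front so a virtualization layer has a chance to push the restrictions down into the generated SQL instead of pulling everything back for pattern matching. The vocabulary, store identifier, and limits are hypothetical.

```sparql
PREFIX ex:  <http://example.org/schema/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Constrain the pattern: a specific store, a narrow date range, and a cap
# on results. A well-behaved virtualization layer can translate these
# restrictions into the WHERE clause of the SQL it generates, so the
# source database returns thousands of rows instead of millions.
SELECT ?product ?quantity
WHERE {
  ?sale a ex:Sale ;
        ex:store    ex:Store_42 ;   # hypothetical named store
        ex:soldOn   ?date ;
        ex:product  ?product ;
        ex:quantity ?quantity .
  FILTER (?date >= "2024-06-01"^^xsd:date && ?date < "2024-07-01"^^xsd:date)
}
LIMIT 10000
```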
Remember, when virtualizing you’re employing two data layers at run time—not one. So, performance of “virtual” queries depends on the performance characteristics of both layers. If the purpose of your knowledge graph is to perform queries that your SQL database doesn’t do well, keep in mind that you’re probably still asking the underlying SQL database to do something it’s not good at. You may also be asking the underlying database for a lot of data at once for the graph database to process for pattern matching. If the memory or processing power allocated to that underlying database is not adequate, it will affect performance of the graph layer.
Some enterprises choose to employ a hybrid strategy: they process the underlying data for, say, conditional categorization, and store the results as metadata permanently in the graph layer. This works well when the conditionality is consistent and permanent. That is, if we are assigning some uniquely identified “thing” to a Class based on some conditions, it’s best if those conditions, once met, don’t change. Otherwise, you must re-evaluate the data to determine whether those conditions still apply, which limits the value of storing “X is a Y” statements in the first place.
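As an illustration of that hybrid pattern, the update below materializes a classification in the graph once a condition has been met. The class, property, and threshold are hypothetical; the point is simply that the “X is a Y” statement is worth storing only if the underlying condition is stable.

```sparql
PREFIX ex: <http://example.org/schema/>

# One-time (or scheduled) SPARQL Update: evaluate the condition against
# the underlying data, then store the resulting classification as a
# permanent statement in the graph layer.
INSERT { ?customer a ex:HighValueCustomer }
WHERE {
  ?customer a ex:Customer ;
            ex:lifetimeSpend ?spend .
  FILTER (?spend >= 100000)
}
```

If a value like ex:lifetimeSpend could later drop back below the threshold, the materialized statement would have to be re-evaluated, which is exactly the caveat described above.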
Virtualization can be a great way to add a semantic processing capability without a lot of new data pre-processing. It works best under certain conditions and for certain purposes, but not under every condition or for every purpose. You can follow some of the rules of thumb I’ve outlined here, but the best approach is to test your data with your graph database in real-world scenarios, considering speed, accuracy, and long-term maintenance of your data infrastructure.