Hierarchical clustering for property graph schema discovery
Abstract
The property graph model is becoming increasingly popular among users and is currently employed by several open-source and commercial graph database systems. Although property graphs are widely adopted, there is a lack of understanding of their underlying schema structure. In particular, the schema discovery problem consists of extracting the schema concepts from a property graph. A property graph schema provides a concise description of the data it represents, making the data more digestible for humans and interactive processes and usable for query optimization. In this paper, we address the property graph schema discovery problem and introduce GMMSchema, a method based on hierarchical clustering using a Gaussian Mixture Model that accounts for both label and property information on nodes. We experimentally analyze the accuracy and performance of GMMSchema against its closest competitor and show that it outperforms this competitor on several commonly used datasets, including real-world ones such as the Covid19 knowledge graph and the Fib25 and Mb6 NeuPrint graphs.
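To make the schema discovery problem concrete, the following is a minimal Python sketch of a naive baseline: it groups nodes by their exact combination of label set and property keys. This is not GMMSchema, whose clustering procedure is not detailed in this abstract. The function name `node_patterns` and the toy graph are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def node_patterns(nodes):
    """Group node ids by their exact (label set, property key set) pattern.

    `nodes` maps a node id to a pair (labels, properties), where `labels`
    is an iterable of strings and `properties` is a dict.
    """
    groups = defaultdict(list)
    for node_id, (labels, props) in nodes.items():
        # Iterating a dict yields its keys, so only property names are kept.
        pattern = (frozenset(labels), frozenset(props))
        groups[pattern].append(node_id)
    return dict(groups)


# Hypothetical toy graph: node id -> (labels, property dict).
toy_nodes = {
    1: ({"Person"}, {"name": "Ada", "age": 36}),
    2: ({"Person"}, {"name": "Alan", "age": 41}),
    3: ({"Person"}, {"name": "Grace"}),  # the "age" property is missing
    4: ({"City"}, {"name": "Paris", "population": 2_100_000}),
}

patterns = node_patterns(toy_nodes)
for (labels, keys), ids in patterns.items():
    print(sorted(labels), sorted(keys), ids)
```

Exact-match grouping places node 3 in a different pattern from the other `Person` nodes only because one property is missing. This kind of fragmentation suggests why a softer, clustering-based grouping over label and property information is attractive.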
Origin: Publisher files allowed on an open archive