Efficient Frequent Subtree Mining Beyond Forests
A common paradigm in distance-based learning is to embed the instance space into a feature space equipped with a metric and define the dissimilarity between instances by the distance of their images in the feature space. Frequent connected subgraphs are sometimes used to define such feature spaces if the instances are graphs, but identifying the set of frequent connected subgraphs and subsequently computing embeddings for graph instances is computationally intractable. As a result, existing frequent subgraph mining algorithms either restrict the structural complexity of the instance graphs or require exponential delay between the output of subsequent patterns, meaning that distance-based learners lack an efficient way to operate on arbitrary graph data.
This book presents a mining system that gives up the demand on the completeness of the pattern set, and instead guarantees a polynomial delay between subsequent patterns. To complement this, efficient methods devised to compute the embedding of arbitrary graphs into the Hamming space spanned by the pattern set are described. As a result, a system is proposed that allows the efficient application of distance-based learning methods to arbitrary graph databases.
In addition to an introduction and conclusion, the book is divided into chapters covering: preliminaries; related work; probabilistic frequent subtrees; boosted probabilistic frequent subtrees; and fast computation, with a further two chapters on Hamiltonian path for cactus graphs and Poisson binomial distribution.