Matthias Gallé, Matias Tealdi
21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10-13 August, 2015.
We analyze the problem of reconstructing documents when
we only have access to the n-grams and their counts from
the original documents and a xed n. Formally, we are in-
terested in recovering the longest contiguous substrings of
whose presence in the original documents we are certain.
We map this problem on a de Bruijn graph, where the n-
grams form the edges and where every Eulerian cycles gives
a plausible reconstruction.
We de ne two rules that reduce this graph in such a way
of preserving all possible reconstruction, while at the same
time increasing the length of the edge labels. From a theo-
retical perspective we prove that the iterative application of
these rules gives an irreducible graph equivalent to the orig-
inal one. We then apply this on the data from the Guten-
berg project to measure the number and size of the obtained
longest substrings. Moreoever, we analyze how the n-gram
corpus could be noised to prevent reconstruction, showing
empirically that removing low frequent n-grams has little
impact. Instead, we propose another method consisting
in adding strategically ctitious n-grams and show that a
noised corpus like that is much harder to reconstruct while
increasing only little the perplexity of a language model obtained through it.
Report number: