5

I have to make a Java program which finds all repeating sub-strings of length n in a given String. The input is string is extremely long and a brute-force approach takes too much time.

I alread tried:
Presently I am finding each sub-string separately and checking for repetitions of that sub-string using the KMP alogrithm. This too is taking too much time.

What is a more efficient approach for this problem?

closed as too broad by VMAtm, Maroun, SMA, default locale, CRABOLO Jan 27 '15 at 5:08

Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. Avoid asking multiple distinct questions at once. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.

  • Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it. – Eliyahu Jan 4 '15 at 10:40
  • Not sure why this question is voted off as "too broad" - there is a specific question at hand, and @Program_Dude also provided what he has tried and why it failed. – amit Jan 4 '15 at 10:40
  • @Eliyahu He did. – amit Jan 4 '15 at 10:40
  • @Amit what he did? he asked for some code? So it is off-topic – Eliyahu Jan 4 '15 at 10:42
  • 1
    @Eliyahu It is completely on-topic. He described a problem, described a solution, explained why the suggested solution fails, and asked for an alternative. More "advanced" problem than "How to parse a string?" is perfectly on topic. – amit Jan 4 '15 at 10:43
3

1) You should look at using a suffix tree data structure.

Suffix Tree

This data structure can be built in O(N * log N) time
(I think even in O(N) time using Ukkonen's algorithm)
where N is the size/length of the input string.
It then allows for solving many (otherwise) difficult
tasks in O(M) time where M is the size/length of the pattern.

So even though I didn't try your particular problem, I am pretty sure that
if you use a suffix tree and a smart formulation of your problem, then the
problem can be solved by using a suffix tree (in reasonable O time).

2) A very good book on these (and related) subjects is this one:

Algorithms on Strings, Trees and Sequences

It's not really easy to read though unless you're well-trained in algorithms.
But OK, reading such things is the only way to get well-trained ;)

3) I suggest you have a quick look at this algorithm too.

Aho-Corasick Algorithm

Even though, I am not sure but... this one might be somewhat
off-topic with respect to your particular problem.

  • the answer is quite useful. Thanks. – Rohith K Jan 4 '15 at 11:06
2

I am going to take @peter.petrov's suggestion and enhance it by explaining how can one actually use a suffix tree to solve the problem:

 1. Create a suffix tree from the string, let it be `T`.
 2. Find all nodes of depth `n` in the tree, let that set of nodes be `S`. This can be done using DFS, for example.
 3. For each node `n` in `S`, do the following:
     3.1. Do a DFS, and count the number of terminals `n` leads to. Let this number be `count`
     3.2. If `count>1`, yield the substring that is related to `n` (the path from root to `n`), and `count`

Note that this algorithm treats any substring of length n and add it to the set S, and from there it search for how many times this was actually a substring by counting the number of terminals this substring leads to.

This means that the complexity of the problem is O(Creation + Traversal) - meaning, you first create the tree and then you traverse it (easy to see you don't traverse in steps 2-3 each node in the tree more than once). Since the traversal is obviously "faster" than creation of the tree - it leaves you with O(Creation), which as was pointed by @perer.petrov is O(|S|) or O(|S|log|S|) depending on your algorithm of choice.

Not the answer you're looking for? Browse other questions tagged or ask your own question.