Find repeating substring of Length N [closed]

Question

I have to make a Java program which finds all repeating sub-strings of length n in a given String. The input is string is extremely long and a brute-force approach takes too much time.

I alread tried:
Presently I am finding each sub-string separately and checking for repetitions of that sub-string using the KMP alogrithm. This too is taking too much time.

What is a more efficient approach for this problem?

Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it. — Eliyahu, Jan 4 '15 at 10:40
Not sure why this question is voted off as "too broad" - there is a specific question at hand, and @Program_Dude also provided what he has tried and why it failed. — amit, Jan 4 '15 at 10:40
@Amit what he did? he asked for some code? So it is off-topic — Eliyahu, Jan 4 '15 at 10:42
@Eliyahu It is completely on-topic. He described a problem, described a solution, explained why the suggested solution fails, and asked for an alternative. More "advanced" problem than "How to parse a string?" is perfectly on topic. — amit, Jan 4 '15 at 10:43

peter.petrov · Accepted Answer · 2015-01-04 11:01:28Z

1) You should look at using a suffix tree data structure.

Suffix Tree

This data structure can be built in O(N * log N) time
(I think even in O(N) time using Ukkonen's algorithm)
where N is the size/length of the input string.
It then allows for solving many (otherwise) difficult
tasks in O(M) time where M is the size/length of the pattern.

So even though I didn't try your particular problem, I am pretty sure that
if you use a suffix tree and a smart formulation of your problem, then the
problem can be solved by using a suffix tree (in reasonable O time).

2) A very good book on these (and related) subjects is this one:

Algorithms on Strings, Trees and Sequences

It's not really easy to read though unless you're well-trained in algorithms.
But OK, reading such things is the only way to get well-trained ;)

3) I suggest you have a quick look at this algorithm too.

Aho-Corasick Algorithm

Even though, I am not sure but... this one might be somewhat
off-topic with respect to your particular problem.

the answer is quite useful. Thanks. – Rohith K Jan 4 '15 at 11:06 — Rohith K, Jan 4 '15 at 11:06

peter.petrov · Accepted Answer · 2015-01-04 12:45:36Z

I am going to take @peter.petrov's suggestion and enhance it by explaining how can one actually use a suffix tree to solve the problem:

 1. Create a suffix tree from the string, let it be `T`.
 2. Find all nodes of depth `n` in the tree, let that set of nodes be `S`. This can be done using DFS, for example.
 3. For each node `n` in `S`, do the following:
     3.1. Do a DFS, and count the number of terminals `n` leads to. Let this number be `count`
     3.2. If `count>1`, yield the substring that is related to `n` (the path from root to `n`), and `count`

Note that this algorithm treats any substring of length n and add it to the set S, and from there it search for how many times this was actually a substring by counting the number of terminals this substring leads to.

This means that the complexity of the problem is O(Creation + Traversal) - meaning, you first create the tree and then you traverse it (easy to see you don't traverse in steps 2-3 each node in the tree more than once). Since the traversal is obviously "faster" than creation of the tree - it leaves you with O(Creation), which as was pointed by @perer.petrov is O(|S|) or O(|S|log|S|) depending on your algorithm of choice.

asked	4 years, 4 months ago
viewed	946 times
active	4 years, 4 months ago

current community

your communities

more stack exchange communities

Find repeating substring of Length N [closed]

closed as too broad by VMAtm, Maroun, SMA, default locale, CRABOLO Jan 27 '15 at 5:08

2 Answers 2

Not the answer you're looking for? Browse other questions tagged java algorithm substring or ask your own question.

Linked

Hot Network Questions

closed as too broad by VMAtm, Maroun, SMA, default locale, CRABOLO Jan 27 '15 at 5:08

2 Answers 2

Not the answer you're looking for? Browse other questions tagged java algorithm substring or ask your own question.

Linked

Related