MS Office Forum / Word / Programming / March 2007
KillDuplicateParas
|
|
Thread rating:  |
Greg Maxey - 09 Mar 2007 18:04 GMT I came across an old post in Google groups for deleted duplicated lines of text in a document.
It used a For x = Count method to go through and check the range of one paragraph to the the range of every other paragraph and delete any duplicates.
It had two If ... End If blocks. The first check the para range length. If = then the second performed a text comparison. I assume the author thought that it would save time by doing a text comparison only on paras of equal length.
The procedure worked as advertised, however with a longer document it took a long time.
I created about 800 paragraphs and determined that it was actually much quicker to bypass the the first length check and just do a range comparison on every paragraph. Down from 200 seconds to 75 seconds!
Next I remembered an method that Jezebel showed me for stepping through items using the .Next (property or method I am never sure which). I adapted the code as follows and the time taken was down to 3 seconds!
Anyway, I just wanted to share this with the group:
Sub KillDuplicateParagraphs() Dim SBar As Boolean Dim TrkStatus As Boolean Dim eTime As Single Dim oParRef As Paragraph Dim oParChk As Paragraph eTime = Timer With ActiveDocument TrkStatus = .TrackRevisions .TrackRevisions = False End With With Application SBar = .DisplayStatusBar .DisplayStatusBar = True .ScreenUpdating = False End With Set oParRef = ActiveDocument.Range.Paragraphs(1) Set oParChk = oParRef.Next Do '*** Stet out first if block to delete duplicated empty paragraphs. If Len(oParRef.Range.Text) > 1 Then Do 'An empty last paragraph may throw an error on the last loop. On Error GoTo Err_Exit If oParRef.Range = oParChk.Range Then oParChk.Range.Delete Else Set oParChk = oParChk.Next End If Loop Until oParChk Is Nothing End If '*** Set oParRef = oParRef.Next 'Skip errors. On Error Resume Next Set oParChk = oParRef.Next On Error GoTo 0 Loop Until oParRef Is Nothing Err_Exit: With Application .StatusBar = False .DisplayStatusBar = SBar .ScreenUpdating = True End With ActiveDocument.TrackRevisions = TrkStatus MsgBox "Finished. Elapsed time: " & (Timer - eTime + 86400) Mod 86400 & " seconds." End Sub
Greg Maxey - 09 Mar 2007 18:24 GMT Oops. I forgot to mention the method used one of the For loops to update the status bar.
Add: 'Application.StatusBar = ActiveDocument.Paragraphs.Count & " paragraphs to check. "
Just after the On Error GoTo 0 line.
Also all my test where done with only two or three different paragraphs repeated many times. As the number of different paragraphs increases (in any method I suppose) the time to process will increase.
> I came across an old post in Google groups for deleted duplicated > lines of text in a document. [quoted text clipped - 70 lines] > & " seconds." > End Sub Helmut Weber - 09 Mar 2007 19:23 GMT Hi Submariner,
how about this one:
Sub Makro6x() Dim t As Single t = Timer Dim prg1 As Paragraph Dim prg2 As Paragraph For Each prg1 In ActiveDocument.Range.Paragraphs For Each prg2 In ActiveDocument.Range.Paragraphs If prg1.Range.Text = prg2.Range.Text Then If prg1.Range.start <> prg2.Range.start Then prg2.Range.Delete End If End If Next Next MsgBox Timer - t End Sub
800 paragraphs of kind rand(1,10) 104 pages 3.4 seconds 10 paragraphs left over.
 Signature Greetings from Bavaria, Germany
Helmut Weber, MVP WordVBA
Win XP, Office 2003 "red.sys" & Chr$(64) & "t-online.de"
Greg Maxey - 09 Mar 2007 20:40 GMT Helmet,
That is clear code and fast, but not as fast as 2.714 seconds (maybe my processors is faster) ;-)
Here is the real difference. Make each of those 800 paragraphs slighgly different:
Sub ScratchMacro() Dim i As Long For i = 1 To ActiveDocument.Paragraphs.Count ActiveDocument.Range.Paragraphs(i).Range.Characters.First = i Next i End Sub
Then delete the numbers from the last two. We now have 799 different paragraphs and 2 duplicates. Run both codes again:
Your version: 123 seconds My version: 69 seconds
Both are lightening fast when all the paragraphs are duplicates, because that first loop only runs once.
I think the speed efficiency in the .Next method is due to the processor doesn't have to a keep track of the paragraph count. ???
Cheers
> Hi Submariner, > [quoted text clipped - 29 lines] > Win XP, Office 2003 > "red.sys" & Chr$(64) & "t-online.de" Helmut Weber - 10 Mar 2007 08:47 GMT Hi Greg,
I see now that you are talking about the case that a paragraph is immediatly followed by a duplicate, whereas my code was meant to remove duplicate paragraphs wherever they appear.
For removing empty paragraphs from the doc's end, I use this code:
While ActiveDocument.Characters.Last.Previous = Chr(13) ActiveDocument.Characters.Last.Delete Wend
Cheers
 Signature Greetings from Bavaria, Germany
Helmut Weber, MVP WordVBA
Win XP, Office 2003 "red.sys" & Chr$(64) & "t-online.de"
macropod - 10 Mar 2007 10:21 GMT Hi Greg,
Looks like a derivative of something I developed & posted. My full version is:
Dim SBar As Boolean ' Status Bar flag Dim TrkStatus As Boolean ' Track Changes flag
Sub KillDuplicateParas() Call MacroEntry Dim i As Long, j As Long Dim eTime As Single eTime = Timer With ActiveDocument If .Paragraphs.Count > 1 Then ' Loop backwards to preserve paragraph count & indexing. ' Start at 2nd-last paragraph. For i = .Paragraphs.Count - 1 To 1 Step -1 ' Ignore empty paragraphs If Len(.Paragraphs(i).Range.Text) > 1 Then ' Loop backwards to preserve paragraph count & indexing. ' Stop at last preceding paragraph. For j = .Paragraphs.Count To i + 1 Step -1 ' Report progress on Status Bar. Application.StatusBar = i & " paragraphs to check. " ' No point in checking paragraphs of unequal length. If Len(.Paragraphs(i).Range) = Len(.Paragraphs(j).Range) Then ' Test strings of paragraphs of equal length. If .Paragraphs(i).Range = .Paragraphs(j).Range Then ' Delete duplicate paragraph. .Paragraphs(j).Range.Delete End If End If Next End If Next End If End With ' Report time taken. Elapsed time calculation allows for execution to extend past midnight. MsgBox "Finished. Elapsed time: " & (Timer - eTime + 86400) Mod 86400 & " seconds." Call MacroExit End Sub
Private Sub MacroEntry() ' Store current Status Bar status, then switch on SBar = Application.DisplayStatusBar Application.DisplayStatusBar = True ' Store current Track Changes status, then switch off With ActiveDocument TrkStatus = .TrackRevisions .TrackRevisions = False End With ' Turn Off Screen Updating Application.ScreenUpdating = False End Sub
Private Sub MacroExit() ' Clear the Status Bar Application.StatusBar = False ' Restore original Status Bar status Application.DisplayStatusBar = SBar ' Restore original Track Changes status ActiveDocument.TrackRevisions = TrkStatus ' Restore Screen Updating Application.ScreenUpdating = True End Sub
I think they key difference is that my code checks all paras against each other, whereas yours only checks adjacent paras. I might incorporate Helmut's revisions, though, since they seem to speed things up a bit.
Cheers
 Signature macropod [MVP - Microsoft Word] -------------------------
>I came across an old post in Google groups for deleted duplicated > lines of text in a document. [quoted text clipped - 70 lines] > & " seconds." > End Sub Helmut Weber - 10 Mar 2007 12:10 GMT Hi,
hmm... maybe it is the structure of the data, but this takes minutes or maybe will run endlessly, though I can't see a reason for that:
Sub KillDuplicateParas() Dim i As Long, j As Long Dim eTime As Single eTime = Timer With ActiveDocument For i = .Paragraphs.Count - 1 To 1 Step -1 For j = .Paragraphs.Count To i + 1 Step -1 If Len(.Paragraphs(i).Range) = Len(.Paragraphs(j).Range) Then If .Paragraphs(i).Range = .Paragraphs(j).Range Then .Paragraphs(j).Range.Delete End If End If Next Next End With MsgBox Timer - eTime
End Sub
800 Paragraphs, 4 unique paragraphs. paragraph length about 460 characters.
 Signature Greetings from Bavaria, Germany
Helmut Weber, MVP WordVBA
Win XP, Office 2003 "red.sys" & Chr$(64) & "t-online.de"
Helmut Weber - 10 Mar 2007 12:51 GMT Hi,
this one takes 1.15 seconds, here and now, under the above mentioned conditions:
Sub Makro6xx() Dim prg1 As Paragraph Dim t As Single t = Timer For Each prg1 In ActiveDocument.Range.Paragraphs If Not prg1.Next Is Nothing Then If prg1.Range.Text = prg1.Next.Range.Text Then prg1.Next.Range.Delete End If End If Next MsgBox Timer - t End Sub
However, IMHO, like other alternative solutions, except from comparing each paragraph to each other paragraph, see above, too, it does not take care of the fact that deleting one of two immediatly adjacent paragraphs may result again in two adjacent equal paragraphs.
Happy thinking!
 Signature Greetings from Bavaria, Germany
Helmut Weber, MVP WordVBA
Win XP, Office 2003 "red.sys" & Chr$(64) & "t-online.de"
Greg Maxey - 10 Mar 2007 14:45 GMT Hi macropod.
It is a derivative of your code and there is a point of checking paragraphs of unequal length because it is faster than first checking all paragraphs to see if they are of equal length ;-).
< I think they key difference is that my code checks all paras against each other, whereas yours only checks adjacent paras. I might <incorporate Helmut's revisions, though, since they seem to speed things up a bit.
I don't know your test results but the code I posted has nothing to do with adjacent pararapraphs. You can take those 799 unique paragraphs and add 10, 20 or a 100 duplicates anywhere in the mix, run the code and the duplicates are removed.
All that said, there appears to be more to this that I don't understand. At work yesterday with the The quick brown fox example Helmut gave, my code was 69 seconds compared to Helmut's 123 seconds. Today at home with the much longer Word2007 =(rand) text the processing 800 paragraphs takes my method 272 seconds and Helmut's 236.
Perhaps there is no best way ;-)
 Signature Greg Maxey/Word MVP See: http://gregmaxey.mvps.org/word_tips.htm For some helpful tips using Word.
> Hi Greg, > [quoted text clipped - 146 lines] >> & " seconds." >> End Sub Klaus Linke - 11 Mar 2007 17:55 GMT I do that with a wildcard replacement...
Find what: ([!^13]@^13){2,} Replace with: \1
It's not perfect -- Say
abcd cd cd
would be replaced with
abcd
But usually, I just risk that. And it's fast <g>
Klaus
>I came across an old post in Google groups for deleted duplicated > lines of text in a document. [quoted text clipped - 70 lines] > & " seconds." > End Sub Greg Maxey - 12 Mar 2007 11:23 GMT Klaus,
No argument that that is the fastest method for eliminating "adjacent" ducplicate paragraphs. However run that code with:
The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog. The quick and extremely agile brown fox jumped over 2 the lazy dogs. The quick and extremely agile brown fox jumped over 3 the lazy dogs. The quick and extremely agile brown fox jumped over 4 the lazy dogs. The quick and extremely agile brown fox jumped over 5 the lazy dogs. The quick brown fox jumped over the lazy dog.
You are left with: The quick brown fox jumped over the lazy dog. The quick and extremely agile brown fox jumped over 2 lazy dogs. The quick and extremely agile brown fox jumped over 3 lazy dogs. The quick and extremely agile brown fox jumped over 4 lazy dogs. The quick and extremely agile brown fox jumped over 5 lazy dogs. The quick brown fox jumped over the lazy dog.
Where my desired result is: The quick brown fox jumped over the lazy dog. The quick and extremely agile brown fox jumped over 2 lazy dogs. The quick and extremely agile brown fox jumped over 3 lazy dogs. The quick and extremely agile brown fox jumped over 4 lazy dogs. The quick and extremely agile brown fox jumped over 5 lazy dogs.
> I do that with a wildcard replacement... > [quoted text clipped - 91 lines] > > - Show quoted text - Klaus Linke - 12 Mar 2007 19:45 GMT Hi Greg,
Hadn't noticed that. What you could do is autonumber the paragraphs, turn the numbering into hard text. Selection.Range.ListFormat.ConvertNumbersToText
Now each paragraph has a number followed by a tab followed by the old text.
Then sort by fields (with the tab as separator, sorting by the second field). Then remove the duplicates with a wildcard search: Find what: ([0-9]@^t)([!^13]@^13)([0-9]@^t)\2 Replace with: \1\2 (Repeat until nothing more is found)
Sort by the first field (numbers) to get back the old sequence.
Then remove the numbers and tabs with a wildcard search, Find what: (^13)[0-9]@^t Replace with: \1
Regards, Klaus
> Klaus, > [quoted text clipped - 120 lines] >> >> - Show quoted text - Greg Maxey - 12 Mar 2007 21:00 GMT Klaus,
The method looks plausble, but I couldn't work out the code :-(
I used
oRng.ListFormat.DefualtNumberFormat
to apply the numbering. This adss a "period" between the number and your the tab.
What did you have in mind when you said "autonumber?"
> Hi Greg, > [quoted text clipped - 146 lines] > > - Show quoted text - Klaus Linke - 12 Mar 2007 21:15 GMT > What did you have in mind when you said "autonumber?" Something like this: Dim myLT As ListTemplate Set myLT = ActiveDocument.ListTemplates.Add(OutlineNumbered:=False, Name:="Test") With myLT.ListLevels(1) .NumberFormat = "%1" .TrailingCharacter = wdTrailingTab .NumberStyle = wdListNumberStyleArabic .StartAt = 1 .LinkedStyle = "" End With Selection.WholeStory Selection.Range.ListFormat.ApplyListTemplate ListTemplate:=myLT
But it isn't as fast as I thought... see last reply.
Klaus
Klaus Linke - 12 Mar 2007 21:08 GMT Never mind :-(
It's actually slower than code (Makro6x) that's already been posted.
If there are thousands of paragraphs, it might pay off to get more sophisticated.
Say, to read the whole document into a string, operate on that, and then delete the duplicate paragraphs found. If there aren't many duplicate paragraphs, that should be faster:
Sub Makro6n() Dim t As Single t = Timer Dim prg1 As Paragraph Dim vText As Variant vText = ActiveDocument.Content.Text vText = Split(vText, vbCr) Dim i As Long, j As Long For i = LBound(vText) To UBound(vText) For j = LBound(vText) To UBound(vText) If vText(i) = vText(j) And i <> j Then vText(j) = "<delete>" & STR(j) End If Next j Next i MsgBox Timer - t For i = UBound(vText) To LBound(vText) Step -1 If vText(i) = "<delete>" & STR(i) Then ActiveDocument.Paragraphs(i + 1).Range.Delete End If Next i End Sub
Or to avoid the double loop (comparing each paragraph with every other, which takes an amount of time proportional to the square of the number of paragraphs), it might pay off to use an efficient sorting algorithm (which takes a time proportional to N log N, or even to N), then remove doubles (proportional to N), then sort back.
Klaus
Klaus Linke - 12 Mar 2007 21:47 GMT A bit better optimized... (Can't believe I looped all paragraphs in the inner loop <g>)
Klaus
Sub Makro6n2() Dim t As Single, t1 As Single t = Timer Dim prg1 As Paragraph Dim vText As Variant Dim delFrom As Long, delTo As Long vText = ActiveDocument.Content.Text vText = Split(vText, vbCr) Dim i As Long, j As Long For i = LBound(vText) To UBound(vText) For j = i + 1 To UBound(vText) If vText(i) = vText(j) Then vText(j) = STR(j) End If Next j Next i t1 = Timer - t delFrom = 0 For i = UBound(vText) To LBound(vText) Step -1 If vText(i) = STR(i) Then delFrom = i + 1 If delTo = 0 Then delTo = i + 1 End If Else If delTo <> 0 Then ActiveDocument.Range( _ ActiveDocument.Paragraphs(delFrom).Range.Start - 1, _ ActiveDocument.Paragraphs(delTo).Range.End - 1 _ ).Delete End If delFrom = 0 delTo = 0 End If Next i MsgBox Timer - t, , t1 End Sub
Greg Maxey - 12 Mar 2007 23:15 GMT Klaus,
Looks like you hit the home run. Your method is fast as lightening (less than a second) for all tests using 800 paragraphs. All the same leaving 1, All difference nothing deleted, and several tests with a mixed bag.
Excellent work!
Helmut take notice. A master is in our midst ;-)
 Signature Greg Maxey/Word MVP See: http://gregmaxey.mvps.org/word_tips.htm For some helpful tips using Word.
> A bit better optimized... (Can't believe I looped all paragraphs in > the inner loop <g>) [quoted text clipped - 38 lines] > MsgBox Timer - t, , t1 > End Sub Klaus Linke - 13 Mar 2007 00:16 GMT Thanks!! An error I built in:
Replace
ActiveDocument.Range( _ ActiveDocument.Paragraphs(delFrom).Range.Start -1, _ ActiveDocument.Paragraphs(delTo).Range.End - 1 _ ).Delete
with ActiveDocument.Range( _ ActiveDocument.Paragraphs(delFrom).Range.Start, _ ActiveDocument.Paragraphs(delTo).Range.End _ ).Delete
I thought deleting the last paragraph mark might cause problems (... it doesn't), and my "work-around" could change the paragraph style.
Klaus
Helmut Weber - 13 Mar 2007 13:39 GMT Hi all,
hmm...
Great job, Klaus!
Let me explain to my excuse, I assumed there might be some not mentioned bordering questions lurking behind, like: "Excellent, but now I want to enclude some formatting conditions".
Which is often the case with posters, of course other than Greg.
Regardles of formatting, regardless of Word at all, it might still be faster not to delete Word-ranges at all, but to process an array of strings, delimited by chr(13), representing the doc's content, and writing it back.
Don't let me be misunderstood, I'm just enjoying a good discussion.
 Signature Greetings from Bavaria, Germany
Helmut Weber, MVP WordVBA
Win XP, Office 2003 "red.sys" & Chr$(64) & "t-online.de"
Klaus Linke - 13 Mar 2007 14:45 GMT > "Excellent, but now I want to enclude some formatting conditions". > > Which is often the case with posters, of course other than Greg. Should be easy: Replace ActiveDocument.Range( _ ActiveDocument.Paragraphs(delFrom).Range.Start - 1, _ ActiveDocument.Paragraphs(delTo).Range.End - 1 _ ).Delete with ActiveDocument.Range( _ ActiveDocument.Paragraphs(delFrom).Range.Start - 1, _ ActiveDocument.Paragraphs(delTo).Range.End - 1 _ ).Font.Color = wdColorRed
You could also leave the .Delete, and turn on "Track Changes" before you run the macro, so you can review (accept/reject) the deletions later.
> it might still be faster not to delete Word-ranges at all, > but to process an array of strings, delimited by chr(13), > representing the doc's content, and writing it back. True... but I don't expect that to make a real big difference. An improvement for very large files (1000, 10.000 or more paragraphs) can be achieved by avoiding the nested loop, as I i mentioned earlier. You'd need an efficient sorting algorithm (can be done in a single loop, that is, in a time proportional to the number of paragraphs), and a way to remember the old order (say by using a two-dimensional array where every paragraph keeps its old index). After the paragraphs are sorted alphabetically, duplicates can be deleted (or marked) in a single loop. Then you restore the old order based on the index (...can also be done in a single loop).
Regards, Klaus
Greg Maxey - 13 Mar 2007 15:37 GMT As you two have moved this discussion to the deep end of the pool I am going to head for one of the deck chairs and passively observe :-)
> > "Excellent, but now I want to enclude some formatting conditions". > [quoted text clipped - 31 lines] > Regards, > Klaus Klaus Linke - 13 Mar 2007 16:52 GMT > As you two have moved this discussion to the deep end of the pool I am > going to head for one of the deck chairs and passively observe :-) ;-)
I'm not sure whether I understood Helmut correctly.
If "include some formatting conditions" means that you'd want to delete duplicates only if they have the same formatting, then my approach would not work... unless you tag the formatting in the string array. Say the paragraph style: <Heading 1>Some text <Normal>Some other text <Heading 2>Some text ...
Klaus
Helmut Weber - 13 Mar 2007 21:51 GMT Hi Klaus,
>I'm not sure whether I understood Helmut correctly. English: Ordinary posters often come up with additional requirements which could need a totally different approach to a problem. So I've tried to keep my solution open for later extensions. Your solution is similar to a tailor made suit, Greg's and mine may allow for later adjustments.
German: Normalen Postern fällt öfters später noch was zusätzliches ein, was einen ganz anderen Ansatz für ein Problem erfordern könnte. So habe ich versucht, meine Lösung offen zu halten für Erweiterungen. Deine Lösung ist einem Maßanzug ähnlich, Gregs und meine erlauben wohl eher Anpassungen.
Have a nice day.
Schönen Tag noch.
 Signature Greetings from Bavaria, Germany
Helmut Weber, MVP WordVBA
Win XP, Office 2003 "red.sys" & Chr$(64) & "t-online.de"
Klaus Linke - 14 Mar 2007 01:26 GMT Hi Helmut,
Sure!
If you're going for speed, you just about always are talking about trade-offs... "speed for memory", "speed for complexity" and so on. A simple bubble sort is slow... but it's also easier to write, and needs less memory, than a quick sort, or a radix sort.
In Word/VBA, it's often about avoiding calls to the Word Object Model. I don't think my macro is by necessity harder to adapt than yours or Greg's, but extensions (if they involve the object model) would likely wipe out most of its speed advantage.
Regards, Klaus
> Hi Klaus, > [quoted text clipped - 17 lines] > > Schönen Tag noch.
|
|
|