Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
Home
DiscussionsAccessExcelInfoPathOutlookPowerPointPublisherWord
DirectoryUser Groups
Related Topics
Outlook ExpressInternet ExplorerWindowsMS Server ProductsMore Topics ...

MS Office Forum / Word / Programming / March 2007

Tip: Looking for answers? Try searching our database.

KillDuplicateParas

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
Greg Maxey - 09 Mar 2007 18:04 GMT
I came across an old post in Google groups for deleted duplicated
lines of text in a document.

It used a For x = Count method to go through and check the range of
one paragraph to the the range of every other paragraph and delete any
duplicates.

It had two If ... End If blocks.  The first check the para range
length.  If = then the second performed a text comparison.  I assume
the author thought that it would save time by doing a text comparison
only on paras of equal length.

The procedure worked as advertised, however with a longer document it
took a long time.

I created about 800 paragraphs and determined that it was actually
much quicker to bypass the the first length check and just do a range
comparison on every paragraph.  Down from 200 seconds to 75 seconds!

Next I remembered an method that Jezebel showed me for stepping
through items using the .Next (property or method I am never sure
which).
I adapted the code as follows and the time taken was down to 3
seconds!

Anyway, I just wanted to share this with the group:

Sub KillDuplicateParagraphs()
Dim SBar As Boolean
Dim TrkStatus As Boolean
Dim eTime As Single
Dim oParRef As Paragraph
Dim oParChk As Paragraph
eTime = Timer
With ActiveDocument
 TrkStatus = .TrackRevisions
 .TrackRevisions = False
End With
With Application
 SBar = .DisplayStatusBar
 .DisplayStatusBar = True
 .ScreenUpdating = False
End With
Set oParRef = ActiveDocument.Range.Paragraphs(1)
Set oParChk = oParRef.Next
Do
 '*** Stet out first if block to delete duplicated empty paragraphs.
 If Len(oParRef.Range.Text) > 1 Then
   Do
     'An empty last paragraph may throw an error on the last loop.
     On Error GoTo Err_Exit
     If oParRef.Range = oParChk.Range Then
       oParChk.Range.Delete
     Else
       Set oParChk = oParChk.Next
     End If
   Loop Until oParChk Is Nothing
 End If '***
 Set oParRef = oParRef.Next
 'Skip errors.
 On Error Resume Next
 Set oParChk = oParRef.Next
 On Error GoTo 0
Loop Until oParRef Is Nothing
Err_Exit:
With Application
 .StatusBar = False
 .DisplayStatusBar = SBar
 .ScreenUpdating = True
End With
ActiveDocument.TrackRevisions = TrkStatus
MsgBox "Finished. Elapsed time: " & (Timer - eTime + 86400) Mod 86400
& " seconds."
End Sub
Greg Maxey - 09 Mar 2007 18:24 GMT
Oops.  I forgot to mention the method used one of the For loops to
update the status bar.

Add:
 'Application.StatusBar = ActiveDocument.Paragraphs.Count & "
paragraphs to check. "

Just after the On Error GoTo 0 line.

Also all my test where done with only two or three different
paragraphs repeated many times.  As the number of different paragraphs
increases (in any method I suppose) the time to process will increase.

> I came across an old post in Google groups for deleted duplicated
> lines of text in a document.
[quoted text clipped - 70 lines]
> & " seconds."
> End Sub
Helmut Weber - 09 Mar 2007 19:23 GMT
Hi Submariner,

how about this one:

Sub Makro6x()
Dim t As Single
t = Timer
Dim prg1 As Paragraph
Dim prg2 As Paragraph
For Each prg1 In ActiveDocument.Range.Paragraphs
  For Each prg2 In ActiveDocument.Range.Paragraphs
     If prg1.Range.Text = prg2.Range.Text Then
        If prg1.Range.start <> prg2.Range.start Then
           prg2.Range.Delete
        End If
     End If
  Next
Next
MsgBox Timer - t
End Sub

800 paragraphs of kind rand(1,10)
104 pages
3.4 seconds
10 paragraphs left over.

Signature

Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"

Greg Maxey - 09 Mar 2007 20:40 GMT
Helmet,

That is clear code and fast, but not as fast as 2.714 seconds (maybe
my processors is faster) ;-)

Here is the real difference.  Make each of those 800 paragraphs
slighgly different:

Sub ScratchMacro()
Dim i As Long
For i = 1 To ActiveDocument.Paragraphs.Count
 ActiveDocument.Range.Paragraphs(i).Range.Characters.First = i
Next i
End Sub

Then delete the numbers from the last two.  We now have 799 different
paragraphs and 2 duplicates.  Run both codes again:

Your version:  123 seconds
My version: 69 seconds

Both are lightening fast when all the paragraphs are duplicates,
because that first loop only runs  once.

I think the speed efficiency in the .Next method is due to the
processor doesn't have to a keep track of the paragraph count.  ???

Cheers

> Hi Submariner,
>
[quoted text clipped - 29 lines]
> Win XP, Office 2003
> "red.sys" & Chr$(64) & "t-online.de"
Helmut Weber - 10 Mar 2007 08:47 GMT
Hi Greg,

I see now that you are talking about the case
that a paragraph is immediatly followed by a duplicate,
whereas my code was meant to remove duplicate paragraphs
wherever they appear.

For removing empty paragraphs from the doc's end,
I use this code:

While ActiveDocument.Characters.Last.Previous = Chr(13)
  ActiveDocument.Characters.Last.Delete
Wend

Cheers

Signature

Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"

macropod - 10 Mar 2007 10:21 GMT
Hi Greg,

Looks like a derivative of something I developed & posted. My full version is:

Dim SBar As Boolean           ' Status Bar flag
Dim TrkStatus As Boolean      ' Track Changes flag

Sub KillDuplicateParas()
Call MacroEntry
Dim i As Long, j As Long
Dim eTime As Single
eTime = Timer
With ActiveDocument
   If .Paragraphs.Count > 1 Then
       ' Loop backwards to preserve paragraph count & indexing.
       ' Start at 2nd-last paragraph.
       For i = .Paragraphs.Count - 1 To 1 Step -1
           ' Ignore empty paragraphs
           If Len(.Paragraphs(i).Range.Text) > 1 Then
               ' Loop backwards to preserve paragraph count & indexing.
               ' Stop at last preceding paragraph.
               For j = .Paragraphs.Count To i + 1 Step -1
                   ' Report progress on Status Bar.
                   Application.StatusBar = i & " paragraphs to check. "
                   ' No point in checking paragraphs of unequal length.
                   If Len(.Paragraphs(i).Range) = Len(.Paragraphs(j).Range) Then
                       ' Test strings of paragraphs of equal length.
                       If .Paragraphs(i).Range = .Paragraphs(j).Range Then
                           ' Delete duplicate paragraph.
                           .Paragraphs(j).Range.Delete
                       End If
                   End If
               Next
           End If
       Next
   End If
End With
' Report time taken. Elapsed time calculation allows for execution to extend past midnight.
MsgBox "Finished. Elapsed time: " & (Timer - eTime + 86400) Mod 86400 & " seconds."
Call MacroExit
End Sub

Private Sub MacroEntry()
' Store current Status Bar status, then switch on
SBar = Application.DisplayStatusBar
Application.DisplayStatusBar = True
' Store current Track Changes status, then switch off
With ActiveDocument
   TrkStatus = .TrackRevisions
   .TrackRevisions = False
End With
' Turn Off Screen Updating
Application.ScreenUpdating = False
End Sub

Private Sub MacroExit()
' Clear the Status Bar
Application.StatusBar = False
' Restore original Status Bar status
Application.DisplayStatusBar = SBar
' Restore original Track Changes status
ActiveDocument.TrackRevisions = TrkStatus
' Restore Screen Updating
Application.ScreenUpdating = True
End Sub

I think they key difference is that my code checks all paras against each other, whereas yours only checks adjacent paras. I might
incorporate Helmut's revisions, though, since they seem to speed things up a bit.

Cheers

Signature

macropod
[MVP - Microsoft Word]
-------------------------

>I came across an old post in Google groups for deleted duplicated
> lines of text in a document.
[quoted text clipped - 70 lines]
> & " seconds."
> End Sub
Helmut Weber - 10 Mar 2007 12:10 GMT
Hi,

hmm...
maybe it is the structure of the data,
but this takes minutes or maybe will run endlessly,
though I can't see a reason for that:

Sub KillDuplicateParas()
Dim i As Long, j As Long
Dim eTime As Single
eTime = Timer
With ActiveDocument
  For i = .Paragraphs.Count - 1 To 1 Step -1
     For j = .Paragraphs.Count To i + 1 Step -1
        If Len(.Paragraphs(i).Range) = Len(.Paragraphs(j).Range) Then
           If .Paragraphs(i).Range = .Paragraphs(j).Range Then
              .Paragraphs(j).Range.Delete
           End If
        End If
     Next
  Next
End With
MsgBox Timer - eTime

End Sub

800 Paragraphs, 4 unique paragraphs.
paragraph length about 460 characters.

Signature

Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"

Helmut Weber - 10 Mar 2007 12:51 GMT
Hi,

this one takes 1.15 seconds, here and now,
under the above mentioned conditions:

Sub Makro6xx()
Dim prg1 As Paragraph
Dim t As Single
t = Timer
For Each prg1 In ActiveDocument.Range.Paragraphs
  If Not prg1.Next Is Nothing Then
     If prg1.Range.Text = prg1.Next.Range.Text Then
        prg1.Next.Range.Delete
     End If
  End If
Next
MsgBox Timer - t
End Sub

However, IMHO, like other alternative solutions,
except from comparing each paragraph to each other paragraph,
see above, too, it does not take care of the fact that deleting
one of two immediatly adjacent paragraphs may
result again in two adjacent equal paragraphs.

Happy thinking!

Signature

Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"

Greg Maxey - 10 Mar 2007 14:45 GMT
Hi macropod.

It is a derivative of your code and there is a point of checking paragraphs
of unequal  length because it is faster than first checking all paragraphs
to see if they are of equal length ;-).

< I think they key difference is that my code checks all paras against each
other, whereas yours only checks adjacent paras. I might
<incorporate Helmut's revisions, though, since they seem to speed things up
a bit.

I don't know your test results but the code I posted has nothing to do with
adjacent pararapraphs.  You can take those 799 unique paragraphs and add 10,
20 or a 100 duplicates anywhere in the mix, run the code and the duplicates
are removed.

All that said, there appears to be more to this that I don't understand.  At
work yesterday with the The quick brown  fox example Helmut gave, my code
was 69 seconds compared to Helmut's 123 seconds.  Today at home with the
much longer Word2007 =(rand) text the processing 800 paragraphs takes my
method 272 seconds and Helmut's 236.

Perhaps there is no best way ;-)

Signature

Greg Maxey/Word MVP
See:
http://gregmaxey.mvps.org/word_tips.htm
For some helpful tips using Word.

> Hi Greg,
>
[quoted text clipped - 146 lines]
>> & " seconds."
>> End Sub
Klaus Linke - 11 Mar 2007 17:55 GMT
I do that with a wildcard replacement...

Find what: ([!^13]@^13){2,}
Replace with: \1

It's not perfect -- Say

abcd
cd
cd

would be replaced with

abcd

But usually, I just risk that. And it's fast <g>

Klaus

>I came across an old post in Google groups for deleted duplicated
> lines of text in a document.
[quoted text clipped - 70 lines]
> & " seconds."
> End Sub
Greg Maxey - 12 Mar 2007 11:23 GMT
Klaus,

No argument that that is the fastest method for eliminating "adjacent"
ducplicate paragraphs.  However run that code with:

The quick brown fox jumped over the lazy dog.
The quick brown fox jumped over the lazy dog.
The quick brown fox jumped over the lazy dog.
The quick and extremely agile brown  fox jumped over 2 the lazy dogs.
The quick and extremely agile brown  fox jumped over 3 the lazy dogs.
The quick and extremely agile brown  fox jumped over 4 the lazy dogs.
The quick and extremely agile brown  fox jumped over 5 the lazy dogs.
The quick brown fox jumped over the lazy dog.

You are left with:
The quick brown fox jumped over the lazy dog.
The quick and extremely agile brown  fox jumped over 2 lazy dogs.
The quick and extremely agile brown  fox jumped over 3 lazy dogs.
The quick and extremely agile brown  fox jumped over 4 lazy dogs.
The quick and extremely agile brown  fox jumped over 5 lazy dogs.
The quick brown fox jumped over the lazy dog.

Where my desired result is:
The quick brown fox jumped over the lazy dog.
The quick and extremely agile brown  fox jumped over 2 lazy dogs.
The quick and extremely agile brown  fox jumped over 3 lazy dogs.
The quick and extremely agile brown  fox jumped over 4 lazy dogs.
The quick and extremely agile brown  fox jumped over 5 lazy dogs.

> I do that with a wildcard replacement...
>
[quoted text clipped - 91 lines]
>
> - Show quoted text -
Klaus Linke - 12 Mar 2007 19:45 GMT
Hi Greg,

Hadn't noticed that. What you could do is autonumber the paragraphs, turn
the numbering into hard text.
Selection.Range.ListFormat.ConvertNumbersToText

Now each paragraph has a number followed by a tab followed by the old text.

Then sort by fields (with the tab as separator, sorting by the second
field).
Then remove the duplicates with a wildcard search:
Find what: ([0-9]@^t)([!^13]@^13)([0-9]@^t)\2
Replace with: \1\2
(Repeat until nothing more is found)

Sort by the first field (numbers) to get back the old sequence.

Then remove the numbers and tabs with a wildcard search,
Find what: (^13)[0-9]@^t
Replace with: \1

Regards,
Klaus

> Klaus,
>
[quoted text clipped - 120 lines]
>>
>> - Show quoted text -
Greg Maxey - 12 Mar 2007 21:00 GMT
Klaus,

The method looks plausble, but I couldn't work out the code :-(

I used

 oRng.ListFormat.DefualtNumberFormat

to apply the numbering.  This adss a "period" between the number and
your the tab.

What did you have in mind when you said "autonumber?"

> Hi Greg,
>
[quoted text clipped - 146 lines]
>
> - Show quoted text -
Klaus Linke - 12 Mar 2007 21:15 GMT
> What did you have in mind when you said "autonumber?"

Something like this:
  Dim myLT As ListTemplate
  Set myLT = ActiveDocument.ListTemplates.Add(OutlineNumbered:=False,
Name:="Test")
  With myLT.ListLevels(1)
     .NumberFormat = "%1"
     .TrailingCharacter = wdTrailingTab
     .NumberStyle = wdListNumberStyleArabic
     .StartAt = 1
     .LinkedStyle = ""
  End With
  Selection.WholeStory
  Selection.Range.ListFormat.ApplyListTemplate ListTemplate:=myLT

But it isn't as fast as I thought... see last reply.

Klaus
Klaus Linke - 12 Mar 2007 21:08 GMT
Never mind :-(

It's actually slower than code (Makro6x) that's already been posted.

If there are thousands of paragraphs, it might pay off to get more
sophisticated.

Say, to read the whole document into a string, operate on that, and then
delete the duplicate paragraphs found.
If there aren't many duplicate paragraphs, that should be faster:

Sub Makro6n()
  Dim t As Single
  t = Timer
  Dim prg1 As Paragraph
  Dim vText As Variant
  vText = ActiveDocument.Content.Text
  vText = Split(vText, vbCr)
  Dim i As Long, j As Long
  For i = LBound(vText) To UBound(vText)
     For j = LBound(vText) To UBound(vText)
        If vText(i) = vText(j) And i <> j Then
           vText(j) = "<delete>" & STR(j)
        End If
     Next j
  Next i
  MsgBox Timer - t
  For i = UBound(vText) To LBound(vText) Step -1
     If vText(i) = "<delete>" & STR(i) Then
        ActiveDocument.Paragraphs(i + 1).Range.Delete
     End If
  Next i
End Sub

Or to avoid the double loop (comparing each paragraph with every other,
which takes an amount of time proportional to the square of the number of
paragraphs), it might pay off to use an efficient sorting algorithm (which
takes a time proportional to N log N, or even to N), then remove doubles
(proportional to N), then sort back.

Klaus
Klaus Linke - 12 Mar 2007 21:47 GMT
A bit better optimized... (Can't believe I looped all paragraphs in the
inner loop <g>)

Klaus

Sub Makro6n2()
  Dim t As Single, t1 As Single
  t = Timer
  Dim prg1 As Paragraph
  Dim vText As Variant
  Dim delFrom As Long, delTo As Long
  vText = ActiveDocument.Content.Text
  vText = Split(vText, vbCr)
  Dim i As Long, j As Long
  For i = LBound(vText) To UBound(vText)
     For j = i + 1 To UBound(vText)
        If vText(i) = vText(j) Then
           vText(j) = STR(j)
        End If
     Next j
  Next i
  t1 = Timer - t
  delFrom = 0
  For i = UBound(vText) To LBound(vText) Step -1
     If vText(i) = STR(i) Then
        delFrom = i + 1
        If delTo = 0 Then
           delTo = i + 1
        End If
     Else
        If delTo <> 0 Then
           ActiveDocument.Range( _
           ActiveDocument.Paragraphs(delFrom).Range.Start - 1, _
           ActiveDocument.Paragraphs(delTo).Range.End - 1 _
           ).Delete
        End If
        delFrom = 0
        delTo = 0
     End If
  Next i
  MsgBox Timer - t, , t1
End Sub
Greg Maxey - 12 Mar 2007 23:15 GMT
Klaus,

Looks like you hit the home run.  Your method is fast as lightening (less
than a second) for all tests using 800 paragraphs.  All the same leaving 1,
All difference nothing deleted, and several tests with a mixed bag.

Excellent work!

Helmut take notice.  A master is in our midst ;-)

Signature

Greg Maxey/Word MVP
See:
http://gregmaxey.mvps.org/word_tips.htm
For some helpful tips using Word.

> A bit better optimized... (Can't believe I looped all paragraphs in
> the inner loop <g>)
[quoted text clipped - 38 lines]
>   MsgBox Timer - t, , t1
> End Sub
Klaus Linke - 13 Mar 2007 00:16 GMT
Thanks!! An error I built in:

Replace

           ActiveDocument.Range( _
           ActiveDocument.Paragraphs(delFrom).Range.Start -1, _
           ActiveDocument.Paragraphs(delTo).Range.End - 1 _
           ).Delete

with
           ActiveDocument.Range( _
           ActiveDocument.Paragraphs(delFrom).Range.Start, _
           ActiveDocument.Paragraphs(delTo).Range.End _
           ).Delete

I thought deleting the last paragraph mark might cause problems (... it
doesn't), and my "work-around" could change the paragraph style.

Klaus
Helmut Weber - 13 Mar 2007 13:39 GMT
Hi all,

hmm...

Great job, Klaus!

Let me explain to my excuse,
I assumed there might be some not mentioned
bordering questions lurking behind, like:
"Excellent, but now I want to enclude some formatting conditions".

Which is often the case with posters, of course other than Greg.

Regardles of formatting, regardless of Word at all,
it might still be faster not to delete Word-ranges at all,
but to process an array of strings, delimited by chr(13),
representing the doc's content, and writing it back.

Don't let me be misunderstood,
I'm just enjoying a good discussion.

Signature

Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"

Klaus Linke - 13 Mar 2007 14:45 GMT
> "Excellent, but now I want to enclude some formatting conditions".
>
> Which is often the case with posters, of course other than Greg.

Should be easy: Replace
           ActiveDocument.Range( _
           ActiveDocument.Paragraphs(delFrom).Range.Start - 1, _
           ActiveDocument.Paragraphs(delTo).Range.End - 1 _
           ).Delete
with
           ActiveDocument.Range( _
           ActiveDocument.Paragraphs(delFrom).Range.Start - 1, _
           ActiveDocument.Paragraphs(delTo).Range.End - 1 _
           ).Font.Color = wdColorRed

You could also leave the .Delete, and turn on "Track Changes" before you run
the macro, so you can review (accept/reject) the deletions later.

> it might still be faster not to delete Word-ranges at all,
> but to process an array of strings, delimited by chr(13),
> representing the doc's content, and writing it back.

True... but I don't expect that to make a real big difference.
An improvement for very large files (1000, 10.000 or more paragraphs) can be
achieved by avoiding the nested loop, as I i mentioned earlier.
You'd need an efficient sorting algorithm (can be done in a single loop,
that is, in a time proportional to the number of paragraphs), and a way to
remember the old order (say by using a two-dimensional array where every
paragraph keeps its old index). After the paragraphs are sorted
alphabetically, duplicates can be deleted (or marked) in a single loop. Then
you restore the old order based on the index (...can also be done in a
single loop).

Regards,
Klaus
Greg Maxey - 13 Mar 2007 15:37 GMT
As you two have moved this discussion to the deep end of the pool I am
going to head for one of the deck chairs and passively observe :-)

> > "Excellent, but now I want to enclude some formatting conditions".
>
[quoted text clipped - 31 lines]
> Regards,
> Klaus
Klaus Linke - 13 Mar 2007 16:52 GMT
> As you two have moved this discussion to the deep end of the pool I am
> going to head for one of the deck chairs and passively observe :-)

;-)

I'm not sure whether I understood Helmut correctly.

If "include some formatting conditions" means that you'd want to delete
duplicates only if they have the same formatting, then my approach would not
work... unless you tag the formatting in the string array.
Say the paragraph style:
<Heading 1>Some text
<Normal>Some other text
<Heading 2>Some text
...

Klaus
Helmut Weber - 13 Mar 2007 21:51 GMT
Hi Klaus,

>I'm not sure whether I understood Helmut correctly.

English:
Ordinary posters often come up with additional requirements
which could need a totally different approach to a problem.
So I've tried to keep my solution open for later extensions.
Your solution is similar to a tailor made suit,
Greg's and mine may allow for later adjustments.

German:
Normalen Postern fällt öfters später noch was zusätzliches ein,
was einen ganz anderen Ansatz für ein Problem erfordern könnte.
So habe ich versucht, meine Lösung offen zu halten für Erweiterungen.
Deine Lösung ist einem Maßanzug ähnlich,
Gregs und meine erlauben wohl eher Anpassungen.  

Have a nice day.

Schönen Tag noch.

Signature

Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"

Klaus Linke - 14 Mar 2007 01:26 GMT
Hi Helmut,

Sure!

If you're going for speed, you just about always are talking about
trade-offs... "speed for memory", "speed for complexity" and so on.
A simple bubble sort is slow... but it's also easier to write, and needs
less memory, than a quick sort, or a radix sort.

In Word/VBA, it's often about avoiding calls to the Word Object Model. I
don't think my macro is by necessity harder to adapt than yours or Greg's,
but extensions (if they involve the object model) would likely wipe out most
of its speed advantage.

Regards,
Klaus

> Hi Klaus,
>
[quoted text clipped - 17 lines]
>
> Schönen Tag noch.
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.