MS Office Forum / Word / Programming / July 2007
Fast way of comparing strings?
|
|
Thread rating:  |
Guillermo López-Anglada - 15 Jul 2007 23:13 GMT Hi,
I have the following code to compare strings:
Private Function GetNumOfMatchingWords(ArrayToCompare() As String, _ ReferenceString As String)
Dim nWord As Long Dim WordsFound As Long
For nWord = 0 To UBound(ArrayToCompare())
'palabra contenida en cadena... If InStr(ReferenceString, ArrayToCompare(nWord)) > 0 Then
'aumentamos cómputo de palabras contenidas... WordsFound = WordsFound + 1 End If 'siguiente palabra... Next
GetNumOfMatchingWords = WordsFound End Function
I basically convert a string into an "array of words" and then check whether each of these words is contained within another string. Then I perform the same operation the other way around in order to find the highest percentage of matching words between the two strings. This approach works for my purpose, but as I have thousands of strings to compare, it isn't very efficient. Could someone suggest a faster method?
The whole process involves reading a text file (in binary mode), splitting the read buffer into lines, then the strings into words and then calling this routine. I think this one is the bottleneck. I've tried the Like operator as well, but it isn't flexible enough.
Regards,
Guillermo
Klaus Linke - 16 Jul 2007 00:37 GMT Hi Guillermo,
Not sure...
Do you compare every string to every other string? Are the strings you compare and the reference strings from the same list (text file)?
Maybe you could put all words from all strings into a two-dimensional array: The word itself in the first column, and a number indicating which string it's from, "1", "2"..., in the second.
Use a (fast) sorting algorithm.
Now you have a list like Aardvark 7 Aberdeen 3 Aberdeen 5 Aberdeen 12 and 2 and 5 and 7 ...
Now you could build a two-dimensional integer matrix (both dimensions equal to the number of strings) by running through this list once, and incrementing every time a word appears in two strings: Aardvark: can be ignored (since it only appears in one sentence), Aberdeen: increment matrix(3,5), matrix(3,12), matrix(5,12) and: increment matrix(2,5), matrix(2,7), matrix(5,7) ...
When you're finished, matrix elements with a high number indicate a pair of sentences that has a lot of words in common.
Regards, Klaus
> Hi, > [quoted text clipped - 35 lines] > > Guillermo Tony Strazzeri - 16 Jul 2007 03:14 GMT On Jul 16, 8:13 am, Guillermo López-Anglada <guille...@deletethislopez- anglada.com> wrote:
> Hi, In addition to Klaus' reply you could look at another approach.
When I need to check an item against a large list I don't like iterating through the list to find if the item exists. I have developed an approach using either a listbox control or a collection object. Both of these allow you to add only unique items to their lists. If you try to add an item that already exists they generate an error. My technique involves testing for the error to determine if the item exists. I think (expect/hope) this may be faster than iterating an array but have not tested it.
Maybe you could use this method. I can't post example code at this time because I am not at a PC where I have the code but I will do so when I can.
Hope this helps.
Cheers TonyS.
> I have the following code to compare strings: > [quoted text clipped - 30 lines] > > Guillermo Tony Strazzeri - 17 Jul 2007 10:47 GMT Here is the explanation and sample code to use a collection to determine is an item exists in a list.
I wrote the notes below as aprt of my operations and coding practices. This is the first time they will have been read by someone else. I hope they make sense..
VBA does not have a method for quickly determining if a item is a member of a list. For example in an array of names does the array contain the name fred. Conversely determine if fred is not a member of the array of names. The usual way to determine this is to iterate through the entire array until the item is found (or not found). Although tedious this is a quick way to check a small number of items to see if they are part of the list. A good technique that I have developed is to use a collection to achieve this.
============================ Collection Object A Collection object is an ordered set of items that can be referred to as a unit.
Remarks The Collection object provides a convenient way to refer to a related group of items as a single object. The items, or members, in a collection need only be related by the fact that they exist in the collection. Members of a collection don't have to share the same data type. A collection can be created the same way other objects are created. For example: Dim objCollection As New Collection
Once a collection is created, members can be added using the Add method and removed using the Remove method. Specific members can be returned from the collection using the Item method, while the entire collection can be iterated using the For Each...Next statement. ===================================
When adding items you specify the item value and a key for the item. I usually use the same value for the key as for the item. If the item is a long string the may be wasteful of memory in which case alternatives should be sought Attempting to add an item with the same key as an existing item key generates a' runtime error 457'. The error can be trapped or simply ignored. To ignore the error simply wrap the add with an error trap. Eg.
On Error Resume Next objCollection.Add Item, key On Error GoTo 0
This will avoid adding the duplicate leaving the original item in the collection. If we want to update an existing value (where we don't know if the item is in the list we can use a similar technique but this time attempting to remove the item then adding the new item.
On error resume next objCollection.remove "key" On error goto 0 objCollection.Add Item, key
The following example will determine if the item is a member of the list. If the item exists the variable temp will contain its value.
temp = "" On Error Resume Next temp = objCollection.Item(key) On Error GoTo 0
Things to be aware of. · Item numbers and item count are dynamic. · If an item is removed all subsequent items will shift up and the item count will reduce by 1. · Items can be removed either by key or item number
Since you can have a collection of virtually anything you can address/ store multiple data properties of the item by storing an object in the collection Limitation You can't return an item's index without iterating the list.
METHOD USING A LISTBOX IN A USERFORM =================================== If programming in a userform we can use the properties of a ListBox (which can remain hidden) to more easily return the index of an item if it exists in the list. Although we can use the same technique as for a collection (above). The listbox has the advantage that it can easily return the item's index in the list. Using a listbox called lbxCTItem
With lbxCTItem .ListIndex = -1 'unselects any selected item On Error Resume Next lbxCTItem = str On Error GoTo 0 thisItem = .ListIndex End With
'If the item does not exist thisItem retains a value of -1 'Otherwise it returns the items index number
If thisItem < 0 Then Msgbox "The item does not exist in the list." End If
You can store additional data about the item as columns of the listbox. By changing the listcolumn number you can even find an item using that column. But beware of duplicates.
I Would be interested in hearing comments on this technique. Maybe some of the more savvy developers can comment/test on whether this is quicker than other methods. If not I will get around to it one day. Thill then I suspec that even with several hundred items in a list most methods will be equally effective in this environment (Document automation as distinct from number crunching).
Cheers TonyS.
Russ - 16 Jul 2007 07:34 GMT What your ultimate goal? Or is it to find the words or the number of words that are common to two lists? Or is it to find a total word count for each word? Or is it to find which group of words consistently appear more often in other groups of word? (Then use what I say below.)
In your code below you are counting the number of words that are in common with both. If you reverse the process, like you say you do between the same two lists, then you should come up with the same number of common words.
Looking at your code below, change 'If InStr' to 'While InStr'...<> 0 if you want to count every match in the string. Then if you reverse the process, you could come up with a different number of matches, if the lists are not unique and allow the same word to repeat.
> Hi, > [quoted text clipped - 38 lines] > > Guillermo
 Signature Russ
drsmN0SPAMikleAThotmailD0Tcom.INVALID
Russ - 16 Jul 2007 11:35 GMT Guillermo, Message below in text.
> What your ultimate goal? > Or is it to find the words or the number of words that are common to two [quoted text clipped - 12 lines] > you could come up with a different number of matches, if the lists are not > unique and allow the same word to repeat. You would do the While loop like this:
Private Function GetNumOfMatchingWords(ArrayToCompare() As String, _ ReferenceString As String)
Dim nWord As Long Dim WordsFound As Long
nWord = 1
nWord = InStr( nWord, ReferenceString, ArrayToCompare(nWord)) While nWord <> 0 'aumentamos cómputo de palabras contenidas... WordsFound = WordsFound + 1 nWord = InStr( nWord, ReferenceString, ArrayToCompare(nWord)) Wend
GetNumOfMatchingWords = WordsFound End Function
>> Hi, >> [quoted text clipped - 40 lines] >> >> Guillermo
 Signature Russ
drsmN0SPAMikleAThotmailD0Tcom.INVALID
Russ - 16 Jul 2007 11:50 GMT Guillermo, Use this instead. To avoid an endless loop, I have to increment nWord, too.
Private Function GetNumOfMatchingWords(ArrayToCompare() As String, _ ReferenceString As String)
Dim nWord As Long Dim WordsFound As Long
nWord = 1
nWord = InStr( nWord, ReferenceString, ArrayToCompare(nWord)) While nWord <> 0 nWord = nWord + 1 'aumentamos cómputo de palabras contenidas... WordsFound = WordsFound + 1 nWord = InStr( nWord, ReferenceString, ArrayToCompare(nWord)) Wend
GetNumOfMatchingWords = WordsFound End Function
> Guillermo, > Message below in text. [quoted text clipped - 80 lines] >>> >>> Guillermo
 Signature Russ
drsmN0SPAMikleAThotmailD0Tcom.INVALID
Russ - 16 Jul 2007 18:08 GMT Guillermo, I wasn't going through ArrayToCompare, so use this instead.
Private Function GetNumOfMatchingWords(ArrayToCompare() As String, _ ReferenceString As String) Dim iCounter As Long Dim nWord As Long Dim WordsFound As Long
iCounter = 1 For nWord = 0 To UBound(ArrayToCompare()) iCounter = InStr( iCounter, ReferenceString, ArrayToCompare(nWord)) While iCounter <> 0 'aumentamos cómputo de palabras contenidas... WordsFound = WordsFound + 1 iCounter = InStr( iCounter + 1, ReferenceString, _ ArrayToCompare(nWord)) Wend Next nWord GetNumOfMatchingWords = WordsFound End Function
> Guillermo, > Use this instead. To avoid an endless loop, I have to increment nWord, too. [quoted text clipped - 104 lines] >>>> >>>> Guillermo
 Signature Russ
drsmN0SPAMikleAThotmailD0Tcom.INVALID
Russ - 16 Jul 2007 18:54 GMT Guillermo, Sorry, I'm not testing this function. But as I look at it again, I see that I needed to move the iCounter inside the For loop.
Private Function GetNumOfMatchingWords(ArrayToCompare() As String, _ ReferenceString As String) Dim iCounter As Long Dim nWord As Long Dim WordsFound As Long
For nWord = 0 To UBound(ArrayToCompare()) iCounter = 1 iCounter = InStr( iCounter, ReferenceString, ArrayToCompare(nWord)) While iCounter <> 0 'aumentamos cómputo de palabras contenidas... WordsFound = WordsFound + 1 iCounter = InStr( iCounter + 1, ReferenceString, _ ArrayToCompare(nWord)) Wend Next nWord GetNumOfMatchingWords = WordsFound End Function
> Guillermo, > I wasn't going through ArrayToCompare, so use this instead. [quoted text clipped - 128 lines] >>>>> >>>>> Guillermo
 Signature Russ
drsmN0SPAMikleAThotmailD0Tcom.INVALID
|
|
|