Hey, Jezebel, you're awesome! Thanks for the speedy reply, and the nifty
solution. The search is easily a hundred times faster just from transferring
the wordlist to an array and then searching on that... But, now, I've got
another problem because loading the array is super slow. I know there must
be a nice way to cut a string up and assign it to an array, but I can't find
it in the help docs. Is there anything analogous to split() in VB which is
also cross-platform? On my Mac, VB doesn't seem to understand split().
You are right that it's ugly to have to weed through the punctuation and the
spacing when checking for words. Not knowing any better, I wrote a routine
to check each word and to ignore it if it's punctuation. It's messy (and
slow!), but it works. Is there a standard way to ignore punctuation and
spacing? In terms of inflectional endings and such, I combed the internet
for a while and found a cool site with a pretty good set of American English
dictionaries. The one I'm using had just about all the inflectional endings
and past participles I could think of -- and many more. Here is a link in
case anyone else is looking to do something with wordlists in the future...
http://wordlist.sourceforge.net/
Thanks for your help!
--Murphy
Jezebel - 20 Nov 2006 07:34 GMT
One approach to managing your wordlist is to store it in Excel rather than
Word. Then you can read the Excel column directly into an array --
' Early binding: needs a reference to the Microsoft Excel Object Library
Dim pxlApp As Excel.Application
Dim pxlBook As Excel.Workbook
Dim pxlSheet As Excel.Worksheet
Dim pData() As Variant
Set pxlApp = Excel.Application
Set pxlBook = pxlApp.Workbooks.Open(FileName:="c:\...\Book1.xls")
Set pxlSheet = pxlBook.Worksheets("Sheet1")
' .Value hands back the whole range as a Variant array in one call
pData = pxlSheet.Range("A1:A15000").Value
Note that pData always comes back as a two-dimensional array, even if one of
the dimensions has a range of one (eg in this case, pData(1, 1) to
pData(15000, 1)).
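If the second dimension gets in the way, something like the following rough
sketch copies the column into an ordinary one-dimensional string array. The
name FlattenColumn is made up for illustration, and it assumes every cell
holds text or is empty (a cell containing an Excel error value would make
CStr fail).

```vb
' Sketch only: FlattenColumn is a hypothetical helper, not a built-in.
' Copies column 1 of a two-dimensional Variant array (as returned by
' Range.Value) into a one-dimensional String array.
Sub FlattenColumn(pData As Variant, pWords() As String)
    Dim i As Long

    ' Keep the same bounds as the first dimension of the source
    ' (1 To 15000 for the range A1:A15000).
    ReDim pWords(LBound(pData, 1) To UBound(pData, 1))

    For i = LBound(pData, 1) To UBound(pData, 1)
        ' Empty cells become zero-length strings.
        pWords(i) = CStr(pData(i, 1))
    Next i
End Sub
```

You would call it as: Dim pWords() As String, then FlattenColumn pData, pWords.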
As for the inflected word forms: you could side-step a lot of the problems
if, instead of taking the words in your document and checking if they are in
your wordlist, you do it the other way round: iterate over your wordlist and
check whether each entry is in the document. Then, in many (but obviously not
all) cases, you need only look for word stems.
Eg, if the various forms of 'redundant' are proscribed, searching for
'redund' will get a match on 'redundant', 'redundancy', 'redundancies', etc.
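As a rough sketch of that reversed search in Word VBA: pull the document text
into a string once, then test each stem with InStr. The name ReportStems and
the pStems array are made up for illustration, and it assumes the stems are
already loaded into a one-dimensional string array.

```vb
' Sketch only: ReportStems and pStems are hypothetical names.
' Lists every stem that occurs anywhere in the document text.
Sub ReportStems(pDoc As Document, pStems() As String)
    Dim sText As String
    Dim i As Long

    ' Read the document text once rather than once per stem.
    sText = pDoc.Content.Text

    For i = LBound(pStems) To UBound(pStems)
        ' Skip blank entries: InStr reports a match for an empty string.
        If Len(pStems(i)) > 0 Then
            ' vbTextCompare makes the comparison case-insensitive.
            If InStr(1, sText, pStems(i), vbTextCompare) > 0 Then
                Debug.Print "Found: " & pStems(i)
            End If
        End If
    Next i
End Sub
```

Bear in mind this matches a stem anywhere, including the middle of a longer
word, so a short stem such as 'art' would also hit 'start'. It tells you
that a stem occurs, not where; for locations you would still need a second
pass over the hits.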
It might also be worth looking at some of the work that's been done on text
tagging and text mining (eg see
http://itre.cis.upenn.edu/~myl/languagelog/archives/003753.html and the
links therein). Academics are often very generous with their software; so
you might find that this task has already been dealt with.