MS Office Forum / Word / Programming / August 2007
How do I discover repeating text portions across text files?
paddys - 29 Aug 2007 08:32 GMT Suppose I have a specific number of text [dot doc] files of a specified size [say not more than 500 words], and have to discover if there are text portions [i.e., a set of words or phrases or clauses or entire sentences] 'repeating' across these files. In other words, it is a 'search' for files, from a given set of text files, containing 'repeating text portions' across themselves. The challenge is to discover them intelligently even without any pre-specified 'text portions'. A simple 'find' mechanism is very cumbersome and tedious, especially when you have to search a number of text files.
Helmut Weber - 29 Aug 2007 15:54 GMT Hi Paddys,
in my very humble opinion, there is no intelligent way, just a brute force approach.
If you want to know which text parts appear where, you have to set up a list of all text parts, and check all files for all items in the list.
That is, first collect all words, phrases, clauses, and sentences, without duplicates, in a list, and then go on searching.
I did something similar once for words. Google for "Corpus Linguistics".
One could, of course, when setting up the list, remember which file a new item comes from and exclude that file from the search for the item, but whether this exchange of simplicity for speed is worth the effort, I don't know.
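The brute-force list described above can be sketched in a few lines. This is only an illustration in Python, not Word VBA, and it rests on assumptions that are not in the post: a "text part" is taken to be any run of `n` consecutive words, the text has already been extracted from the .doc files, and the file names and sample sentences are hypothetical.

```python
import re
from collections import defaultdict

def ngrams(text, n):
    """Yield every run of n consecutive words in text, lowercased."""
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def repeated_portions(files, n=3):
    """Map each n-word portion to the set of files containing it,
    keeping only portions that occur in at least two files."""
    index = defaultdict(set)
    for name, text in files.items():
        for gram in ngrams(text, n):
            index[gram].add(name)  # remember which file each item comes from
    return {gram: names for gram, names in index.items() if len(names) > 1}

# Hypothetical sample data; real input would be text taken from the documents.
files = {
    "a.txt": "The cat sat on the mat. The dog sat on the cat.",
    "b.txt": "The rat ate the cat.",
    "c.txt": "A bird sat on the fence.",
}
for gram, names in sorted(repeated_portions(files, n=3).items()):
    print(gram, sorted(names))
```

Because every item records the files it came from, one pass over each file is enough and no separate search step is needed.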
 Signature Greetings from Bavaria, Germany
Helmut Weber, MVP WordVBA
Win XP, Office 2003 "red.sys" & Chr$(64) & "t-online.de"
Richard Relpht - 30 Aug 2007 12:24 GMT I would create a database table (or a plain old CSV file or a tab separated file if you have semicolons in the data!) with three fields, named 1. "PhraseClauseWord" 2. "FromFileName" - and - 3. "Counter"
VBScript can do that with a text file. And if you can handle VBA, then you can handle VBScript.
So.... Then open each file and scan through it once looking for phrases, output (append) the results to the database, entering, for each phrase, the phrase, the FromFileName and a counter set at 1.
Then do the same for clauses, then the same for words, appending the data to the database table. (I don't know how you would define a clause ...)
Do that for each file. When that's finished, you have everything in the database table.
Then you look at this data through an Excel pivot table. Using the external data option.
If File A contains : The cat sat on the mat. The dog sat on the cat.
and File B contains The rat ate the cat.
Then your table will look like this
PhraseWord                 File  Ctr
The                        A     1
cat                        A     1
sat                        A     1
on                         A     1
the                        A     1
mat                        A     1
The                        A     1
dog                        A     1
sat                        A     1
on                         A     1
the                        A     1
cat                        A     1
The                        B     1
rat                        B     1
ate                        B     1
the                        B     1
cat                        B     1
The rat ate the cat.       B     1
The cat sat on the mat.    A     1
The dog sat on the cat.    A     1
So your pivot table can look like this
File     PhraseWord                 Total
A        cat                        2
         dog                        1
         mat                        1
         on                         2
         sat                        2
         The                        4
         The cat sat on the mat.    1
         The dog sat on the cat.    1
Total A                             14
B        ate                        1
         cat                        1
         rat                        1
         The                        2
         The rat ate the cat.       1
Total B                             6
Total                               20
or this
PhraseWord                 A    B    Total
ate                             1    1
cat                        2    1    3
dog                        1         1
mat                        1         1
on                         2         2
rat                             1    1
sat                        2         2
The                        4    2    6
The rat ate the cat.            1    1
The cat sat on the mat.    1         1
The dog sat on the cat.    1         1
Total                      14   6    20
or this
PhraseWord                 A    B    Total
The                        4    2    6
cat                        2    1    3
on                         2         2
sat                        2         2
ate                             1    1
The rat ate the cat.            1    1
rat                             1    1
The cat sat on the mat.    1         1
mat                        1         1
dog                        1         1
The dog sat on the cat.    1         1
Total                      14   6    20
which is the same thing, only sorted from most to least, so that words that only appear once (i.e. unique occurrences) are at the bottom of the table.
etc, etc. This will probably be illegible once posted due to plain text hassles in newsgroups but if you want a private mail, just ask in the newsgroup.
HTH Richard.
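For readers who want to try the table-and-pivot idea without a database or Excel, here is a minimal Python sketch. It is not the VBScript or pivot-table setup described in the post. It assumes that words are lowercased before counting (the pivot tables above appear to group "The" and "the" together), that a sentence ends at '.', '!' or '?', and that clauses are left out because no definition is given. The function names are made up for illustration.

```python
import re
from collections import Counter, defaultdict

def rows_for(file_name, text):
    """Return (item, file, counter) rows: one per word, then one per sentence."""
    rows = [(w.lower(), file_name, 1) for w in re.findall(r"[A-Za-z']+", text)]
    rows += [(s.strip(), file_name, 1) for s in re.findall(r"[^.!?]+[.!?]", text)]
    return rows

def pivot(rows):
    """Sum the counters per item and per file, like a pivot table."""
    table = defaultdict(Counter)
    for item, file_name, count in rows:
        table[item][file_name] += count
    return table

# The two example files from the post.
files = {"A": "The cat sat on the mat. The dog sat on the cat.",
         "B": "The rat ate the cat."}
rows = [row for name, text in files.items() for row in rows_for(name, text)]
table = pivot(rows)

# List items from most to least frequent overall.
for item, per_file in sorted(table.items(), key=lambda kv: -sum(kv[1].values())):
    total = sum(per_file.values())
    print(f"{item:26} A={per_file['A']} B={per_file['B']} total={total}")
```

Items with a non-zero count in both columns are the portions that repeat across files.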
paddys - 31 Aug 2007 06:16 GMT thanks, richard. 1. the opening and checking of each file manually could be tedious and also be prone to errors and omissions; i have some ideas about doing the whole thing as a 'batch' process. pl. read my post to weber.
would welcome any help in coding custom solution to be made part of my Word program. thanks.
paddys.
> I would create a database table (or a plain old CSV file or a tab separated
> file if you have semicolons in the data!) with three fields, named
[quoted text clipped - 112 lines]
> HTH
> Richard.
paddys - 31 Aug 2007 06:10 GMT Thanks, Weber. I have an idea but am not sure how to create and validate the code, to be made part of a custom Word program or run through a command-line program. Please advise if and how it could work.
1. First, accept the directory or folder containing the text files; assume [or validate] that all of them are dot doc files.
2. Also prompt for and accept the initial search string, subject to limits, say 50 words; the default could be the number of words contained in the first two lines [identified by the end-of-line mark] of the first file. Store this in StringA.
3. Build an array or table of 3 dimensions, viz., name of file, number of words, and number of lines, by reading either from the files one by one, or by getting 'property values'; I have no clue as to how to do this part, but believe there must be a way. Call this Array1. Also create an empty Array2, of two dimensions, to contain the names of the 'original file' and the 'copy file'.
4. Now, begin a two-level nested loop over ALL files; beginning with StringA of file1, compare it with StringB [formed first as per the default value, and then replaced by the next set] of file2; if they match, then file2 contains at least one repeated portion; note this in Array2, and exclude file2 from further comparison. As in a typical sort procedure, the inner loop will compare every 'source string' of file1 with the 'target strings' of every other file [ranging from 2 ... n, or fewer depending on the matches that occurred], while the outer loop will build and complete Array2, containing the answers to my search.
5. I know the core code works in simple Basic or Visual FoxPro, but do not know how to embed it as a feature into Word.
Could you help, please? Thanks again. Paddys
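The nested-loop comparison in step 4 could look roughly like the following. This is a hedged Python sketch, not Word code and not a tested implementation of the plan. It assumes a fixed window of `size` words stands in for StringA and StringB, it records every matching pair rather than excluding a file after its first match, and the sample file names and texts are hypothetical.

```python
import re
from itertools import combinations

def windows(text, size):
    """Return the set of all runs of `size` consecutive words in text."""
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def find_copies(files, size=4):
    """Compare every pair of files; return (original, copy, shared run) triples.
    The list plays the role of Array2 in the plan above."""
    matches = []
    for (name_a, text_a), (name_b, text_b) in combinations(files.items(), 2):
        shared = windows(text_a, size) & windows(text_b, size)
        for run in sorted(shared):
            matches.append((name_a, name_b, " ".join(run)))
    return matches

# Hypothetical sample data standing in for the text of three documents.
files = {
    "file1": "Please return the signed form by Friday.",
    "file2": "You must return the signed form by post.",
    "file3": "Nothing in common here.",
}
for original, copy, run in find_copies(files):
    print(f"{original} / {copy}: {run}")
```

Comparing sets of word windows avoids re-reading file2 for every source string of file1, but the number of file pairs still grows quadratically with the number of files.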
> Hi Paddys,
> [quoted text clipped - 16 lines]
> but whether this exchange of simplicity for speed
> is worth the effort, I don't know.
Helmut Weber - 31 Aug 2007 12:13 GMT Hi Paddys,
sorry, but this is asking for too much at once. Split it all up into several questions and ask for help with each in turn in the groups.
>1. First, accept the directory or folder containing the text files; assume
>[or validate] all of them are dot doc files.
As a start, see:
http://word.mvps.org/faqs/macrosvba/BatchFR.htm
http://www.gmayor.com/batch_replace.htm
Also, tell us which Word version you are talking about. Furthermore,
... it would be rather unusual to process dot-files that way.
... what do you mean by "Accept the directory"? Probably getting the name of the directory into your code somehow. Is the program just for you? You could type the name of the directory in an input box. You could use one of several controls that allow you to pick a directory using the mouse. But you'll need a userform.
... What to do if not all files in a directory are doc-files?
To get a list of all docs in "c:\test\word" into a text-file, you may use in the command shell:
c:\test\word\>dir *.doc /b > c:\dir.txt
There are other ways, but it may be about doc-files organized in several subdirectories. Then ...
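As one of those other ways, the same listing can be produced from a script, including subdirectories. The following is a small Python sketch, not a Word or shell solution. `list_docs` is a made-up helper name, and whether `*.doc` also matches upper-case extensions depends on the operating system.

```python
from pathlib import Path

def list_docs(folder, recursive=False):
    """Return the sorted relative paths of .doc files in folder.
    With recursive=True, subdirectories are searched as well."""
    pattern = "**/*.doc" if recursive else "*.doc"
    return sorted(
        p.relative_to(folder).as_posix()
        for p in Path(folder).glob(pattern)
        if p.is_file()  # skip directories that happen to end in .doc
    )

# Example: list the .doc files in the current directory and below.
for name in list_docs(".", recursive=True):
    print(name)
```

Files with other extensions are simply not matched, which is one answer to the question of what to do when not all files in a directory are doc-files.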
Still, Word's definition of "word" and "sentence" is different from the fuzzy human concept of them.
"Clause" and "phrase" Word doesn't know at all.
Sorry, it ain't that easy.
...
 Signature Greetings from Bavaria, Germany
Helmut Weber, MVP WordVBA
Win XP, Office 2003 "red.sys" & Chr$(64) & "t-online.de"
Jeff Mathewson - 31 Aug 2007 15:05 GMT This wouldn't be too hard to do.
All you need to do is create an array of sentences, for example arySentences(Sentence, counter). From there, use a For Each loop to go through all the sentences (For Each oSen In ActiveDocument.Sentences). Add any new sentence to the array; for any duplicate sentence, increment its counter.
That's just the basics, but once you play around with it, it shouldn't be that hard. I have such a macro that goes one level up, by Words, and collects the vocabulary of the document(s). So it can be done.
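The sentence-counting idea can be illustrated outside Word as follows. This Python sketch is not the macro mentioned above. It assumes a naive rule that a sentence ends at '.', '!' or '?', which will not match Word's own Sentences collection, and it uses hypothetical sample texts.

```python
import re
from collections import Counter

def split_sentences(text):
    """Naive splitter: a sentence runs up to the next '.', '!' or '?'."""
    return [s.strip() for s in re.findall(r"[^.!?]+[.!?]", text)]

def count_sentences(documents):
    """Count how often each sentence occurs across all documents."""
    counts = Counter()
    for text in documents:
        counts.update(split_sentences(text))  # new sentence: added; duplicate: counter + 1
    return counts

# Hypothetical sample texts standing in for two documents.
documents = [
    "The cat sat on the mat. The dog barked.",
    "The dog barked. The rat ran.",
]
counts = count_sentences(documents)
duplicates = {s: n for s, n in counts.items() if n > 1}
print(duplicates)
```

A dictionary keyed by the sentence text replaces the two-column array, so looking up whether a sentence is new does not require scanning the whole list.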
 Signature ------------------------------------------------------------ Job-O-Magic.com: The Master List of All Job Search Sites www.jobomagic.com
> Suppose I have a specific number of text [dot doc] files of specified size
> [say not more than 500 words], and have to discover if there are text
[quoted text clipped - 5 lines]
> cumbersome and tedious, especially when you have to search a number of text
> files.