MS Office Forum / Word / Programming / January 2005

Word (.doc) -> Text (.txt) conversion

robster278 - 06 Jan 2005 00:56 GMT
I'm running a large site which receives many .doc files daily, and these
need to be converted into plain text. The purpose is to supply
data for the site's search engine, which then points users to the
appropriate .doc files. The raw text is also needed to populate HTML
preview pages for the .doc files.

My question is simple - what is the easiest (preferably server side)
method for converting .doc files into raw text? I'm running Windows
server so I presume that .doc API and script commands would be fairly
easy to implement. If a server side solution is impossible then a
locally executed method could fit the bill too. It just needs to be
QUICK and AUTOMATED. I'm not going to copy and paste from MS Word into
Notepad for 10 hours every day.
Any suggestions kindly requested.

Rob Ponting
rob_ponting@hotmail.com
Helmut Weber - 06 Jan 2005 13:24 GMT
Hi Rob,
probably the only way, and at the same time the easiest,
is to open a doc and save it as txt.

Though there are some problems to take care of,
like deciding whether you can afford to lose formatting.
And if the docs are rather complex,
the resulting txt-file will be a mess anyway, and
doing some editing by hand will be unavoidable.

All together, when it comes to long-term automation,
I think you would need a programming language like VB
or some other, and should not try to use Word as an
automation server.

With VB, I scan directories at regular intervals,
check whether there are any docs, start Word, process the docs,
save them as txt to some other place, and remove all processed
files from that directory. In theory, this could run endlessly.
Though, in fact, 4 weeks without any crash was the best I got so far,
and my docs are all very simple and very short.
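
The open-and-save-as-txt loop described above could be sketched roughly as follows. This is only a minimal illustration, not the code actually running here: the folder names and the two constants are assumptions made up for the example, there is no error handling, and it leaves the source files in place. It uses late binding so it can run from VB or from Word VBA.

```vba
Sub SaveDocsAsText()
    'Hypothetical folders for illustration only; adjust to your setup.
    Const SOURCE_DIR As String = "C:\temp\in\"
    Const OUTPUT_DIR As String = "C:\temp\out\"
    'Numeric values of Word's wdFormatText and wdDoNotSaveChanges,
    'spelled out because late binding gives no access to Word's enums.
    Const FORMAT_TEXT As Long = 2
    Const DO_NOT_SAVE As Long = 0

    Dim oWord As Object
    Dim oDoc As Object
    Dim strFile As String
    Dim strBase As String

    Set oWord = CreateObject("Word.Application")
    oWord.Visible = False

    'Dir with default attributes does not return hidden files
    strFile = Dir(SOURCE_DIR & "*.doc")
    Do While Len(strFile) > 0
        'Arguments: FileName, ConfirmConversions:=False, ReadOnly:=True
        Set oDoc = oWord.Documents.Open(SOURCE_DIR & strFile, False, True)
        'Strip the extension and save a plain-text copy elsewhere
        strBase = Left$(strFile, InStrRev(strFile, ".") - 1)
        oDoc.SaveAs OUTPUT_DIR & strBase & ".txt", FORMAT_TEXT
        oDoc.Close DO_NOT_SAVE
        strFile = Dir 'next matching file
    Loop

    oWord.Quit
    Set oWord = Nothing
End Sub
```

For unattended use you would still want an error handler that quits Word on failure, so that crashed runs do not leave hidden Word instances behind.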

Greetings from Bavaria, Germany
Helmut Weber, MVP
"red.sys" & chr(64) & "t-online.de"
Word XP, Win 98
http://word.mvps.org/
Dave Lett - 06 Jan 2005 13:44 GMT
Hi Rob,

Using vba, you could read all the files into an array as described in "How
to read the filenames of all the files in a directory into an array" at
http://word.mvps.org/faqs/macrosvba/ReadFilesIntoArray.htm and then use the
FileCopy statement to copy the .doc file as a .txt file (you can even change
the directory if you want). The FileCopy statement would take the form of

FileCopy source:="C:\Test\test.doc", _
         destination:="C:\Test\TextOnly\test.txt"

HTH,
Dave

Chuck - 06 Jan 2005 16:59 GMT
Unfortunately just copying files and renaming them with new extensions
doesn't actually convert the files from Word document to Text format.  I've
tried the FileCopy code in Dave’s message using a short Word document
containing a footnote and on opening the resulting .txt file there's a lot of
Word code but not a lot of actual document text content.  Results might vary
depending on the presence or absence of anything other than simple text in
the Word document (e.g. footnotes, paragraph numbering, shapes, etc.) but without
some sort of conversion process the results are in any case likely to be
unreadable.
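
To illustrate why a renamed copy is still not text: Word 97-2003 .doc files are OLE compound files and start with the 8-byte signature D0 CF 11 E0 A1 B1 1A E1, and FileCopy copies those bytes unchanged. The function below is a rough sketch for checking that; the name LooksLikeBinaryDoc is made up for this example, and it assumes the file exists and can be opened for reading.

```vba
Function LooksLikeBinaryDoc(strFileName As String) As Boolean
    'Hypothetical helper: True if the file starts with the
    'OLE compound file signature used by Word 97-2003 .doc files.
    Dim intFile As Integer
    Dim bytHeader(0 To 7) As Byte
    Dim varSig As Variant
    Dim i As Long

    varSig = Array(&HD0, &HCF, &H11, &HE0, &HA1, &HB1, &H1A, &HE1)

    'Too short to hold the signature: leave the result False
    If FileLen(strFileName) < 8 Then Exit Function

    'Read the first 8 bytes of the file
    intFile = FreeFile
    Open strFileName For Binary Access Read As #intFile
    Get #intFile, , bytHeader
    Close #intFile

    'Compare byte by byte; bail out on the first mismatch
    For i = 0 To 7
        If bytHeader(i) <> varSig(i) Then Exit Function
    Next i

    LooksLikeBinaryDoc = True
End Function
```

Calling it on a FileCopy result such as C:\Test\TextOnly\test.txt should return True if the source was a binary .doc, which shows the content was never converted.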

Helmut's suggestion (looping through directories, opening documents --
probably as objects -- then saving them in .txt format) is probably going to
work best, bearing in mind that saving in .txt format will lose all
footnotes, shapes etc.

Another method might be to loop through the documents, opening them, copying
the contents and then using paste unformatted to dump the text into new
documents (or text files) which might better preserve CrLfs as well as
paragraph numbering (paste unformatted removes auto paragraph numbering but
leaves the paragraph number itself as text).

Here’s a bit of code that might give you some ideas (the FileLocked function that
is called from the DocToTxt sub is from the MVP site).  Please note that I
can't warrant this code free from bugs and since it contains a Kill command
it shouldn't be run on any live files unless you have safe and secure backup
copies.  Also, I don't know how well Word would handle hundreds or thousands
of iterations without a restart.

Sub DocToTxt()

Dim oWord As Object
Dim oOldDoc As Document
Dim oNewDoc As Document
Dim i As Long
Dim strInputFileName As String
Dim strOutputFileName As String
Dim strSourceDir As String
Dim strOutputDir As String
 
On Error GoTo errorhandler

Set oWord = CreateObject("Word.Application")

'set source and output locations
strSourceDir = "C:\temp\"
strOutputDir = "C:\temp\"

With oWord.Application.FileSearch
  .FileName = "*.doc"
  .LookIn = strSourceDir
  .Execute
  For i = 1 To .FoundFiles.Count
    'FoundFiles holds full paths, so test the file name part only
    If Left$(Mid$(.FoundFiles(i), InStrRev(.FoundFiles(i), "\") + 1), 1) = "~" Then
      'do nothing, it's a temp/hidden file
    Else
      If Not FileLocked(.FoundFiles(i)) Then
        'Documents.Open returns the opened document, so keep that reference
        Set oOldDoc = oWord.Documents.Open(.FoundFiles(i))
        Set oNewDoc = oWord.Documents.Add
        With oOldDoc
          strInputFileName = .Name
          .Content.Copy
          .Close savechanges:=wdDoNotSaveChanges
        End With
        With oNewDoc
          .Range.PasteSpecial datatype:=wdPasteText
          strOutputFileName = Left(strInputFileName, _
                   Len(strInputFileName) - 4) & ".txt"
          .SaveAs FileName:=strOutputDir & _
                   strOutputFileName, _
                   fileformat:=wdFormatDOSTextLineBreaks
          .Close
        End With
        'delete source file - rem out the Kill line if you
        'don't want to delete it
        Kill strSourceDir & strInputFileName
        Set oOldDoc = Nothing
        Set oNewDoc = Nothing
      End If
    End If
  Next i
End With
   
'close the Word instance created above so it doesn't stay running
oWord.Quit
Set oWord = Nothing
   
Exit Sub
   
errorhandler:
   
   Select Case Err.Number
   
       Case 4605
           Resume Next
           
       Case Else
           MsgBox Err.Number & " " & Err.Description
           Exit Sub
           
   End Select
   
End Sub

Function FileLocked(strFileName As String) As Boolean

   On Error Resume Next
   
   ' If the file is already opened by another process,
   ' and the specified type of access is not allowed,
   ' the Open operation fails and an error occurs.
   Open strFileName For Binary Access Read Lock Read As #1
   Close #1
   
   ' If an error occurs, the document is currently open.
   If Err.Number <> 0 Then
       FileLocked = True
       Err.Clear
   End If

End Function

 