MS Office Forum / Word / Conversions / September 2006

Editing the images in the doc files

Gaurav - 14 Sep 2006 10:08 GMT
I'm working with doc files and need to extract the images from them,
edit the images, and then put them back into the data stream.

I can see the pictures in the data stream and am able to extract them
and put them back, but I need to modify the entries which contain the
size and offset of the pictures.

I also read about the CHPX and PAPX but am not able to understand what
exactly the procedure is.

It looks to me as though I need to scan the text written in the file,
search for the 1 and 8 characters and, once they are found, look up
the PAP and CHP to ultimately get the offset of the pictures in the
data stream.
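
A minimal sketch of the first step described above, scanning the text for the 1 and 8 characters. It assumes the text has already been read out of the file as single-byte characters (16-bit text would need a two-byte stride), and `find_placeholders` is a hypothetical helper name. A hit is only a candidate: it still has to be confirmed through the CHP for that character.

```python
def find_placeholders(text: bytes) -> list[tuple[int, int]]:
    """Return (position, character code) for every 0x01 or 0x08 byte in text."""
    return [(i, b) for i, b in enumerate(text) if b in (0x01, 0x08)]


# Made-up sample text with one character 1 and one character 8.
sample = b"Before \x01 after, then a drawn object \x08."
hits = find_placeholders(sample)
for position, code in hits:
    print(f"candidate placeholder {code} at text position {position}")
```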

Am I right or am I missing something? Is there a better and faster way
to do this?

Looking for your positive reply

Thanks in Advance
Gaurav Vashishth
Tony Jollans - 14 Sep 2006 14:09 GMT
You are, to the best of my knowledge, correct in what you say - that PICX
structures are hung off CHP structures for ASCII 1 (and maybe 8) characters
with their special bit set. There is no guarantee that the data (document
text or image) will be physically contiguous, though, and you must read the
compound file with the correct APIs to be sure of getting the correct data.
Also, if you change the size of a picture you will have to change the
offsets of all pictures and other structures which follow the picture - a
not insignificant task anyway, never mind the fact that there is not, nor is
there ever likely to be, any proper documentation. I strongly recommend you
find another way to do whatever it is you are trying to do - can you not
work with documents in xml format for example?

--
Enjoy,
Tony

Gaurav - 15 Sep 2006 05:19 GMT
Hi Tony,

First of all, thanks for your reply.

No, I can't work with the xml format, since I need to produce a file
which Office can open after I edit the images and change the sizes
and the offsets.

I have developed APIs for reading the OLE compound file format. In
fact I was able to do the same thing, editing the images and putting
them back, with PPT and XLS files, but I got stuck with this format as
it is very complex.

Have you ever done this, I mean fetched the CHP properties for a
character?

Can't enjoy until I get the answers :-)
- Gaurav Vashishth

Tony Jollans - 15 Sep 2006 13:13 GMT
Firstly, no I haven't done this.

Secondly, I don't know about PowerPoint and I think the Excel BIFF file
formats are slightly easier than Word.

Thirdly, as I understand it - and I must stress that this is not in any way
'official', the Data stream consists entirely of image and other data blocks
all concatenated together with pointers to each from the document stream but
with no pointers back. Changing the size of an individual block would
require that all pointers to following blocks be changed which, in turn
would require scanning the whole document for pointers. I suspect this is a
non-starter.
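
To make the cost concrete, here is a rough sketch of the bookkeeping for resizing in place, next to the append-at-the-end alternative raised in the next paragraph. The offsets are made-up numbers and both function names are hypothetical. Finding the pointers inside the document stream is the hard part and is not shown.

```python
def shift_following_offsets(offsets, changed_offset, delta):
    """Return offsets adjusted for a block at changed_offset that grew by delta bytes."""
    return [o + delta if o > changed_offset else o for o in offsets]


def append_block(data, new_block):
    """Append new_block to the data stream; return (new_data, offset_of_new_block)."""
    return data + new_block, len(data)


# Hypothetical data stream holding three blocks at these offsets.
offsets = [0, 1000, 2500]

# Resizing the first block in place by +200 bytes moves every later block,
# so every pointer to those blocks has to be found and rewritten.
resized = shift_following_offsets(offsets, changed_offset=0, delta=200)

# Appending leaves the existing offsets alone and yields one new pointer.
data = bytes(3000)
data, new_offset = append_block(data, b"\xAA" * 1200)
```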

Alternatively, it *may* be possible just to add a new image to the end of
the block, change the one pointer, and leave an orphaned image behind. You
are completely on your own with this but here is how I think it works for
inline images.

 *  CHPX (and PAPX) structures are in a series of self contained 512-byte
'pages' following on immediately from the document content.

 *  Each page contains a count of entries in the final byte.

 *  At the start of the page is an array of start and end positions of
'text runs' within the document - in total one more than the entry count.

 *  Following the above array is a series of one-byte word offsets within
the page (with a one to one correspondence to the start positions in the
array) giving the position of each CHPX structure.

 *  The CHPX for a picture (where the document text contains ASCII 1 as a
placeholder) contains a six byte entry beginning 036A and followed by an
offset to the image data within the data stream.

 *  I don't know the actual format of the image data. I believe all newly
saved documents use a single format but that old documents may contain some
different formats.
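
The page layout in the list above can be sketched roughly as follows. This is an illustration under assumptions, not a verified reader: `parse_chpx_page` is a hypothetical name, each CHPX is assumed to begin with a one-byte length, and a word offset of 0 is assumed to mean that the run has no CHPX. The sample page is synthetic: its last 24 bytes are the ones posted later in this thread, while the text positions and word offsets are invented to match them.

```python
import struct

PAGE_SIZE = 512


def parse_chpx_page(page: bytes):
    """Return (run_start, run_end, chpx_bytes) for each run in a 512-byte page."""
    if len(page) != PAGE_SIZE:
        raise ValueError("expected a 512-byte page")
    count = page[-1]  # entry count lives in the final byte
    # count + 1 four-byte positions at the start of the page
    positions = struct.unpack_from(f"<{count + 1}I", page, 0)
    table_start = 4 * (count + 1)
    word_offsets = page[table_start:table_start + count]
    runs = []
    for i, word_offset in enumerate(word_offsets):
        if word_offset == 0:
            chpx = b""  # assumption: no CHPX stored for this run
        else:
            pos = 2 * word_offset  # offsets are in words, so double them
            length = page[pos]     # assumption: one-byte length prefix
            chpx = page[pos + 1:pos + 1 + length]
        runs.append((positions[i], positions[i + 1], chpx))
    return runs


# Synthetic page: three runs, invented positions, real tail bytes.
page = bytearray(PAGE_SIZE)
struct.pack_into("<4I", page, 0, 0x600, 0x640, 0x641, 0x811)
page[16:19] = bytes([0xF4, 0xFC, 0x00])  # 0xF4 * 2 = 488, 0xFC * 2 = 504
page[488:512] = bytes.fromhex(
    "0f 03 6a 00 00 00 00 16 68 b0 05 17 00 55 08 01"
    " 06 16 68 b0 05 17 00 03"
)
runs = parse_chpx_page(bytes(page))
for start, end, chpx in runs:
    print(hex(start), hex(end), chpx.hex(" "))
```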

There is more to it, of course, and I would still recommend you try and find
another way. Word 2003 reads and writes xml documents and in Word 2007 it is
the default format.

--
Enjoy,
Tony

Gaurav - 15 Sep 2006 15:20 GMT
>   *  CHPX (and PAPX) structures are in a series of self contained 512-byte
> 'pages' following on immediately from the document content.

Fine with me

>   *  Each page contains a count of entries in the final byte.

Fine with me

>   *  At the start of the page are an array of start and end positions of
> 'text runs' within the document - in total one more than the entry count.

Fine with me

>   *  Following the above array is a series of  one byte word offsets within
> the page (with a one to one correspondence to the start positions in  the
> array) giving the position of each CHPX structure.

Here I have a slight problem: the offset given is 0xfd (253) and it
took me to a lot of zeros. I can see the 0x6A03 in this page, but it
is 488 (decimal) bytes from the start of the page. Below is the exact
stream: the last bytes of the page, starting after a lot of zeros.

0f 03 6a 00 00 00 00 16 68 b0 05 17 00 55 08 01 06 16 68 b0 05 17 00 03

Here 0x0f is the length and 03 6a (0x6A03) is the sprm carrying the
fcPic property, but I am not able to understand the remaining data.

Can you please throw more light on this?

>   *  I don't know the actual format of the image data. I believe all newly
> saved documents use a single format but that old documents may contain some
> different formats.

They are stored in the Drawing 97 (Escher) file format.

Tony Jollans - 15 Sep 2006 18:09 GMT
(see inline)

--
Enjoy,
Tony

> Here I have a slight problem: the offset given is 0xfd (253) and it
> took me to a lot of zeros. I can see the 0x6A03 in this page, but it
> is 488 (decimal) bytes from the start of the page.

The 0xFD offset is in words, not bytes, so double it to get the byte
offset (0xFD x 2 = 506).

> Below is the exact stream: the last bytes of the page, starting after a
> lot of zeros.
>
> 0f 03 6a 00 00 00 00 16 68 b0 05 17 00 55 08 01 06 16 68 b0 05 17 00 03

This is two CHPXs, first

0f    03 6a 00 00 00 00    16 68 b0 05 17 00    55 08 01

and then

06    16 68 b0 05 17 00

The 03 6a 00 00 00 00 means that the image is at the beginning of the data
stream (offset 0)

The 55 08 01 is the special bit setting which means that the character
(ASCII 1) is a placeholder.

I don't know what the 16 etc is but it looks like the picture and some
preceding text have a format applied which is encoded here. You should be
able to ignore it. You can skip over it by picking up the operand length
from the sprm code: in 0x6816, bits 13-15 are 011 (= code 3), which means
a 4-byte operand (b0 05 17 00).
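
The skipping rule can be sketched as a small walker over the first CHPX above. Only the "code 3 means a 4-byte operand" entry comes from this thread, and the one-byte operand for code 0 matches the 55 08 01 entry. The other sizes in the table are assumptions, variable-length operands (code 6) are deliberately not handled, and `split_sprms` is a hypothetical name.

```python
import struct

# Operand size in bytes, keyed by bits 13-15 of the sprm code.
# Entry 3 is from the thread; the remaining entries are assumptions.
OPERAND_SIZE = {0: 1, 1: 1, 2: 2, 3: 4, 4: 2, 5: 2, 7: 3}


def split_sprms(grpprl: bytes):
    """Split a property list into (sprm_code, operand_bytes) pairs."""
    pairs = []
    pos = 0
    while pos + 2 <= len(grpprl):
        (code,) = struct.unpack_from("<H", grpprl, pos)  # little-endian code
        pos += 2
        size_code = code >> 13  # bits 13-15
        if size_code == 6:
            raise NotImplementedError("variable-length operand not handled")
        size = OPERAND_SIZE[size_code]
        pairs.append((code, grpprl[pos:pos + size]))
        pos += size
    return pairs


# The 15 bytes that follow the 0f length byte in the first CHPX.
chpx = bytes.fromhex("03 6a 00 00 00 00 16 68 b0 05 17 00 55 08 01")
sprms = split_sprms(chpx)
for code, operand in sprms:
    print(f"sprm 0x{code:04X} operand {operand.hex(' ')}")

# Offset of the picture in the data stream, taken from the 0x6A03 operand.
picture_offset = int.from_bytes(dict(sprms)[0x6A03], "little")
```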

Gaurav - 16 Sep 2006 05:55 GMT
Thanks Tony,

I still have one doubt. Below is the FKP for the PAPX:

00 06 00 00 11 08 00 00 fd 00 00 00

followed by a lot of zeros, and at the end of the page we have

00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 01

Reading the word offset, fd*2 took me to byte 506, which is at

00 01 00 00 00 01

The first byte is zero, which means I have to take the next byte as
the length of the following data in the PAPX.

It is 0x01, a count of words, so it gives me 2 bytes.

Next, I have to read the index of the style descriptor in two bytes,
00 00 in this case, and then comes the byte array for the grpprl,
which is 00 in this case.

Total length = grpprl + index to style = 1 + 2 = 3

but the length stored in papx.cw is 0x01 (2 bytes).

Can you please tell me where I'm wrong?

One more thing: in the documentation, the structure of the bin table is
described as containing the FKP page number in 4 bytes, and the FIB
also gives the length of the BTE, which in my case was 12.

When I read the BTE from the table stream it was

00 06 00 00 11 08 00 00 05 00 00 00

The first 4 bytes are the offset of the first text character, the
next 4 bytes are the offset of the last character, and then comes the
FKP page number.

Is this always the case, so that I have to read it in this manner, or
should I take the first four bytes as the FKP page number?
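
This question is not answered in the thread. One plausible reading of the 12 bytes is sketched below. It assumes the table holds n + 1 four-byte positions followed by n four-byte page numbers, and that a page number times 512 gives the byte offset of the page in the main stream. Both are assumptions, and `parse_bin_table` is a hypothetical name. It is at least consistent with the data posted here: the two positions it yields are the same as the first eight bytes of the FKP page quoted above.

```python
import struct


def parse_bin_table(table: bytes, page_size: int = 512):
    """Return (first_pos, end_pos, page_number, page_byte_offset) per entry."""
    # n + 1 positions and n page numbers, four bytes each: 4(n + 1) + 4n bytes
    n = (len(table) - 4) // 8
    positions = struct.unpack_from(f"<{n + 1}I", table, 0)
    page_numbers = struct.unpack_from(f"<{n}I", table, 4 * (n + 1))
    return [
        (positions[i], positions[i + 1], page_numbers[i], page_numbers[i] * page_size)
        for i in range(n)
    ]


bte = bytes.fromhex("00 06 00 00 11 08 00 00 05 00 00 00")
entries = parse_bin_table(bte)
for first, end, page, offset in entries:
    print(f"text 0x{first:X}..0x{end:X} -> page {page} at byte offset {offset}")
```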

Thanks
Gaurav vashishth

Tony Jollans - 16 Sep 2006 12:27 GMT
My reading of this is as follows

00 01 00 00 00 01
^^                   Because this is zero
   ^^                - this is a count of words
      ^^ ^^          This is one word
            ^^       And that's it, so this is padding
               ^^    (count of PAPXs)

In other words the run of paragraphs is unchanged Normal style.
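
The reading above can be written out as a short sketch. The zero-first-byte branch follows that reading directly. The non-zero branch (2 x cb - 1 bytes) is an assumption that this thread does not confirm, and `parse_papx` is a hypothetical name. The page is synthetic apart from its last six bytes.

```python
def parse_papx(page: bytes, byte_offset: int):
    """Return (istd, grpprl_bytes) for the PAPX starting at byte_offset."""
    first = page[byte_offset]
    if first == 0:
        # Zero first byte: the next byte is a count of words.
        words = page[byte_offset + 1]
        start = byte_offset + 2
        size = 2 * words
    else:
        # Assumption: a non-zero first byte means 2 * cb - 1 bytes follow.
        start = byte_offset + 1
        size = 2 * first - 1
    body = page[start:start + size]
    istd = int.from_bytes(body[:2], "little")  # style index, two bytes
    return istd, body[2:]


# Synthetic 512-byte page ending with the six bytes from the post.
page = bytearray(512)
page[506:512] = bytes.fromhex("00 01 00 00 00 01")
istd, grpprl = parse_papx(bytes(page), 2 * 0xFD)  # word offset 0xFD -> byte 506
print(f"istd={istd}, grpprl={grpprl.hex(' ') or '(empty)'}, count={page[-1]}")
```

With one word of data there is room for the two-byte style index only, so the grpprl is empty and the 00 that follows is padding rather than a property.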

I don't know the answer to your other question. I'm sorry.

--
Enjoy,
Tony

Gaurav - 17 Sep 2006 07:04 GMT
Thank you very much

Gaurav Vashishth

Gaurav - 18 Sep 2006 07:14 GMT
Hi Tony,

I have got the papx.istd, which is zero in my case. Now can you help me
with fetching the properties for the PAP?

I have reached the start of the table stream, where I found the STSHI
information, and after it there are the STDs. Does the index I got,
zero, mean that I have to read the zeroth STD?

Its length is 64, and the default PAP length mentioned in the
document is 610.

I'm a bit confused about how to get the PAP for the paragraph.

I need your help.

Thanks
Gaurav Vashishth

Tony Jollans - 18 Sep 2006 08:21 GMT
I'm not hugely knowledgeable about this :-) but ...

Paragraph properties are based on styles. Styles are based on other styles.
Styles in documents are based on styles in templates, etc., etc. Essentially
I wouldn't expect to find full paragraph properties in one place in a
document. In other words you won't find a PAP, only layer upon layer of PAPXs
with the ultimate base coming from the registry, in turn originally coming
from built-in values.

--
Enjoy,
Tony

Gaurav - 18 Sep 2006 09:17 GMT
Hi Tony,

Thanks for all your replies regarding this Topic

Regards,
Gaurav Vashishth
