Firstly, no I haven't done this.
Secondly, I don't know about PowerPoint, and I think the Excel BIFF file
formats are slightly easier than Word.
Thirdly, as I understand it (and I must stress that this is not in any way
'official'), the Data stream consists entirely of image and other data blocks,
all concatenated together, with pointers to each from the document stream but
with no pointers back. Changing the size of an individual block would
require that all pointers to following blocks be changed, which in turn
would require scanning the whole document for pointers. I suspect this is a
non-starter.
Alternatively, it *may* be possible just to add a new image to the end of
the block, change the one pointer, and leave an orphaned image behind. You
are completely on your own with this, but here is how I think it works for
inline images.
* CHPX (and PAPX) structures are in a series of self-contained 512-byte
'pages' following on immediately from the document content.
* Each page contains a count of entries in the final byte.
* At the start of the page is an array of start and end positions of
'text runs' within the document; in total, one more position than the entry count.
* Following the above array is a series of one-byte offsets within
the page, counted in two-byte words (with a one-to-one correspondence to the
start positions in the array), giving the position of each CHPX structure.
* The CHPX for a picture (where the document text contains ASCII 1 as a
placeholder) contains a six-byte entry beginning with the bytes 03 6A (the
code 0x6A03 stored low byte first), followed by a four-byte offset to the
image data within the Data stream.
* I don't know the actual format of the image data. I believe all newly
saved documents use a single format but that old documents may contain some
different formats.
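
The page layout described in the list above could be read along these lines. This is only a sketch of that description, not an official parser: the function name `parse_chpx_fkp` is made up for illustration, and treating an offset byte of 0 as "no CHPX for this run" is an assumption.

```python
import struct

PAGE_SIZE = 512  # each 'page' is self-contained and 512 bytes long


def parse_chpx_fkp(page: bytes):
    """Return (start, end, chpx_bytes) for every text run in one page.

    Hypothetical helper following the layout in the list above.
    """
    if len(page) != PAGE_SIZE:
        raise ValueError("a page must be exactly 512 bytes")
    count = page[-1]  # entry count lives in the final byte
    # count + 1 little-endian 32-bit positions at the start of the page
    positions = struct.unpack_from(f"<{count + 1}I", page, 0)
    # then one offset byte per run, measured in two-byte words
    table_start = 4 * (count + 1)
    offsets = page[table_start:table_start + count]
    runs = []
    for i, word_offset in enumerate(offsets):
        if word_offset == 0:
            chpx = b""  # assumption: 0 means the run has no CHPX
        else:
            start = word_offset * 2       # words -> bytes
            length = page[start]          # a CHPX starts with its length byte
            chpx = page[start + 1:start + 1 + length]
        runs.append((positions[i], positions[i + 1], chpx))
    return runs
```

Each returned tuple pairs a run's start and end positions with the raw CHPX bytes, which can then be searched for the 03 6A entry.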
There is more to it, of course, and I would still recommend you try to find
another way. Word 2003 reads and writes XML documents, and in Word 2007 XML is
the default format.
--
Enjoy,
Tony
> Firstly, no I haven't done this.
>
[quoted text clipped - 16 lines]
> * CHPX (and PAPX) structures are in a series of self contained 512-byte
> 'pages' following on immediately from the document content.
Fine with me
> * Each page contains a count of entries in the final byte.
Fine with me
> * At the start of the page are an array of start and end positions of
> 'text runs' within the document - in total one more than the entry count.
Fine with me
> * Following the above array is a series of one byte word offsets within
> the page (with a one to one correspondence to the start positions in the
> array) giving the position of each CHPX structure.
Here I have a slight problem: the offset mentioned there is
0xfd (253), and it took me to a lot of zeros. I can see the 0x6A03 on this
page, but it is at offset 488 (decimal) bytes from the start of
the page. Below are the exact bytes at the end of the
page, starting after a lot of zeros:
0f 03 6a 00 00 00 00 16 68 b0 05 17 00 55 08 01 06 16 68 b0 05 17 00 03
Here 0x0f is the length and 03 6a (0x6A03) is the sprm with the fcPic
property, but I am not able to understand the remaining data.
Can you please throw more light on this?
> * The CHPX for a picture (where the document text contains ASCII 1 as a
> placeholder) contains a six byte entry beginning 036A and followed by an
[quoted text clipped - 3 lines]
> saved documents use a single format but that old documents may contain some
> different formats.
They are stored in the Office Drawing 97 (Escher) file format.
Tony Jollans - 15 Sep 2006 18:09 GMT
(see inline)
--
Enjoy,
Tony
> > Firstly, no I haven't done this.
> >
[quoted text clipped - 36 lines]
> page but that is at the offset 488(decimal) bytes from the strating of
> this page.
The 0xFD offset is in words, not bytes, so double it to get bytes (0xFD x 2 = 506).
> Below i'm mentioning the exact stream, last byte sof teh
> page starting after lot of zeros,
>
> 0f 03 6a 00 00 00 00 16 68 b0 05 17 00 55 08 01 06 16 68 b0 05 17 00 03
These are two CHPXs, first
0f 03 6a 00 00 00 00 16 68 b0 05 17 00 55 08 01
and then
06 16 68 b0 05 17 00
The 03 6a 00 00 00 00 means that the image is at the beginning of the Data
stream (offset 0).
The 55 08 01 is the special bit setting, which means that the character (ASCII
1) is a placeholder.
I don't know what the 16 68 etc. is, but it looks like the picture and some
preceding text have a format applied which is encoded here. You should be
able to ignore it. You can skip over it by picking up the operand length from
the code itself: in 0x6816, bits 13-15 = 011 (= code 3) = 4-byte operand (b0 05 17 00).
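
The skipping rule can be sketched as a small loop over the bytes of one CHPX (after its length byte). Only code 3 = 4 bytes comes from the explanation above; the other entries in `OPERAND_SIZE` are my reading of the published Word 97 format description, and the single length byte for the variable-length case is a simplification. The names `split_sprms` and `picture_offset` are hypothetical.

```python
import struct

# Operand size in bytes, keyed by bits 13-15 of the 16-bit code.
# None marks the variable-length case (assumed here to start with
# one length byte, which is a simplification).
OPERAND_SIZE = {0: 1, 1: 1, 2: 2, 3: 4, 4: 2, 5: 2, 6: None, 7: 3}


def split_sprms(grpprl: bytes):
    """Split the body of a CHPX into (code, operand_bytes) pairs."""
    pairs = []
    pos = 0
    while pos + 2 <= len(grpprl):
        (code,) = struct.unpack_from("<H", grpprl, pos)  # low byte first
        pos += 2
        size = OPERAND_SIZE[code >> 13]
        if size is None:
            if pos >= len(grpprl):
                break
            size = grpprl[pos]
            pos += 1
        pairs.append((code, grpprl[pos:pos + size]))
        pos += size
    return pairs


def picture_offset(grpprl: bytes):
    """Return the Data stream offset from the 0x6A03 entry, or None."""
    for code, operand in split_sprms(grpprl):
        if code == 0x6A03 and len(operand) == 4:
            return struct.unpack("<I", operand)[0]
    return None
```

Applied to the first CHPX above (without its 0f length byte), this yields the codes 0x6A03, 0x6816 and 0x0855 in that order, so the unknown 0x6816 entry is stepped over without needing to understand it.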
Gaurav - 16 Sep 2006 05:55 GMT
Thanks Tony,
I still have one doubt. Below is the FKP for PAPX:
00 06 00 00 11 08 00 00 fd 00 00 00 ---- a lot of zeros, and at the end of
the page boundary
we have
00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 01
Reading the word offset, fd*2 took me to byte 506, that is, to
00 01 00 00 00 01
Now, the first byte is zero, which means I have to take the next byte as
the length of the following data in the PAPX.
It is 0x01, counted in words, so it gave me 2 bytes.
Next, I have to read the index for the style descriptor in two bytes,
00 00 in this case, and then we have the byte array for
the grpprl, which is 00 in this case.
Total length = grpprl + index to style = 1+2 = 3
but the length stored in papx.cw = 0x01 (2 bytes).
Can you please tell me where I'm wrong?
One more thing: in the documentation, the structure of the bin table is
described as
containing the FKP page number in 4 bytes, and the FIB gives
the length of the BTE data; in my case it was 12.
When I read the BTE from the table stream it was 00 06 00 00 11 08 00
00 05 00 00 00
Here the first 4 bytes are the offset of the first text character,
the next four bytes contain the offset of the last character, and then
comes the FKP page number.
Is it always the case that I have to read it in this manner, or should I
consider the first four bytes as the FKP page number?
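
For what it is worth, the 12 bytes quoted here are consistent with a layout of n+1 four-byte positions followed by n four-byte page numbers, with n = 1. The sketch below assumes that layout, assumes that a page number times 512 gives the byte offset of the FKP page in the main document stream, and assumes only the low 22 bits of the page number are significant. The name `parse_bin_table` is hypothetical.

```python
import struct


def parse_bin_table(data: bytes):
    """Return (first_pos, last_pos, page_number, byte_offset) per entry.

    Assumed layout: n + 1 positions, then n page numbers, all 4 bytes,
    little-endian, so len(data) == 4 * (n + 1) + 4 * n.
    """
    n = (len(data) - 4) // 8
    positions = struct.unpack_from(f"<{n + 1}I", data, 0)
    pages = struct.unpack_from(f"<{n}I", data, 4 * (n + 1))
    entries = []
    for i in range(n):
        page_number = pages[i] & 0x3FFFFF  # assumption: upper bits unused
        entries.append((positions[i], positions[i + 1],
                        page_number, page_number * 512))
    return entries
```

Under these assumptions the bytes above give one entry covering positions 0x600 to 0x811 with page number 5, which would put the FKP page at byte offset 2560.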
Thanks
Gaurav Vashishth
Tony Jollans - 16 Sep 2006 12:27 GMT
My reading of this is as follows
00 01 00 00 00 01
^^                  Because this is zero
   ^^               this is a count of words
      ^^ ^^         This is one word
            ^^      And that's it, so this is padding
               ^^   (count of PAPXs)
In other words the run of paragraphs is unchanged Normal style.
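
That reading can be written down as a short sketch. The zero-first-byte branch follows the breakdown above; the non-zero branch (length = 2 x count - 1 bytes) is my understanding of the published format and should be treated as an assumption. The name `read_papx` is hypothetical.

```python
import struct


def read_papx(page: bytes, word_offset: int):
    """Return (istd, grpprl_bytes) for the PAPX at a word offset in a page."""
    pos = word_offset * 2              # offsets are in two-byte words
    first = page[pos]
    if first == 0:
        size = page[pos + 1] * 2       # next byte is a count of words
        pos += 2
    else:
        size = first * 2 - 1           # assumption for the non-zero case
        pos += 1
    body = page[pos:pos + size]
    (istd,) = struct.unpack_from("<H", body, 0)  # style index, 2 bytes
    return istd, body[2:]              # whatever is left is the grpprl
```

With the six bytes above at offset 506, the count of one word leaves room for the two-byte style index only, so the grpprl comes back empty rather than as a single 00 byte.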
I don't know the answer to your other question. I'm sorry.
--
Enjoy,
Tony
Gaurav - 17 Sep 2006 07:04 GMT
Thank you very much
Gaurav Vashishth
Gaurav - 18 Sep 2006 07:14 GMT
Hi Tony,
I have got the papx.istd, which is zero in my case. Can you now help me
with fetching the properties for the PAP?
I have reached the start of the table stream, where I got the STSHI
information, and after this there are the STDs. Does the value I got,
zero, mean that I have to read the zeroth STD?
Its length is 64, and the default PAP length mentioned in the document
is 610.
I'm a bit confused about how to get the PAP for the paragraph.
I need your help.
Thanks
Gaurav Vashishth
Tony Jollans - 18 Sep 2006 08:21 GMT
I'm not hugely knowledgeable about this :-) but ...
Paragraph properties are based on styles. Styles are based on other styles.
Styles in documents are based on styles in templates, etc., etc. Essentially
I wouldn't expect to find full paragraph properties in one place in a
document. In other words, you won't find a PAP, only layer upon layer of PAPXs,
with the ultimate base coming from the registry, in turn originally coming
from built-in values.
--
Enjoy,
Tony
Gaurav - 18 Sep 2006 09:17 GMT
Hi Tony,
Thanks for all your replies regarding this topic.
Regards,
Gaurav Vashishth