Word to PDF conversion events

a95johjo · November 5, 2018, 9:07am

Hi,
I use Aspose.Words to convert Word files to PDF (for display purposes).
I would like to be able to e.g. subscribe to some kind of event during the conversion so that I can connect a piece of text in the converted PDF to the actual Word range.
One purpose of this would be to highlight where in the PDF there is, e.g., a paragraph break or a table.
Would this be possible in Aspose.Words (or, for that matter, in Aspose.Cells)?
Right now I use reverse engineering to do the same thing, but it would be more reliable if I could connect what is known in the Word file to the output in the PDF.

(The same event could also be used to detect font issues but that is a different topic)

tahir.manzoor · November 5, 2018, 2:02pm

@a95johjo

Thanks for your inquiry. You can use the find-and-replace feature of Aspose.Words to find the text, then add some text or highlight it. We suggest you read the following article.
Find and Replace

If this does not help you, please share some more detail about your requirement along with input and expected output documents. We will then provide you more information about your query.

a95johjo · December 13, 2018, 1:53pm

Hi, I have uploaded a very simple example (the PDF output is from Aspose.Words, just for reference).

hello.docx.zip (730.3 KB)

The optimal output would be something that would allow me to create, e.g., this: for each page, get all words with their coordinates. I could also add a boolean such as “in a table” or “is header” to mimic the logic that exists in the Word file even after it has been converted to a PDF or, for that matter, an image.
hello.zip (353 Bytes)

tahir.manzoor · December 13, 2018, 3:15pm

@a95johjo

Thanks for your inquiry. Unfortunately, your question isn’t clear enough, so we request that you elaborate on it further. Please also share your expected output.

As per our understanding, you want to get the position of each word in the Word/PDF document. If this is the case, you can use the Aspose.Words layout API to get the position of each word. The Aspose.Words.Layout namespace provides classes that let you find out on which page, and where on that page, particular document elements are positioned once the document has been formatted into pages.

a95johjo · December 13, 2018, 4:10pm

OK, thanks Tahir. I will look into these classes then; it sounds like they are what I am looking for.

a95johjo · December 13, 2018, 4:59pm

I took a quick look at the Words.Layout demos. I get consistently different positions from the PDF output (Y position, not X). Would I need to tweak output margins or something like that to get this consistent?

I do this:
foreach (RenderedPage page in layoutDoc.Pages)
{
    LayoutCollection lines = page.GetChildEntities(LayoutEntityType.Line, true);
    foreach (var line in lines)
    {
        sb.AppendFormat("Page {0} Line Y {1} Text: {2}\r\n", page.PageIndex, line.Rectangle.Y, line.Text);
    }
}

The difference is only 8 points with the Word file I attached above (and with any other file I tried).

tahir.manzoor · December 13, 2018, 6:51pm

@a95johjo

Thanks for your inquiry.

You may use Aspose.PDF to get the position of text. The code example below returns the text position relative to the bottom-left corner, because the PDF standard uses a coordinate system in which (0,0) is the bottom-left corner of the page.

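// Document here is Aspose.Pdf.Document; "input" is a placeholder file name.
Document pdfDocument = new Document("input");
TextFragmentAbsorber absorber = new TextFragmentAbsorber("SearchString");
// Aspose.PDF page indices start at 1.
pdfDocument.Pages[1].Accept(absorber);
foreach (TextFragment tf in absorber.TextFragments)
{
  // Bounding rectangle of the matched text, measured from the bottom-left corner.
  var rect = tf.Rectangle;
}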

Could you please share the complete details of your use case? Please also share how you are comparing the text positions in the PDF and Word files. We will then provide you with more information about your query.

a95johjo · December 14, 2018, 11:53am

Hi Tahir,
what I have is the Word document. Using the rendering examples I have found how to connect position (e.g. to find what position/page an element is on). This position is not the same position if I do Save as PDF from Aspose.Words.
So my question is, can I align these two?
My use case is that I want to put some annotations in the PDF that are exactly on the position of a few words.
Thanks,
Best regards
Johan

a95johjo · December 14, 2018, 11:54am

and to answer your question, I find the position by looking at the ruler in Adobe Acrobat.

tahir.manzoor · December 14, 2018, 2:33pm

@a95johjo

Thanks for your inquiry. Could you please share the expected position of text “Hello” in Word document? Please also share the undesired position value you are getting. We will investigate the issue on our side and provide you more information.

a95johjo · December 14, 2018, 2:54pm

Hi Tahir,
Values in Word (for the “Line”):
Top 70.85 Bottom 93.341

these are the values in the PDF for the word “Hello”
top="71.90"
bottom="83.40"
baseline="81.324"

So the difference in Top is not too bad, though big enough to be clearly visible (1 point = 1/72 × 25.4 mm ≈ 0.4 mm), but Bottom is very different. I wonder if “Line” is the wrong object to take the rectangle from.

tahir.manzoor · December 14, 2018, 5:56pm

@a95johjo

Thanks for sharing the details. In your case, we suggest you insert a bookmark before the text whose position you want to get. The code example below shows how to get the position of a bookmark. Hope this helps you. input.zip (9.0 KB)

Document doc = new Document(MyDir + "hello.docx");

LayoutCollector collector = new LayoutCollector(doc);
LayoutEnumerator enumerator = new LayoutEnumerator(doc);

enumerator.Current = collector.GetEntity(doc.Range.Bookmarks[0].BookmarkStart);
Console.WriteLine(" --> Left : " + enumerator.Rectangle.Left + " Top : " + enumerator.Rectangle.Top);

The LayoutEnumerator.Rectangle property returns the bounding rectangle of the current entity relative to the top-left corner of the page (in points). It does not return the bottom position of the text (node).
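
The two coordinate conventions mentioned in this thread (layout rectangles measured from the top-left corner of the page, PDF positions measured from the bottom-left corner) can be reconciled with simple arithmetic. The following is only a minimal sketch and not an Aspose API: the CoordinateSketch class and its method names are hypothetical, and the page height of 792 points (US Letter) is an assumption, since the thread does not state the page size of hello.docx. The Top and Bottom inputs are the "Line" values quoted earlier in the thread.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Aspose.Words or Aspose.PDF.
public static class CoordinateSketch
{
    // Converts a Y value measured downward from the top of the page
    // into a Y value measured upward from the bottom of the page.
    // All values are in points.
    public static double FlipY(double yFromTop, double pageHeight)
    {
        return pageHeight - yFromTop;
    }

    // Converts points to millimeters (1 point = 1/72 inch, 1 inch = 25.4 mm).
    public static double PointsToMillimeters(double points)
    {
        return points / 72.0 * 25.4;
    }
}

public static class Program
{
    public static void Main()
    {
        // Assumption: US Letter page height; replace with the real page height.
        const double pageHeight = 792.0;

        // "Line" rectangle values reported for the Word layout in this thread.
        double top = 70.85;
        double bottom = 93.341;

        Console.WriteLine("PDF-style top:    {0}", CoordinateSketch.FlipY(top, pageHeight));
        Console.WriteLine("PDF-style bottom: {0}", CoordinateSketch.FlipY(bottom, pageHeight));
        Console.WriteLine("1 pt in mm:       {0:F3}", CoordinateSketch.PointsToMillimeters(1.0));
    }
}
```

This only changes the origin of the Y axis. It does not explain a difference in rectangle height, such as the one between the "Line" rectangle and the PDF values for the word “Hello” reported above.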