Positioning in MobiXml

bwarner · March 29, 2018, 2:43pm

I am evaluating Aspose.PDF with the goal of being able to draw a box around phrases determined by the application. My strategy is to render the pdf as an image, then capture the position information of the text and use positioning to draw the box in HTML/CSS.

I prototyped the system with Aspose.PDF, capturing the positions of all text in the pdf with SaveFormat.MobiXML. I captured an image of the first page of the document with pngDevice.process(doc.getPages().get_Item(1)). When I use the top, right, width and height provided by the xml to draw a box over the png, it’s shifted a bit left and a bit down. It’s really not enclosing the text accurately enough to display to a user. Is MobiXML the right approach for this application, or am I doing something wrong?

I’ve attached an image to this message that illustrates the problem.

Thanks very much – Bill

Screen Shot 2018-03-29 at 10.28.57 AM.png (585.7 KB)

Farhan.Raza · March 29, 2018, 8:47pm

@bwarner

Thank you for contacting support.

We would like to share with you that you can search and highlight text in a PDF document by using the code snippet below:

    ArrayList<String> searchKeyWords = new ArrayList<String>();
    searchKeyWords.add("Test");
    searchKeyWords.add("String");

    // Regexp builder
    StringBuilder temp = new StringBuilder();
    for (String key : searchKeyWords)
    {
        if (temp.length() > 0)
            temp.append("|");
        temp.append("(?i)").append(key);
    }
    String regexp = temp.toString();
    try
    {
        String inputName = myDir + "0549.pdf";
        String outputName = myDir + "0549" + version + "_result.pdf";

        com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(inputName);
        com.aspose.pdf.TextSearchOptions textSearchOptions = new com.aspose.pdf.TextSearchOptions(true);

        PageCollection pgCollection = pdfDocument.getPages();
        for (int i = 1; i <= pgCollection.size(); ++i)
        {
            long executionTime = System.currentTimeMillis();// begin timer
            Page pg = pgCollection.get_Item(i);
            com.aspose.pdf.TextFragmentAbsorber textFragmentAbsorber = new com.aspose.pdf.TextFragmentAbsorber(regexp);//NO I18N
            textFragmentAbsorber.setTextSearchOptions(textSearchOptions);
            textFragmentAbsorber.visit(pg);
            com.aspose.pdf.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
            for (com.aspose.pdf.TextFragment textFragment : (Iterable<com.aspose.pdf.TextFragment>) textFragmentCollection)
            {
                textFragment.getTextState().setBackgroundColor(com.aspose.pdf.Color.getYellow());
            }
            System.out.println("Page "+ i+ "\t Time: \t" + (System.currentTimeMillis() - executionTime) + " \t ms. \t ** Memory status (max / used / free):");
            printMemoryStatus_();

            //perform intermediate save on every 300th page, depends on available heap memory
            if (i % 300 == 0)
            {
                pdfDocument.save(outputName, SaveFormat.Pdf);
                pdfDocument = new com.aspose.pdf.Document(outputName);
                pgCollection = pdfDocument.getPages();
            }
        }
        if (pgCollection.size() % 300 != 0)            
            pdfDocument.save(outputName, SaveFormat.Pdf);

    } catch (Exception ex)
    {

        System.out.println(ex);
        throw ex;
    }

I hope this will be helpful. Please feel free to contact us if you need any further assistance.

bwarner · April 2, 2018, 4:25pm

Well – my requirement is to do the drawing/highlighting in html, not in pdf. It is not a requirement to be able to select text in the resulting view, so an image is best, as it allows me the greatest control in masking and positioning. I’m open to other viewing schemes, as long as they are as web friendly as an image. But I think the best way would be to use the positioning information over an image. Is there an accurate way to map character coordinates to an image?

Thanks again – Bill

Farhan.Raza · April 2, 2018, 7:50pm

@bwarner

Thank you for elaborating it further.

You can search and highlight any text in a PDF document, and the resultant document can then be converted to an HTML document as explained in Convert PDF to HTML format. Alternatively, you can generate images of each page from the resultant PDF document and use those images in your HTML file as per your requirements. Moreover, drawing or highlighting in HTML may not land on the exact words because the coordinate system of the Aspose.PDF API can differ from the coordinate system of an HTML file. That is why we believe the suggested approaches can satisfy your requirements efficiently.

We hope this will be helpful. Please let us know if you need any further assistance.

bwarner · April 2, 2018, 9:15pm

I’d like to use Aspose to first render an image, with a specified size. I will put it into a browser, possibly under a canvas element, and I will specify the same size. Then, I use Aspose to tell me the position of all the text on the page, presumably the same information in the pdf that was used to arrange the glyphs before rendering them as an image. With that information, the web application can render any graphics my team decides best illustrate the concept we want to highlight, and it can be easily animated, or hidden, or many other effects, in the web application.

I believe that Aspose.PDF can provide me with the same position information that it used to generate an image. But if you could confirm that, I would appreciate it.

Thanks again --Bill

Farhan.Raza · April 3, 2018, 7:11am

@bwarner

You can generate an image of your desired dimensions by using the code shared in Convert a particular page region to image, where you can pass the Height and Width of the page as parameters for the rectangle object, as in the line of code below:

page.getPageInfo().getHeight();

Then, you can get the position coordinates of all the text on a PDF page by using the getPosition() method of each TextFragment, as explained in Search and Get Text from Pages of a PDF Document. Use these position coordinates to manipulate the image as per your requirements.

bwarner · April 16, 2018, 8:51pm

Hello @Farhan.Raza,

I’ve taken this approach, and I’m getting back numbers, but so far they don’t seem to match up to the dimensions of my document. Below, I’m expressing the position as a percentage. All the values are between 0 and 100 as expected, but the resulting box doesn’t surround the given character. What are the units returned by PageInfo.getHeight() and CharInfo.getPosition().getYIndent()?

Thanks again for your help – Bill

    public static String charFormat = "top:%.2f right:%.2f bottom:%.2f left:%.2f width:%.2f height:%.2f char=\"%c\"\n";

    Page page = doc.getPages().get_Item(1);
    PageInfo pi = page.getPageInfo();
    double pageHeight = pi.getHeight();
    double pageWidth = pi.getWidth();
    page.accept(textFragmentAbsorber);
    TextFragmentCollection textFragmentCollection =
        textFragmentAbsorber.getTextFragments();
    for (TextFragment textFragment : (Iterable<TextFragment>) textFragmentCollection) {
        // loop through the segments
        for (TextSegment textSegment : (Iterable<TextSegment>) textFragment.getSegments()) {
            // loop through the characters (the character collection is 1-based)
            for (int i = 1; i <= textSegment.getText().length(); i++) {
                CharInfo charInfo = textSegment.getCharacters().get_Item(i);
                Rectangle rect = charInfo.getRectangle();
                Position pos = charInfo.getPosition();
                // express each value as a percentage of the page size
                double top = 100 * pos.getYIndent() / pageHeight;
                double left = 100 * pos.getXIndent() / pageWidth;
                double width = 100 * rect.getWidth() / pageWidth;
                double height = 100 * rect.getHeight() / pageHeight;
                double right = left + width;
                double bottom = top + height;
                char c = textSegment.getText().charAt(i - 1);
                System.out.printf(charFormat, top, right, bottom, left, width, height, c);
            }
        }
    }

bwarner · April 16, 2018, 10:19pm

Also, @Farhan.Raza,

The page dimensions in getPageInfo are WxH 595x842, which is the A4 page size. When I run pdfinfo on my file, the dimensions are 612x792.

Farhan.Raza · April 17, 2018, 7:02am

@bwarner

Would you please share your source PDF file, the generated XML file and the generated output file in which the box does not surround the given characters, along with a narrowed-down sample application reproducing this issue, so that we may try to reproduce and investigate it in our environment?

Please also share the file and code that reproduce the difference in page dimensions between getPageInfo and pdfinfo. Please note that the Height and Width properties use the point as their basic unit, where 1 inch = 72 points and 1 cm = 1/2.54 inch = 0.3937 inch = 28.3 points.
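The point-based units mentioned above can be converted to image pixels once the rendering resolution is known. The following is a minimal sketch, not Aspose.PDF code: the class `PdfUnits` and its methods are hypothetical names introduced for illustration, and the 150 dpi resolution is an assumed example value.

```java
// Hypothetical helper for illustration; not part of Aspose.PDF.
final class PdfUnits {
    static final double POINTS_PER_INCH = 72.0;
    static final double CM_PER_INCH = 2.54;

    // Convert a length in PDF points to pixels at the given resolution (dots per inch).
    static double pointsToPixels(double points, double dpi) {
        return points * dpi / POINTS_PER_INCH;
    }

    // Convert a length in PDF points to centimetres.
    static double pointsToCentimeters(double points) {
        return points / POINTS_PER_INCH * CM_PER_INCH;
    }

    public static void main(String[] args) {
        // 612 x 792 points is the page size pdfinfo reported in this thread.
        // 150 dpi is an assumed rendering resolution.
        System.out.printf("%.0f x %.0f px%n",
                pointsToPixels(612, 150), pointsToPixels(792, 150));
    }
}
```

If the overlay is positioned in percentages rather than pixels, the resolution cancels out, but the page size used as the divisor still has to be the real one.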

bwarner · April 17, 2018, 2:23pm

@Farhan.Raza,

Please find the requested files attached. The purpose of the html file is to show the position of a bounding box for the adjacent characters “ACE”, with the values provided by ConvertPdfToXml. You’ll see that the box is quite far from the characters. The image provided is generated by Aspose as well.

# tar xvzf aspose-support.tgz
# ls -alF aspose-support
total 700
drwxr-xr-x  6 root root    204 Apr 17 14:21 ./
drwxr-xr-x 44 root root   1496 Apr 17 14:21 ../
-rw-r--r--  1 root root    338 Apr 17 14:20 19671371_modified.html
-rw-r--r--  1 root root 168771 Apr 17 13:40 19671371_modified.pdf
-rw-r--r--  1 root root 532900 Apr 17 14:14 19671371_modified.png
-rw-r--r--  1 root root   2046 Apr 17 13:47 ConvertPdfToXml.java
# cd aspose-support/
# export CLASSPATH=.:aspose.pdf-17.8.jar
# javac ConvertPdfToXml.java
# cat 19671371_modified.pdf | java ConvertPdfToXml
top:88.33 right:9.46 bottom:89.64 left:9.08 width:0.38 height:1.31 pageWidth:595.00 pageHeight:842.00 char=" "
top:80.34 right:17.15 bottom:86.61 left:13.24 width:3.91 height:6.27 pageWidth:595.00 pageHeight:842.00 char="A"
top:80.34 right:20.97 bottom:86.61 left:17.15 width:3.82 height:6.27 pageWidth:595.00 pageHeight:842.00 char="C"
top:80.34 right:24.30 bottom:86.61 left:20.97 width:3.32 height:6.27 pageWidth:595.00 pageHeight:842.00 char="E"

aspose-support.zip (633.7 KB)
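The per-character values printed above can be merged into a single box for the phrase “ACE” by taking the extreme edges. This is a minimal sketch under stated assumptions: `PhraseBox` is a hypothetical helper, not part of Aspose.PDF, and the numbers are copied from the output above as printed, without any correction for page size or axis direction.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Aspose.PDF.
final class PhraseBox {
    // Each box is {left, top, right, bottom}, all in the same units.
    // Returns the smallest box that contains every input box.
    static double[] union(double[][] boxes) {
        if (boxes.length == 0) {
            throw new IllegalArgumentException("need at least one box");
        }
        double[] u = boxes[0].clone();
        for (double[] b : boxes) {
            u[0] = Math.min(u[0], b[0]); // leftmost edge
            u[1] = Math.min(u[1], b[1]); // smallest "top" value
            u[2] = Math.max(u[2], b[2]); // rightmost edge
            u[3] = Math.max(u[3], b[3]); // largest "bottom" value
        }
        return u;
    }

    public static void main(String[] args) {
        // Percent values as printed for "A", "C" and "E" in the listing above.
        double[][] ace = {
            {13.24, 80.34, 17.15, 86.61},
            {17.15, 80.34, 20.97, 86.61},
            {20.97, 80.34, 24.30, 86.61},
        };
        System.out.println(Arrays.toString(union(ace)));
    }
}
```

Because the merge only takes minima and maxima, it works whichever way the vertical axis points, as long as all boxes use the same convention.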

Farhan.Raza · April 17, 2018, 9:46pm

@bwarner

The Java class you shared includes only the code for getting the position of specific text. It does not convert the PDF to XML, nor does it explain how you are drawing the box in the HTML file. Please note that the coordinate system in Aspose.PDF treats 0,0 as the bottom-left corner, whereas HTML treats 0,0 as the top-left corner, so the rectangle may not be drawn correctly. To achieve your requirements, we would recommend using the approaches we suggested earlier. If you want to use this XML position approach, please elaborate on the scenario and provide a complete sample application reproducing the issue so that we may proceed to help you out.
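The origin difference described above can be handled with a vertical flip before the values are used in CSS. The following is a minimal sketch, not Aspose.PDF code: `CssBox` and `fromPdfRect` are hypothetical names, the input is assumed to be a rectangle given by its lower-left corner, width and height in PDF points, and the rectangle in `main` is made up. Whether the values returned by getPosition() correspond to the lower edge of the glyph box is not confirmed in this thread.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of Aspose.PDF.
final class CssBox {
    final double leftPct;
    final double topPct;
    final double widthPct;
    final double heightPct;

    private CssBox(double leftPct, double topPct, double widthPct, double heightPct) {
        this.leftPct = leftPct;
        this.topPct = topPct;
        this.widthPct = widthPct;
        this.heightPct = heightPct;
    }

    // llx, lly: lower-left corner in PDF points, origin at the page's bottom-left.
    // The result is in percentages with the origin at the top-left, as CSS expects.
    static CssBox fromPdfRect(double llx, double lly, double width, double height,
                              double pageWidth, double pageHeight) {
        double left = 100.0 * llx / pageWidth;
        // The box's top edge is at y = lly + height in PDF space;
        // measure it downward from the top of the page instead.
        double top = 100.0 * (pageHeight - (lly + height)) / pageHeight;
        return new CssBox(left, top,
                100.0 * width / pageWidth, 100.0 * height / pageHeight);
    }

    String toStyle() {
        return String.format(Locale.ROOT,
                "position:absolute; left:%.2f%%; top:%.2f%%; width:%.2f%%; height:%.2f%%;",
                leftPct, topPct, widthPct, heightPct);
    }

    public static void main(String[] args) {
        // Made-up rectangle on a 612 x 792 point page.
        CssBox box = fromPdfRect(61.2, 712.8, 122.4, 39.6, 612.0, 792.0);
        System.out.println(box.toStyle());
    }
}
```

The page width and height passed in should be the actual page size. Dividing by 595 x 842 when the page is really 612 x 792, as reported earlier in the thread, would shift the box even after the flip.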

Regarding incorrect page dimensions with getPageInfo method, a ticket with ID PDFJAVA-37651 has been logged in our issue management system for further investigation and resolution. The issue ID has been linked with this thread so that you will receive notification as soon as the issue is resolved.

We are sorry for the inconvenience.

Farhan.Raza · July 10, 2018, 11:23am

@bwarner

Thank you for being patient.

We have further investigated PDFJAVA-37651 and found that it is not a bug. PageInfo is used only for PDF generation: it is created with default values and can be configured when creating new pages. Therefore, please use the PdfFileInfo approach. PdfFileInfo is a class for accessing meta information of a PDF document, and it can be used to analyse and report the actual information for an existing PDF document.

bwarner · July 10, 2018, 2:10pm

Thanks for the info. I completed this project with another vendor a couple of months ago. Thanks again --Bill

Farhan.Raza · July 10, 2018, 7:03pm

@bwarner

Thank you for your kind feedback.

Please feel free to contact us if you need any assistance with Aspose APIs in the future. We will be more than happy to assist you.