Some PDF text positioning is inaccurate

Hello,
I found a problem when using Aspose.PDF on some PDF files.
After opening the file, the text positions differ from the original, and the order of the text is disrupted. Details are as follows.
The x- and y-axis positions of some text exceed the width and height of the current page. For example, the string 2
Some content also comes out in reverse order. See the attachments below.
I have pasted the code below.
I hope you can look into this when you have time. Thanks.

1-copy.jpg (260.6 KB)
2.pdf (1.2 MB)
1.jpg (218.1 KB)

package com.edoc2.inai.strategy.demo;

import com.aspose.pdf.Page;
import com.aspose.pdf.TextFragment;
import com.aspose.pdf.TextFragmentAbsorber;
import com.aspose.pdf.TextFragmentCollection;
import com.edoc2.ics.AsposeUtils;

import javax.imageio.ImageIO;
import java.awt.*;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

	private  int WIDTH = 595;
	private  int HEIGHT = 842;
	BufferedImage image = new BufferedImage(2000, 2000, BufferedImage.TYPE_INT_RGB );
	Graphics g= image.getGraphics();
	
	public void zoom() throws IOException
	{
		g.setColor(Color.white);
		g.fillRect(0,0,2000,2000);  // paint the whole canvas white before drawing
		Image srcImage = ImageIO.read(new File("C:\\Users\\zhangzihao\\Downloads\\1.jpg"));
		ArrayList<TextIndex> list = readPdf();
		g.drawImage(srcImage, 0, 0, WIDTH, HEIGHT, null);  // draw the original picture into image at a fixed size
		for (TextIndex textIndex : list) {
			g.setColor(Color.red);
			g.drawRect( textIndex.xIndent.intValue(), textIndex.yIndent.intValue(), textIndex.width.intValue(), textIndex.fontSize.intValue() ); 
			g.setColor( Color.BLUE );
			Font font = new Font( "楷体", Font.BOLD, textIndex.fontSize.intValue() );
			g.setFont( font );
			g.drawString(textIndex.text, textIndex.xIndent.intValue(), textIndex.yIndent.intValue()+textIndex.fontSize.intValue() );
		}
		g.setColor(Color.red);
		g.drawRect(0,0,WIDTH,HEIGHT);
		ImageIO.write(image, "jpeg", new File("C:\\Users\\zhangzihao\\Downloads\\1-copy.jpg"));  
		ImageIO.write(image, "bmp", new File("C:\\Users\\zhangzihao\\Downloads\\1-bmp.bmp"));  
	}

	public ArrayList<TextIndex> readPdf(){
		com.aspose.pdf.Document document = new com.aspose.pdf.Document("C:\\Users\\zhangzihao\\Pictures\\Saved Pictures2021-09-22 17-00-08\\2.pdf");

		Page page = document.getPages().get_Item(1);
		com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
		page.accept(textAbsorber);
		String source = textAbsorber.getText();
		Double pageWidth = page.getPageInfo().getWidth();
		Double pageHeight = page.getPageInfo().getHeight();
		ArrayList<TextIndex> map = new ArrayList<>();
		WIDTH=pageWidth.intValue();
		HEIGHT=pageHeight.intValue();

		for (int i = 0; i < source.length(); i++) {
			String text=source.substring(i,i+1);
			try {
				TextFragmentAbsorber absorber = new TextFragmentAbsorber(text);
				page.accept(absorber);
				List<TextFragment> list = new ArrayList<>();
				TextFragmentCollection textFragments = absorber.getTextFragments();
				for (Iterator iter = textFragments.iterator(); iter.hasNext();) {
					list.add((TextFragment)iter.next());
				}
				for (TextFragment textFragment : list) {
					Double xIndent = textFragment.getPosition().getXIndent();
					Double yIndent = textFragment.getPosition().getYIndent();
					Double fontSize = Double.valueOf(String.valueOf(textFragment.getTextState().getFontSize()));
					Matcher matcher = Pattern.compile("[^\\x00-\\xff]").matcher(text);
					int count = 0;
					while (matcher.find()){
						count++;
					}
					Double width =(fontSize*count+fontSize*0.6*(text.length()-count));
					TextIndex textIndex = new TextIndex(pageWidth, pageHeight, xIndent, yIndent, fontSize, width,text);
					map.add(textIndex);
				}
			}catch (Exception e){
				System.out.println(text);
			}
		}
		return map;
	}
	class TextIndex{
		Double PageWidth;
		Double PageHeight;
		Double xIndent;
		Double yIndent;
		Double fontSize;
		Double width;
		String text;

		public TextIndex(Double pageWidth, Double pageHeight, Double xIndent, Double yIndent, Double fontSize, Double width,String text) {
			PageWidth = pageWidth;
			PageHeight = pageHeight;
			this.xIndent = xIndent;
			this.yIndent = yIndent;
			this.fontSize = fontSize;
			this.width = width;
			this.text =  text;
		}
	}



	public static void main(String[] args) throws IOException {
		getPDFLicense();
		// TODO Auto-generated method stub
		new Main().zoom();
	}
	public static void getPDFLicense(){
		try(InputStream is = AsposeUtils.class.getClassLoader().getResourceAsStream("Aspose.Total.Java.lic");) {
			com.aspose.pdf.License wordsLic = new com.aspose.pdf.License();
			wordsLic.setLicense(is);
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
}
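
A side note on the loop in readPdf(): it searches the whole page once for every character of the extracted text, so a character that occurs several times is found, and added to the list, several times over. The plain-Java simulation below makes no Aspose calls; it only imitates that loop on a string, under the assumption that each search returns every occurrence of the character in page order. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class PerCharacterSearchDemo {

    /**
     * Imitates the readPdf() loop: for each character of the text,
     * collect the index of every occurrence of that character.
     */
    static List<Integer> searchPerCharacter(String source) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < source.length(); i++) {
            char wanted = source.charAt(i);
            // Stand-in for "search the whole page for this character".
            for (int j = 0; j < source.length(); j++) {
                if (source.charAt(j) == wanted) {
                    hits.add(j);
                }
            }
        }
        return hits;
    }

    /** Visits every position exactly once, in order. */
    static List<Integer> singlePass(String source) {
        List<Integer> hits = new ArrayList<>();
        for (int j = 0; j < source.length(); j++) {
            hits.add(j);
        }
        return hits;
    }

    public static void main(String[] args) {
        String sample = "abca"; // 'a' occurs twice
        System.out.println("per-character search: " + searchPerCharacter(sample));
        System.out.println("single pass:          " + singlePass(sample));
    }
}
```

For the sample "abca" the per-character search yields six entries instead of four, and they no longer follow the order of the text. If the real absorber behaves as assumed here, collecting all fragments in one pass would avoid both effects.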

@zzh22

Could you please share some more detail about your requirement and use case? We will then investigate the issue and provide you with more information on it. Please also share your expected output. Thanks for your cooperation.

The attached code uses Aspose to read the text of the PDF file and then draws that text into a BMP image at the positions Aspose reports.

The results show that the page width and height returned by Aspose (page.getPageInfo().getWidth() and page.getPageInfo().getHeight()) are 595 * 842, yet the positions of the body text fall far outside this range. Besides this mismatch between the text positions and the page size, the order of the text is also wrong. For example, the Chinese date string “220年11月18日” (meaning November 18, 2022; it is OCR output) should be at the end of the text, but according to the positions Aspose reports, it ends up at the top.
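
One thing that may be worth ruling out, offered as an assumption and not as a confirmed diagnosis: PDF user space normally has its origin at the bottom-left corner of the page with y growing upward, while Java2D has its origin at the top-left with y growing downward. Drawing PDF coordinates directly with Graphics would therefore put text from the bottom of the page near the top of the image. The sketch below shows the conversion, a bounds check against the page size, and a top-to-bottom ordering. The class, the Box holder, and the sample numbers are all hypothetical, and it assumes the reported y marks the bottom edge of the text box.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PdfToImageCoords {

    /** Hypothetical holder for one text box in PDF user space (origin bottom-left). */
    static final class Box {
        final String text;
        final double x;
        final double y;        // assumed: bottom edge of the text, measured up from the page bottom
        final double fontSize;

        Box(String text, double x, double y, double fontSize) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.fontSize = fontSize;
        }
    }

    /** Converts a PDF y value to the top edge of the box in image space (y grows downward). */
    static double toImageTop(double pdfY, double fontSize, double pageHeight) {
        return pageHeight - pdfY - fontSize;
    }

    /** True if the point lies within a page of the given size. */
    static boolean insidePage(double x, double y, double pageWidth, double pageHeight) {
        return x >= 0 && x <= pageWidth && y >= 0 && y <= pageHeight;
    }

    /** Reading order for horizontal text: largest PDF y first, then left to right. */
    static List<Box> readingOrder(List<Box> boxes) {
        List<Box> sorted = new ArrayList<>(boxes);
        sorted.sort(Comparator.comparingDouble((Box b) -> -b.y).thenComparingDouble(b -> b.x));
        return sorted;
    }

    public static void main(String[] args) {
        double pageWidth = 595;
        double pageHeight = 842;
        List<Box> boxes = new ArrayList<>();
        boxes.add(new Box("date", 400, 60, 12));   // near the bottom of the page
        boxes.add(new Box("title", 200, 780, 18)); // near the top of the page
        boxes.add(new Box("body", 72, 700, 12));
        for (Box b : readingOrder(boxes)) {
            System.out.printf("%s: image top = %.1f, inside page = %b%n",
                    b.text, toImageTop(b.y, b.fontSize, pageHeight),
                    insidePage(b.x, b.y, pageWidth, pageHeight));
        }
    }
}
```

If the flipped positions still fall outside the page, the page size used for the comparison may not be the size of the page actually stored in the file, which would be a separate question.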

We can use libraries in Python or JavaScript (PDF.js), or desktop software such as Adobe PDF reader, to read the text in this PDF and locate it correctly, so the PDF file itself should be fine. Please help us confirm whether this is a problem in Aspose itself or in how we use it. Thank you.

2.pdf (1.2 MB)
3.jpg (151.9 KB)

@zzh22

We have logged a ticket for your case in our issue tracking system as PDFJAVA-41239. We will inform you via this forum thread once there is an update available on it.

We apologize for the inconvenience.