Aspose.Pdf introduces spaces when the PDF document has Chinese characters

Hi,


I am trying to extract text from a PDF document that contains some Chinese characters. When I use the TextAbsorber, it introduces unnecessary spaces between the characters.

private String extractText(byte[] inputContent) throws IOException {
    ByteArrayInputStream stream = new ByteArrayInputStream(inputContent);
    Document pdfDocument = new Document(stream);

    // Create TextAbsorber object to extract text
    TextAbsorber textAbsorber = new TextAbsorber();

    // Accept the absorber for all the pages
    pdfDocument.getPages().accept(textAbsorber);

    // Get the extracted text
    String extractedText = textAbsorber.getText();
    return extractedText;
}


I am also attaching the Java code and the PDF file.

The full Java code is below:


import java.io.IOException;

import com.aspose.pdf.Document;
import com.aspose.pdf.TextAbsorber;

public class PDFTester {

    private static final String pdf_license = "Aspose.Pdf.lic";

    static {
        // Initialize the Aspose.PDF license
        try {
            com.aspose.pdf.License pdfLic = new com.aspose.pdf.License();
            pdfLic.setLicense(pdf_license);
        } catch (Exception e) {
            // Report a licensing failure instead of swallowing it silently
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws IOException {
        String extractedText = extractText("aspose_issue.pdf");
        System.out.println(extractedText);
    }

    private static String extractText(String file) throws IOException {
        Document pdfDocument = new Document(file);

        // Create TextAbsorber object to extract text
        TextAbsorber textAbsorber = new TextAbsorber();

        // Accept the absorber for all the pages
        pdfDocument.getPages().accept(textAbsorber);

        // Get the extracted text
        String extractedText = textAbsorber.getText();
        return extractedText;
    }
}






The console output of this is:




Lead Company: t3

Lead Createdby Name: Muthukrishnan Manoharan

Lead Email: t1@t3.com

Lead Firstname: பபg0 嚡嵼歆悍賸澯袕榖匝炶媨央氫宔一டட

Lead Lastname: பபg0 嚡牥搇梲椱粲拥傕豌襼瞫瑲掽睌一டட

Lead Name: பபg0 嚡嵼歆悍賸澯袕榖匝炶媨央氫宔一டட பபg0 嚡牥搇梲椱粲拥傕豌襼瞫瑲掽睌一டட

Lead: பபg0 嚡嵼歆悍賸澯袕榖匝炶媨央氫宔一டட பபg0 嚡牥搇梲椱粲拥傕豌襼瞫瑲掽睌一டட

Lead Ownerid: 00590000002lqJ1AAI

Lead Phone: (989) 080-9809

Firstname: பபg0 嚡嵼歆悍賸澯袕榖匝炶媨央氫宔一டட

Lastname: பபg0 嚡牥搇梲椱粲拥傕豌襼瞫瑲掽睌一டட

Contact Firstname: பபg0 嚡嵼歆悍賸澯袕榖匝炶媨央氫宔一டட

Contact Lastname: பபg0 嚡牥搇梲椱粲拥傕豌襼瞫瑲掽睌一டட

Greetingcasual: பபg0 嚡嵼歆悍賸澯袕榖匝炶媨央氫宔一டட

Greetingformal: Mr. பபg0 嚡牥搇梲椱粲拥傕豌襼瞫瑲掽睌一டட

Lead Fullname: பபg0 嚡嵼歆悍賸澯袕榖匝炶媨央氫宔一டட பபg0 嚡牥搇梲椱粲拥傕豌襼瞫瑲掽睌一டட

Contact Fullname: பபg0 嚡嵼歆悍賸澯袕榖匝炶媨央氫宔一டட பபg0 嚡牥搇梲椱粲拥傕豌襼瞫瑲掽睌一டட

Leadowner Companyname:CC




Notice the spaces between the characters.

Hi there,

Thanks for your inquiry. I have tested your scenario with your shared document using Aspose.Pdf for Java 10.1.0 and was able to reproduce the reported issue. For further investigation, I have logged it in our issue tracking system as PDFNEWJAVA-34789 and linked your request to it. We will keep you updated on the issue status via this thread.

Please feel free to contact us for any further assistance.

Best Regards

Thanks, Tilal Ahmad.


May I know the ETA for this issue, and how can I track the JIRA issue ID that you provided?

Please also let me know the version of Aspose in which this issue will be fixed.

Thanks
Muthu


Hi Muthu,


Thanks for your patience.

As we have only recently noticed this issue, we are not able to share a timeline for its resolution until we have investigated it and identified the actual cause.

However, as soon as we have made significant progress towards resolving it, we will be more than happy to update you on the status of the fix. Please be patient and allow us a little time. Your patience and understanding are greatly appreciated.


As for issue tracking, I am afraid you will not be able to access our internal issue tracking system, but as soon as we have made definite progress towards a resolution, we will update you in this forum thread.

Hi Muthu,


Thanks for your patience. We have investigated the logged issue and suggest that you use TextFormattingMode.Raw to prevent the extra spaces:

textAbsorber.getExtractionOptions().setFormattingMode(TextExtractionOptions.TextFormattingMode.Raw);


Furthermore, in TextFormattingMode.Pure mode, the setScaleFactor(…) method can also be used to control the amount of space between words.
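To check whether a given formatting mode actually removed the extra spaces, it may help to count the whitespace gaps that sit directly between two Chinese characters in the extracted string. The following is only a minimal sketch using the JDK alone; the class name CjkSpaceCheck and the method countGapsBetweenHan are hypothetical helpers for illustration and are not part of Aspose.Pdf.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical diagnostic helper; not part of the Aspose.Pdf API.
final class CjkSpaceCheck {

    // True when the code point belongs to the Han script (Chinese characters).
    static boolean isHan(int codePoint) {
        return UnicodeScript.of(codePoint) == UnicodeScript.HAN;
    }

    // Counts runs of spaces or tabs that sit directly between two Han characters.
    static int countGapsBetweenHan(String text) {
        int[] cps = text.codePoints().toArray();
        int gaps = 0;
        int i = 0;
        while (i < cps.length) {
            if (!isHan(cps[i])) {
                i++;
                continue;
            }
            // Skip over any spaces or tabs that follow this Han character.
            int j = i + 1;
            while (j < cps.length && (cps[j] == ' ' || cps[j] == '\t')) {
                j++;
            }
            // A gap counts only if whitespace was skipped and a Han character follows.
            if (j > i + 1 && j < cps.length && isHan(cps[j])) {
                gaps++;
            }
            i = j;
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Sample strings written with Unicode escapes to avoid source-encoding issues.
        String spaced = "\u4e2d \u6587 \u5b57";
        String compact = "\u4e2d\u6587\u5b57";
        System.out.println("gaps in spaced text:  " + countGapsBetweenHan(spaced));
        System.out.println("gaps in compact text: " + countGapsBetweenHan(compact));
    }
}
```

Passing the result of textAbsorber.getText() to countGapsBetweenHan before and after switching the formatting mode should show whether the gap count drops to zero. Spaces next to Latin text are not counted, so legitimate word separators are ignored.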

Please feel free to contact us for any further assistance.

Best Regards,