I’m having trouble extracting text from a PDF. I am using Aspose version 23.11. Here are the document and the code I used:
MMC 7-2025 - Indicação políticas públicas Repp 2026 (Relatório de Fiscalização em Políticas Públicas e Programas de Governo) - Doc 77.906.807-2-1.pdf (1.7 MB)
private ByteArrayOutputStream trataDocumentoPdf(String absoluteFileName, String convertedFileName) throws IOException {
    com.aspose.pdf.Document doc = new com.aspose.pdf.Document(absoluteFileName);
    TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
    doc.getPages().accept(textAbsorber);
    String extractedText = textAbsorber.getText();
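    // The replacement string was split across two lines in the post; a literal line break is assumed here.
    extractedText = extractedText.replaceAll("\r\n", "\n\r\n");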
    DocSaveOptions saveOptions = new DocSaveOptions();
    saveOptions.setFormat(DocSaveOptions.DocFormat.DocX);
    doc.save(convertedFileName, saveOptions);
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    // Encode explicitly so the result does not depend on the platform default charset.
    baos.write(extractedText.getBytes(java.nio.charset.StandardCharsets.UTF_8));
    return baos;
}
@sueliton
The error message “startIndex + length cannot exceed length of string” typically indicates that there is an attempt to access a substring of a string using an index that is out of bounds. This can happen during text extraction if the PDF document contains unexpected formatting or if the text extraction process encounters an issue.
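To make the failing condition concrete, here is a minimal sketch in plain Java. It is not Aspose code: the class `SubstringBounds` and its methods are hypothetical, and the exception message is only modeled on the one quoted above. It shows the kind of bounds check that fails when a start index plus a length reaches past the end of a string.

```java
// Hypothetical illustration only; this is not how Aspose.PDF is implemented.
final class SubstringBounds {

    // True when the range [startIndex, startIndex + length) fits inside text.
    // Written as a subtraction so that large values cannot overflow int.
    static boolean inBounds(String text, int startIndex, int length) {
        return startIndex >= 0 && length >= 0 && startIndex <= text.length() - length;
    }

    // Returns length characters starting at startIndex, or fails with a
    // message modeled on the reported error when the range does not fit.
    static String slice(String text, int startIndex, int length) {
        if (!inBounds(text, startIndex, length)) {
            throw new IllegalArgumentException(
                "startIndex + length cannot exceed length of string");
        }
        return text.substring(startIndex, startIndex + length);
    }

    public static void main(String[] args) {
        String sample = "Repp 2026"; // 9 characters
        System.out.println(slice(sample, 0, 4)); // the range fits
        try {
            slice(sample, 5, 10); // 5 + 10 is greater than 9, so this fails
        } catch (IllegalArgumentException e) {
            System.out.println("Failed: " + e.getMessage());
        }
    }
}
```

In the reported case the check presumably fails inside the library while it processes the document's text, so the indexes are not something the calling code controls directly.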
Here are a few steps you can take to troubleshoot and resolve this issue:
- Check the PDF Document: Ensure that the PDF document you are trying to extract text from is not corrupted and is formatted correctly. Sometimes, malformed PDFs can lead to unexpected behavior during text extraction.
- Update Aspose Library: Make sure you are using the latest version of the Aspose.PDF library. If you are using version 23.11, consider checking for any updates or patches that might address this issue.
- Modify Text Extraction Options: You can try adjusting the TextExtractionOptions to see if it helps with the extraction process. For example, you can use different formatting modes or options to see if it resolves the issue.
- Use a Simple Example: Test your code with a simpler PDF document to see if the issue persists. This can help determine if the problem is specific to the document you are using.
- Review Example Code: You can refer to example code provided in the Aspose documentation for extracting text from PDF documents. This can help ensure that your implementation aligns with best practices. For instance, the following example demonstrates how to extract text from all pages of a PDF document:
Document pdfDocument = new Document("input.pdf");
TextAbsorber textAbsorber = new TextAbsorber();
pdfDocument.getPages().accept(textAbsorber);
String extractedText = textAbsorber.getText();
- Error Handling: Implement error handling in your code to catch exceptions and log detailed error messages. This can provide more context about where the error is occurring.
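One way to combine the error-handling and isolation steps is to extract text one page at a time and record which pages fail, instead of letting a single exception abort the whole document. The sketch below is a hedged illustration in plain Java: `PageByPageExtraction`, `Result`, and `extractAll` are hypothetical names, and the per-page extractor is passed in as a function so the example runs without the library. With Aspose.PDF, that function would presumably create a fresh TextAbsorber for each page and call something like `doc.getPages().get_Item(page).accept(absorber)`; please verify that call against the API reference for your version. The sketch also assumes the library reports the failure as an unchecked exception.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical helper; the class and method names are illustrative only.
final class PageByPageExtraction {

    // Text gathered from the pages that worked, plus the 1-based numbers of pages that failed.
    static final class Result {
        final StringBuilder text = new StringBuilder();
        final List<Integer> failedPages = new ArrayList<>();
    }

    // Calls extractPage for pages 1..pageCount. A page that throws is logged
    // and recorded, and extraction continues with the next page.
    static Result extractAll(int pageCount, IntFunction<String> extractPage) {
        Result result = new Result();
        for (int page = 1; page <= pageCount; page++) {
            try {
                result.text.append(extractPage.apply(page));
            } catch (RuntimeException e) {
                result.failedPages.add(page);
                System.err.println("Page " + page + " failed: " + e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stub extractor standing in for the real per-page call; page 2 fails on purpose.
        Result result = extractAll(3, page -> {
            if (page == 2) {
                throw new IllegalArgumentException(
                    "startIndex + length cannot exceed length of string");
            }
            return "text of page " + page + "\n";
        });
        System.out.println("Failed pages: " + result.failedPages);
    }
}
```

Knowing which page numbers fail narrows the problem to specific content in the document, which is also useful detail to include in a support request.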
If the issue persists after trying these steps, consider reaching out to the Aspose support team with the specific PDF document and code snippet for further assistance.
Sources:
[1]: ExtractTextFromAllThePagesOfPDFDocument.java