Free Support Forum - aspose.com

ParagraphAbsorber not returning all whitespace

I am using Aspose.PDF to extract the text from a PDF with ParagraphAbsorber. However, it does not appear to return all of the relevant whitespace, such as spaces between words.

The following code is similar to the code we use in our project. It walks through a PDF document page by page and uses ParagraphAbsorber to extract the text (and, eventually, formatting changes) from each page:

	InputStream stream = new FileInputStream("test-documents/simpleWithFormattingChanges.pdf");
	// Second test file (no formatting changes); pass stream_2 to Document to reproduce the second result.
	InputStream stream_2 = new FileInputStream("test-documents/multipleParagraphsNoFormattingChanges.pdf");

	Document doc = new Document(stream);

	PageCollection pages = doc.getPages();
	for (Page page : pages) {
		ParagraphAbsorber paragraphAbsorber = new ParagraphAbsorber();
		paragraphAbsorber.visit(page);
		List<PageMarkup> markups = paragraphAbsorber.getPageMarkups();
		for(PageMarkup markup : markups) {
			List<MarkupSection> sections = markup.getSections();
			for (MarkupSection section : sections) {
				List<MarkupParagraph> paragraphs = section.getParagraphs();
				for (MarkupParagraph paragraph : paragraphs) {
					List<l0t<TextFragment>> lines = paragraph.getLines();
					for (l0t<TextFragment> line : lines) {
						for (int i = 0; i < line.size(); i++) {
							TextFragment fragment = line.get(i);
							System.out.print(fragment.getText());
						}
						System.out.println();
					}					
				}
			}
		}
	}

This is the result of using the file with formatting changes:

First paragraph is plain.
Second paragraph has different font.
Third paragraphhas different font sizes.
Fourth paragraph has bold, underlineand italics.

Notice the missing spaces after the words “paragraph” and “underline”.

If you use the other file, which has no formatting changes, the result is:

First paragraph is plain.
Second paragraph is plain.
Third paragraphis plain.
Fourth paragraph is plain.

Notice the missing space after the word “paragraph”.
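
The dropped spaces can also be located programmatically by comparing each extracted line with the text it was expected to contain. The following is a minimal sketch in plain Java. The class MissingSpaceCheck and its method are hypothetical names introduced for illustration, it does not call any Aspose API, and the "expected" strings are assumed to be the original lines with the spaces restored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Aspose.PDF: finds spaces that are present
// in the expected text but absent from the extracted text.
class MissingSpaceCheck {

    // Returns each word in `expected` whose trailing whitespace is missing
    // from `actual`. Throws if the strings differ in any other way.
    static List<String> wordsMissingTrailingSpace(String expected, String actual) {
        List<String> words = new ArrayList<>();
        int i = 0; // index into expected
        int j = 0; // index into actual
        while (i < expected.length()) {
            char e = expected.charAt(i);
            if (j < actual.length() && e == actual.charAt(j)) {
                i++;
                j++;
            } else if (Character.isWhitespace(e)) {
                // Whitespace was expected here but not extracted:
                // walk back to the start of the preceding word and record it.
                int start = i;
                while (start > 0 && !Character.isWhitespace(expected.charAt(start - 1))) {
                    start--;
                }
                if (start < i) {
                    words.add(expected.substring(start, i));
                }
                i++;
            } else {
                throw new IllegalArgumentException("Texts differ at expected index " + i);
            }
        }
        if (j != actual.length()) {
            throw new IllegalArgumentException("Extracted text has extra characters at index " + j);
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(wordsMissingTrailingSpace(
                "Third paragraph has different font sizes.",
                "Third paragraphhas different font sizes.")); // prints [paragraph]
        System.out.println(wordsMissingTrailingSpace(
                "Fourth paragraph has bold, underline and italics.",
                "Fourth paragraph has bold, underlineand italics.")); // prints [underline]
    }
}
```

A check of this kind could be run over each printed line to confirm that only whitespace is lost and to list the affected words.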

Note that each of these PDF files was created by selecting “Save As PDF” from Microsoft Word.

Any help would be greatly appreciated.

simpleWithFormattingChanges.pdf (190.9 KB)
multipleParagraphsNoFormattingChanges.pdf (86.9 KB)

@kocke

Thank you for contacting support.

We have been able to reproduce the issue in our environment. A ticket with ID PDFJAVA-38537 has been logged in our issue management system for further investigation and resolution. The ticket ID has been linked with this thread so that you will receive notification as soon as the ticket is resolved.

We are sorry for the inconvenience.
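
Until the ticket is resolved, one possible stopgap is to insert a space wherever two consecutive fragments on a line are separated by a horizontal gap. The sketch below shows that heuristic in plain Java. It is an assumption, not a fix confirmed in this thread: GapJoiner, Frag, and the minGap threshold are hypothetical, and the left and right coordinates are assumed to come from each fragment's bounding box. Whether it recovers the spaces dropped here depends on whether the affected fragments are actually separated by a gap, which the thread does not establish.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical workaround sketch, not an Aspose API.
class GapJoiner {

    // Stand-in for an extracted fragment: its text plus the left and right
    // x coordinates of its bounding box (assumed to be available).
    static final class Frag {
        final String text;
        final double left;
        final double right;

        Frag(String text, double left, double right) {
            this.text = text;
            this.left = left;
            this.right = right;
        }
    }

    // Concatenates the fragments of one line, adding a single space when the
    // gap to the previous fragment exceeds minGap and neither side already
    // supplies whitespace at the boundary.
    static String joinLine(List<Frag> line, double minGap) {
        StringBuilder sb = new StringBuilder();
        Frag prev = null;
        for (Frag f : line) {
            boolean gap = prev != null && f.left - prev.right > minGap;
            boolean leftHasSpace = sb.length() > 0
                    && Character.isWhitespace(sb.charAt(sb.length() - 1));
            boolean rightHasSpace = !f.text.isEmpty()
                    && Character.isWhitespace(f.text.charAt(0));
            if (gap && !leftHasSpace && !rightHasSpace) {
                sb.append(' ');
            }
            sb.append(f.text);
            prev = f;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates for illustration only.
        List<Frag> line = Arrays.asList(
                new Frag("Third ", 0, 30),
                new Frag("paragraph", 30, 80),
                new Frag("has", 83, 100));
        System.out.println(joinLine(line, 1.0)); // prints Third paragraph has
    }
}
```

The threshold would need tuning per document, since a value that is too small splits words and one that is too large misses real spaces.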