Converting pdf to txt with TextAbsorber has issues with short runs of text

shaun.stone · September 4, 2024, 10:41am

I’m using a TextAbsorber to convert a pdf file to a txt file, but I’ve noticed that it sometimes pushes short runs of text all the way to the left of the line, right up against the line number.
It seems to depend on the font used and may be limited to monospaced fonts. Adding spaces and tabs sometimes changes the behaviour.
In the example pdf file, I’ve highlighted the lines that cause issues upon conversion to txt. I was blocked from uploading the output txt file, but the code below should generate it.

On a more general note, the reason I’m writing this code is actually to convert a .docx file to a .txt file in much the same way that Microsoft Word does when printing to the “Generic / Text Only” printer: the layout of the .txt file mimics the page layout of the .docx file (i.e., indenting, line spacing, page separation, etc.) rather than just writing the text to a file as one big string.
I’ve not been able to find a built-in way to do this with Aspose.Words. The only somewhat straightforward way I’ve come up with so far is to convert it to pdf first, then use the Aspose.Pdf.Text.TextAbsorber class as in the provided code (albeit with some extra code that I’ve omitted to keep things simple).
Have I overlooked a simple way to achieve this “print layout” .docx to .txt conversion using just Aspose.Words? The company I work for primarily deals with .docx files and has so far been able to achieve a vast amount with just its Aspose.Words licence. It would be great if it could avoid needing to buy an Aspose.Pdf licence for a single, comparatively tiny task that really isn’t related to pdf documents.

Thanks.

Conversion error examples.pdf (29.5 KB)

using System.IO;       // File
using Aspose.Pdf;      // Document
using Aspose.Pdf.Text; // TextAbsorber, TextExtractionOptions

// Open the pdf
Document pdf = new("Conversion error examples.pdf");
// Write the text to a text file using a text absorber
TextAbsorber textAbsorber = new(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure));
pdf.Pages.Accept(textAbsorber);
File.WriteAllText("Conversion error examples.txt", textAbsorber.Text);

asad.ali · September 4, 2024, 9:37pm

@shaun.stone

We were able to reproduce the issue in our environment using version 24.8, and it has been logged as PDFNET-58074 in our issue management system for further analysis. We will let you know once the ticket is closed.

Regarding Aspose.Words, we recommend posting this question in the Aspose.Words category, where you will be assisted in achieving the expected output from .docx files. There might be a workaround for this behavior of the API. Otherwise, this issue would need to be addressed after all.