PDF column formatting after extracting text

TCowell · January 8, 2014, 3:20am

Hi, we’ve been using the TextAbsorber class to extract text from a PDF document and mostly this has been working well. However we have hit upon an issue if the document contains columns.

The behavior of Aspose does not appear to be consistent from one document to the next, and in some cases the text stream is unreadable because you can’t easily tell where the gap between the two columns lies. It seems that if the text lines in the columns are all of equal length we get 3 whitespace characters between columns; otherwise we get 1 and the text is not aligned, so the resulting output looks much like one block of text.

We’re using version 8.8 of the PDF library.

Some sample code…

Page p = doc.Pages[PageNumber];
TextAbsorber textAbsorber = new TextAbsorber();
p.Accept(textAbsorber);

The text in ‘textAbsorber.Text’ is then used. I’ve tried constructing the TextAbsorber with both the Pure and Raw formatting options, but this doesn’t seem to make a difference.

Is there anything in code we can do to improve this situation? Or is this a bug? Is it possible to have the code return us all text from the first column then the second instead of returning adjacent lines from each column in turn?

To see the issue I have attached two documents: equalSpacing.pdf ends up with 3 whitespace characters between columns, whereas notEqualSpacing.pdf ends up with 1 character and the columns are not aligned. Both documents were created in Microsoft Word.

Your own demo on the link below returns output with the issues described when these documents are uploaded.

tilal.ahmad · January 9, 2014, 12:06am

Hi Tim,

We are sorry for the inconvenience caused. While testing the scenario with the latest version of Aspose.Pdf for .NET 8.8.0, we have managed to reproduce the reported issue and logged it in our bug tracking system as PDFNEWNET-36229 for further investigation and resolution. We will notify you via this thread as soon as it is resolved.

Please feel free to contact us for any further assistance.

Best Regards,

tilal.ahmad · January 9, 2014, 12:24am

Hi Tim,

As a workaround you can reduce the font size of a PDF document and then extract the text. Please check the following code snippet for the purpose. However, we will keep you updated about the original issue’s progress.

Document pdfDocument = new Document(myDir + "notEqualSpacing.pdf");
MemoryStream ms = new MemoryStream();

TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
pdfDocument.Pages.Accept(textFragmentAbsorber);

TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
foreach (TextFragment textFragment in textFragmentCollection)
{
    //need to reduce font size : 90%
    textFragment.TextState.FontSize = textFragment.TextState.FontSize * 0.9f;
}

pdfDocument.Save(ms);
pdfDocument = new Document(ms);

//create TextAbsorber object to extract text
TextAbsorber textAbsorber = new TextAbsorber();
//accept the absorber for a particular page
pdfDocument.Pages[1].Accept(textAbsorber);

//get the extracted text
string extractedText = textAbsorber.Text;

// create a writer and open the file
TextWriter tw = new StreamWriter(myDir + "notEqualSpacing-text.txt");
// write a line of text to the file
tw.WriteLine(extractedText);
// close the stream
tw.Close();
ms.Close();

Best Regards,

TCowell · January 9, 2014, 4:10am

Hi Tilal, thanks for the reply.

I’ve tried your temporary work around and it does go some way to improving the situation, but it seems that this is dependent on the input text. In my example the text lines are of a fairly uniform length but real world text won’t be and the column spacing is still not aligned. I’m getting wider gaps between column lines, which helps, but I’m also seeing extra white space characters between words on the same line.

Can I ask what the expected behavior should be once the bug is fixed? How is the spacing size between columns determined? Will the columns be aligned? And will it be possible to read all text from one column at a time?

Also, the for loop in your workaround is fairly CPU intensive. Is there a more efficient way of reducing the font size?

Thanks again.

Tim

codewarior · January 9, 2014, 11:14am

TCowell:

I’ve tried your temporary work around and it does go some way to improving the situation, but it seems that this is dependent on the input text. In my example the text lines are of a fairly uniform length but real world text won’t be and the column spacing is still not aligned. I’m getting wider gaps between column lines, which helps, but I’m also seeing extra white space characters between words on the same line.

Can I ask what the expected behavior should be once the bug is fixed? How is the spacing size between columns determined? Will the columns be aligned? And will it be possible to read all text from one column at a time?

Hi Tim,

I am afraid Aspose.Pdf for .NET does not currently support extracting text by column. I have logged this requirement as PDFNEWNET-36233 in our issue tracking system. We will look further into the details and keep you updated on its status. Please be patient and allow us a little time. We are sorry for the inconvenience.

TCowell:

Also, the for loop in your workaround is fairly CPU intensive. Is there a more efficient way of reducing the font size?

CPU utilization increases because the font size of each individual TextFragment is changed.

asad.ali · March 21, 2018, 2:45pm

@TCowell

Thanks for your patience.

We have further investigated the earlier logged issue (PDFNET-36229) and found that the observed effect is not a bug. It is caused by the difference between the widths of narrow symbols (such as “i”, “t” and “ ”) in the PDF’s proportional-width font and in the monospaced font (grid) that is used to calculate the number of spaces when formatting the text document. When a text line containing many narrow symbols is added to the formatted text, it becomes longer and may reach the start of the next column. In this case no additional spaces are added.

We also cannot change the text extraction defaults, because doing so could change the output of existing documents. Therefore, we have introduced the ScaleFactor option in the TextExtractionOptions class in Aspose.PDF for .NET 18.3. This option affects the size of the grid that is used for formatting text. Changing it solves most of the problems with spaces in the formatted text. For this scenario we recommend a ScaleFactor value between 0.7 and 0.9.
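As a rough illustration of the effect described above, the toy C# sketch below maps a fragment's horizontal position onto a fixed-width character grid and shows why a line made of narrow glyphs can swallow the gap between columns. It is not Aspose's actual algorithm: GridDemo, GridColumn, LayoutLine, the 6-point base cell width and the 192-point column position are all assumptions invented for the illustration, and no Aspose API is called.

```csharp
using System;

static class GridDemo
{
    // Toy model (assumption): a fragment starting at x points is placed in the
    // grid column nearest to x / (baseCellWidth * scaleFactor).
    public static int GridColumn(double x, double baseCellWidth, double scaleFactor)
    {
        return (int)Math.Round(x / (baseCellWidth * scaleFactor));
    }

    // Builds one output line from the left-column text and the right-column
    // text, where the right column starts at rightX points on the page.
    public static string LayoutLine(string left, string right, double rightX,
                                    double baseCellWidth, double scaleFactor)
    {
        int target = GridColumn(rightX, baseCellWidth, scaleFactor);
        // If the left text already reaches the target column, only a single
        // separating space is emitted and the columns appear to run together.
        int pad = Math.Max(1, target - left.Length);
        return left + new string(' ', pad) + right;
    }

    public static void Main()
    {
        // Many narrow glyphs: short on the page, but 33 characters long on the grid.
        string narrow = "it is a little bit illicit, still";
        // Few wide glyphs: only 9 characters long on the grid.
        string wide = "WWWW MMMM";

        foreach (double scale in new[] { 1.0, 0.8 })
        {
            Console.WriteLine("ScaleFactor = " + scale);
            Console.WriteLine(LayoutLine(narrow, "RIGHT", 192, 6.0, scale));
            Console.WriteLine(LayoutLine(wide, "RIGHT", 192, 6.0, scale));
        }
    }
}
```

In this model, with a scale factor of 1.0 the narrow-glyph line already overruns the target grid column, so it gets a single space and its right-hand text starts two columns later than on the other line. At 0.8 the grid is finer, the target column moves further right, and both right-hand fragments start at the same column.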

Please consider the following code snippet:

//open document
Document pdfDocument = new Document(myDir + "notEqualSpacing.pdf");
//create TextAbsorber object to extract text
TextAbsorber textAbsorber = new TextAbsorber();
//create TextExtractionOptions
TextExtractionOptions options = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
//set ScaleFactor value
options.ScaleFactor = 0.8;
//applying the options to TextAbsorber
textAbsorber.ExtractionOptions = options;
//accept the absorber for a particular page
pdfDocument.Pages[1].Accept(textAbsorber);
//get the extracted text
string extractedText = textAbsorber.Text;
// create a writer and open the file
TextWriter tw = new StreamWriter(myDir + "notEqualSpacing-text_scale0.8.txt");
// write a line of text to the file
tw.WriteLine(extractedText);
// close the stream
tw.Close();

For your reference, we have also attached an output TXT file. Please use the latest version of the API in order to use the newly introduced option, and in case of any issue, please feel free to contact us.

notEqualSpacing-text_scale0.8.zip (322 Bytes)

asad.ali · April 22, 2018, 9:04pm

@TCowell

Thanks for your patience.

In reference to the earlier logged ticket PDFNET-36233, the TextAbsorber class does not support text extraction based on the structural elements of a page (such as columns and paragraphs). However, our new ParagraphAbsorber class supports this.

Please consider the following code while using Aspose.PDF for .NET 18.4:

Document doc = new Document(myDir + "notEqualSpacing.pdf");

ParagraphAbsorber absorber = new ParagraphAbsorber();
absorber.Visit(doc);

foreach (PageMarkup markup in absorber.PageMarkups)
{
    Console.WriteLine("- Page {0}.", markup.Number);

    foreach (MarkupSection section in markup.Sections)
    {
        Console.WriteLine("-- Section (column) at {0}.", section.Rectangle);
        StringBuilder sb = new StringBuilder();

        foreach (MarkupParagraph paragraph in section.Paragraphs)
        {
            sb.AppendLine(paragraph.Text);
        }

        Console.WriteLine(sb.ToString());
    }
}

Console_out.zip (351 Bytes)

For your kind reference, a console output is also attached. In case of any further assistance, please feel free to contact us.