Aspose.Words: improve column / text blocks detection

abilger · October 14, 2024, 3:25pm

Hi,

We use Aspose.Words (.Net - version 23.12.0) to convert PDF to HTML.
The Html is used to display the content of the file using a web app and the HTML conversion is also used as input of a LLM (AI - Large Language Model) step.

The output conversion tends to create <table> when detected which is great (Aspose.Pdf doesn’t do that) ! But sometimes it put together some part of the text that are not relevant may change compeletly the sense of the sentences processed by the LLM process.

For example, a PDF rendered like so:

is converted as (style removed for clarity):

<p>
    <span>Over </span><span>307,000</span><span> </span><span>Over</span>
    <span> </span><span>326,000 </span><span>&nbsp;</span><span>Over</span>
    <span> </span><span>695,000 </span><span>deaths from tracheal, </span>
    <span>&nbsp;</span><span>deaths from lower </span><span>&nbsp;</span>
    <span>deaths from chronic </span>
</p>
<p>
    <span>bronchial and lung </span><span>&nbsp;</span><span>respiratory tract </span>
    <span>&nbsp;</span><span>obstructive pulmonary cancer (up </span><span>160%</span>
    <span> </span><span>from </span><span>&nbsp;</span><span>infections (up </span>
    <span>12%</span><span>&nbsp;</span><span>disease (COPD)</span><span>&nbsp; </span>
    <span>1990)</span><span>1</span><span> </span><span>from 1990)</span><span>1</span>
    <span> </span><span>(up </span><span>98%</span><span> </span><span>from 1990)</span>
    <span>1</span>
</p>

With this conversion, the LLM analyze this text:

Over 307,000 Over 326,000 Over 695,000 deaths from tracheal, deaths from lower deaths from chronic.
Bronchial and lung respiratory tract obstructive pulmonary cancer (up 160% from infections (up 12% disease (COPD) 1990)1 from 1990)1 (up 98% from 1990)1

Which is complete non-sense.

Could you try to improve the column / text block detection heuristics ? And/Or the html output to better reflects text blocks ?

Note: Microsoft words doesn’t detect all the text blocks of the sample file (below)

Original PDF
sample1.pdf (212.2 KB)

Code used to convert

// pdfFilePath: orginal pdf file path (input)
// word2htmlFilePath: output html file
using (Stream spdf = new FileStream(pdfFilePath, FileMode.Open))
{
    var wordDoc = new Aspose.Words.Document(spdf);
    wordDoc.Save(word2htmlFilePath, Aspose.Words.SaveFormat.Html);
}

alexey.noskov · October 14, 2024, 3:45pm

@abilger
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): WORDSNET-27485

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

You should note, Aspose.Words is designed to work with MS Word documents. MS Word documents are flow documents and they have structure very similar to Aspose.Words Document Object Model . On the other hand PDF documents are fixed page format documents. While loading PDF document Fixed Page Document structure is converted into the Flow Document Object Model. Unfortunately, such conversion does not guaranty 100% fidelity.

aspose.notifier · May 7, 2025, 11:31am

The issues you have found earlier (filed as WORDSNET-27485) have been fixed in this Aspose.Words for .NET 25.5 update also available on NuGet.