Hi,
We use Aspose.Words (.Net - version 23.12.0) to convert PDF to HTML.
The Html is used to display the content of the file using a web app and the HTML conversion is also used as input of a LLM (AI - Large Language Model) step.
The output conversion tends to create <table> when detected which is great (Aspose.Pdf doesn’t do that) ! But sometimes it put together some part of the text that are not relevant may change compeletly the sense of the sentences processed by the LLM process.
For example, a PDF rendered like so:
is converted as (style removed for clarity):
<p>
<span>Over </span><span>307,000</span><span> </span><span>Over</span>
<span> </span><span>326,000 </span><span> </span><span>Over</span>
<span> </span><span>695,000 </span><span>deaths from tracheal, </span>
<span> </span><span>deaths from lower </span><span> </span>
<span>deaths from chronic </span>
</p>
<p>
<span>bronchial and lung </span><span> </span><span>respiratory tract </span>
<span> </span><span>obstructive pulmonary cancer (up </span><span>160%</span>
<span> </span><span>from </span><span> </span><span>infections (up </span>
<span>12%</span><span> </span><span>disease (COPD)</span><span> </span>
<span>1990)</span><span>1</span><span> </span><span>from 1990)</span><span>1</span>
<span> </span><span>(up </span><span>98%</span><span> </span><span>from 1990)</span>
<span>1</span>
</p>
With this conversion, the LLM analyze this text:
Over 307,000 Over 326,000 Over 695,000 deaths from tracheal, deaths from lower deaths from chronic.
Bronchial and lung respiratory tract obstructive pulmonary cancer (up 160% from infections (up 12% disease (COPD) 1990)1 from 1990)1 (up 98% from 1990)1
Which is complete non-sense.
Could you try to improve the column / text block detection heuristics ? And/Or the html output to better reflects text blocks ?
Note: Microsoft words doesn’t detect all the text blocks of the sample file (below)
Original PDF
sample1.pdf (212.2 KB)
Code used to convert
// pdfFilePath: orginal pdf file path (input)
// word2htmlFilePath: output html file
using (Stream spdf = new FileStream(pdfFilePath, FileMode.Open))
{
var wordDoc = new Aspose.Words.Document(spdf);
wordDoc.Save(word2htmlFilePath, Aspose.Words.SaveFormat.Html);
}