Regroup divs to not break multiline sentences

Hi,

We use Aspose.Pdf for .NET (version 23.12.0) to convert PDF to HTML.
The HTML is used to display the content of the file in a web app, and it is also used as input to an LLM (large language model) step.

The conversion output is splendid, but it splits the text into one <div> per line, and the LLM step treats each <div> as the end of a sentence.

For example, a PDF rendered like so:

image.png (4.6 KB)

is converted to (some styles are removed for clarity):

<div style="left:7.2878em;top:14.0278em;"><span>Over </span><span>307,000 &nbsp;</span></div>
<div style="left:7.2878em;top:15.39em;"><span>deaths from tracheal, &nbsp;</span></div>
<div style="left:7.2878em;top:16.29em;"><span>bronchial and lung &nbsp;</span></div>
<div style="left:7.2878em;top:17.1901em;"><span>cancer (up </span><span>160% </span><span>from &nbsp;</span></div>
<div style="left:7.2878em;top:18.0901em;"><span>1990)</span><sup><span>1 &nbsp;</span></sup></div>

which the LLM treats as five sentences:

  • Over 307,000.
  • Deaths from tracheal.
  • Bronchial and lung.
  • Cancer (up 160% from.
  • 1990).

instead of a single one.

Would it be possible to merge the lines of each block of text?

In the example, all the divs have the same “left” value, so they could be grouped into something like:

<div>
    <span style="display: block;left:7.2878em;top:14.0278em;"><span>Over </span><span>307,000 &nbsp;</span></span>
    <span style="display: block;left:7.2878em;top:15.39em;"><span>deaths from tracheal, &nbsp;</span></span>
    <span style="display: block;left:7.2878em;top:16.29em;"><span>bronchial and lung &nbsp;</span></span>
    <span style="display: block;left:7.2878em;top:17.1901em;"><span>cancer (up </span><span>160% </span><span>from &nbsp;</span></span>
    <span style="display: block;left:7.2878em;top:18.0901em;"><span>1990)</span><sup><span>1 &nbsp;</span></sup></span>
</div>

There is one enclosing <div> for each block of text, and each line uses <span style="display:block"> instead of <div>. That does the trick: the span is not treated as the end of a sentence, and because it has display:block the rendering still seems to be correct.
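As a stopgap, the regrouping described above could be applied as a post-processing step on the generated HTML. The following is only a minimal sketch under strong assumptions: `DivRegrouper` and `Regroup` are hypothetical names and not part of Aspose.Pdf, the divs are assumed to look like the simplified example (not nested, with a `left:` value in an inline `style` attribute), and consecutive divs are grouped only when they share exactly the same `left` value. The real output may differ, for instance by carrying positioning in CSS classes.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.Pdf.
public static class DivRegrouper
{
    // Matches one non-nested <div> with an inline style that contains "left:".
    // "style" is the whole style value, "left" the left offset, "body" the inner HTML.
    private static readonly Regex PositionedDiv = new Regex(
        @"<div\s+style=""(?<style>[^""]*?(?<![-\w])left:\s*(?<left>[^;""]+)[^""]*)"">(?<body>.*?)</div>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string Regroup(string html)
    {
        var result = new StringBuilder();
        var group = new List<Match>();   // divs collected for the current block
        int cursor = 0;                  // end of the last matched div

        foreach (Match m in PositionedDiv.Matches(html))
        {
            string between = html.Substring(cursor, m.Index - cursor);

            // A div continues the current block only if nothing but whitespace
            // separates it from the previous div and the left offsets are equal.
            bool sameBlock = group.Count > 0
                && string.IsNullOrWhiteSpace(between)
                && group[0].Groups["left"].Value.Trim() == m.Groups["left"].Value.Trim();

            if (!sameBlock)
            {
                Flush(result, group);
                result.Append(between);
            }
            group.Add(m);
            cursor = m.Index + m.Length;
        }

        Flush(result, group);
        result.Append(html.Substring(cursor));
        return result.ToString();
    }

    // Writes the collected divs as one <div> wrapping display:block spans.
    private static void Flush(StringBuilder output, List<Match> group)
    {
        if (group.Count == 0) return;
        output.Append("<div>\n");
        foreach (Match m in group)
        {
            output.Append("    <span style=\"display: block;")
                  .Append(m.Groups["style"].Value)
                  .Append("\">")
                  .Append(m.Groups["body"].Value)
                  .Append("</span>\n");
        }
        output.Append("</div>");
        group.Clear();
    }
}
```

It could be called on the saved file, for example `DivRegrouper.Regroup(File.ReadAllText(pdf2htmlFilePath))`. Two paragraphs stacked in the same column share a `left` value and would be merged as well, so a threshold on the gap between consecutive `top` values may be needed in practice.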

Original PDF:
sample1.pdf (212.2 KB)

Code used for conversion

// pdfFilePath: path of the original pdf file (input)
// pdf2htmlFilePath: path of the html conversion (output)
using (Stream spdf = new FileStream(pdfFilePath, FileMode.Open))
{
    var pdfDoc = new Aspose.Pdf.Document(spdf);
    var options = new Aspose.Pdf.HtmlSaveOptions();
    options.SimpleTextboxModeGrouping = true;
    options.SaveTransparentTexts = true;
    options.SaveShadowedTextsAsTransparentTexts = true;
    pdfDoc.Save(pdf2htmlFilePath, options);
}

@abilger
Let me look into this issue. I’ll answer you a bit later.

@abilger
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-58369

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

I checked the issue, and so far I haven’t found a way to achieve this.
Therefore I added a task for the development team.
I’ll ask the developers whether there is a workaround, and I’ll contact you if I find something.