Extraction of table border images from PDF

Hi,



I have PDF files containing tables with borders (sample attached). I noticed that when saving a PDF as HTML to disk or to a stream, these borders are exported as background images, one image per page. I tried to extract these images directly from the PDF document, but the Pages[n].Resources.Images collection is empty.



Is there a way to extract table border images directly from the Pdf.Document?



Thanks,





Code snippet:



using System.Drawing.Imaging;
using System.IO;
using System.Linq;
using Aspose.Pdf;
using NUnit.Framework;

[TestCase( @"…...._asposeCases\20140729_191959_1206173.pdf.deid.pdf" )]
public void PageImageResourceExtraction( string pdfFile )
{
    var pdfDoc = new Document( pdfFile );

    var htmlOutFile = pdfFile + ".Aspose.xhtml";
    var options = new HtmlSaveOptions {DocumentType = HtmlDocumentType.Xhtml};
    pdfDoc.Save( htmlOutFile, options );

    // saving to xhtml will create image resource(s) for image representations of table borders, one image per page
    var imageFiles = Directory.GetFiles( pdfFile + ".Aspose_files", "*.svg" );
    Assert.IsTrue( imageFiles.Any(), "there should be image files generated in pdf->xhtml conversion" );

    // Replace Image in Existing PDF File|Aspose.PDF for .NET
    // but image(s) cannot be pulled from the pdf document itself
    var images = pdfDoc.Pages[1].Resources.Images;
    Assert.AreEqual( imageFiles.Count(), images.Count, "there should be an image of table borders on the first page." );

    // Aspose.Pdf collections are 1-based (like Pages[1]), so iterate from 1 to Count
    for ( var i = 1; i <= images.Count; i++ )
    {
        var outImageFile = string.Format( "{0}.image.{1}.png", pdfFile, i );
        using ( var outImageStream = new FileStream( outImageFile, FileMode.Create ) )
        {
            images[i].Save( outImageStream, ImageFormat.Png );
        }
    }
}

Hi Matt,


Thanks for your inquiry. I am afraid currently Aspose.Pdf does not extract table border images directly. We have logged a new feature request as PDFNEWNET-37542 in our issue tracking system for further investigation and resolution. We will notify you as soon as we implement it.

We are sorry for the inconvenience caused.

Best Regards,

Hi,

Just wanted to follow up on the status of this issue - any scheduling details 
you might be able to share?

In the meantime, we tried some workarounds to extract the background
images for the table borders. So far the only approach that works is to
convert the document to an HTML stream first and then parse the [img] tags
out of the resulting HTML. It's feasible, but extremely slow: about 10x longer
than reading the TextFragment objects from the page. Typically, extraction for
a 3-4 page document takes about 4 s, which is not acceptable performance.
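
For reference, the tag-parsing half of that workaround can be sketched roughly as follows. This is only a minimal sketch under stated assumptions: it assumes the converted HTML is already available as a string, and the ImgTagScanner class and ExtractImageSources method are hypothetical names introduced for illustration. It uses only System.Text.RegularExpressions and makes no Aspose.Pdf calls, so it says nothing about the cost of the conversion step itself.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Aspose.Pdf): collects the src attribute
// of every <img> tag found in an HTML string.
public static class ImgTagScanner
{
    // Matches <img ... src="value"> or <img ... src='value'>, case-insensitively.
    // Group 1 captures the quote character, group 2 the attribute value.
    private static readonly Regex ImgSrc = new Regex(
        @"<img\b[^>]*?\bsrc\s*=\s*([""'])(.*?)\1",
        RegexOptions.IgnoreCase | RegexOptions.Singleline );

    public static List<string> ExtractImageSources( string html )
    {
        var sources = new List<string>();
        foreach ( Match match in ImgSrc.Matches( html ) )
        {
            // The value may be a relative file path or an embedded data: URI,
            // depending on how the HTML was generated.
            sources.Add( match.Groups[2].Value );
        }
        return sources;
    }
}
```

A regular expression is enough here only because the input is machine-generated HTML with a predictable shape; for arbitrary HTML a real parser would be the safer choice.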

We also tried splitting the document into pages first and then processing the
pages in parallel as separate documents, using the same sequence as above.
However, when I copy a page from one document into another, the newly created
document cannot be converted to HTML unless some form of .Save() is invoked on
it first (otherwise we get 'cannot read beyond stream ending' exceptions).
Invoking Save() for each single-page document wipes out the time savings, so the
total is about the same as with the first approach described above.

Could you think of any other workarounds we might try to extract page background 
images?

Thanks

Hi Matt,


Thanks for your feedback. I am afraid your issue is still unresolved because of other issues already under investigation. However, we have asked our development team to investigate it at their earliest convenience and share an ETA or their findings. We will update you as soon as we get feedback.

Moreover, please wait for the fix as currently we have no other workaround.

Thanks for your patience and cooperation.

Best Regards,

The issues you have found earlier (filed as PDFNEWNET-37542) have been fixed in Aspose.Pdf for .NET 9.9.0.

