Manipulating PDF tables

mjunaid05 · September 7, 2023, 6:10am

Hi,

We are working on a use case in which we extract tables and table data from PDFs, or convert PDF to DOCX, in a generic way that works across different types of PDF templates. However, the tables are not being extracted correctly. After conversion, the tables are turned into cell blocks, which makes it difficult for us to find the location of the table data. We have also tried Aspose's EnhancedFlow mode for the conversion, but the converted DOCX does not display the data correctly (the columns are misplaced and the data is not mapped to the correct columns). We have tried converting to text too, but that doesn't work as expected either.

Is there a way to extract table data correctly? Also, is there a way to parse the PDF directly to extract information from the tables? Please help us get the desired results, as we found your product interesting and would like to purchase it as well. I have attached the original PDFs and the converted DOCX files for reference.
Thank you.

barclays_3.pdf (270.8 KB)
barclays_output.docx (128.7 KB)
bkos_1.docx (125.7 KB)
bkos_1.pdf (369.7 KB)

asad.ali · September 7, 2023, 3:12pm

@mjunaid05

Can you please share how you are trying to get the location of the table data? Are you using some API, and are you trying to get it from the DOCX files? Also, please share the sample code snippet that you are using so that we can assist you accordingly.

mjunaid05 · September 11, 2023, 6:27am

The sample code is attached below. No, we're not using any API, although we tried other Python libraries (camelot, PyPDF2) to extract tables from the PDF, and that didn't work either. Converting PDF to DOCX was one of the ways we thought of to extract tables for training our ML model, but we're open to other ways as well (if we can parse PDF tables directly, that would be wonderful).

import com.aspose.pdf.DocSaveOptions;
import com.aspose.pdf.Document;

public class ParsePDF {
    public static void main(String[] args) throws Exception {
        // Placeholders: replace with the real input and output paths.
        String pdf_file_path = "pdf_file_path";
        String docx_file_path = "docx_file_path";
        pdf_to_docx(pdf_file_path, docx_file_path);
        System.out.println("DOCX created successfully");
    }

    public static void pdf_to_docx(String pdf_path, String docx_path) {
        // Load the PDF from the path passed in, not from a string literal.
        Document convertPDFDocumentToWord = new Document(pdf_path);
        DocSaveOptions docSaveOptions = new DocSaveOptions();
        docSaveOptions.setFormat(DocSaveOptions.DocFormat.DocX);
        docSaveOptions.setMode(DocSaveOptions.RecognitionMode.EnhancedFlow);

        // docSaveOptions.setRelativeHorizontalProximity(2.5f); - not used in EnhancedFlow mode
        // docSaveOptions.setRecognizeBullets(true); - always true in EnhancedFlow mode

        // Write the result to the requested DOCX path.
        convertPDFDocumentToWord.save(docx_path, docSaveOptions);
    }
}

mjunaid05 · September 11, 2023, 10:02am

Hi @asad.ali, can I please get a response to the above comment ASAP?

asad.ali · September 11, 2023, 3:05pm

@mjunaid05

Can you please check the attached output DOCX files that we obtained in our environment using a licensed version of the API? The table formatting looks better in them. Can you please confirm whether that will do in your case?

bkos_1_out.docx (124.2 KB)
barclays_3_out.docx (175.3 KB)
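
If a converted DOCX is acceptable, one possible way to locate the table data in it is to read the table markup directly. A .docx file is a ZIP archive whose main part, word/document.xml, stores tables as `w:tbl` elements containing rows (`w:tr`) and cells (`w:tc`). The following is only a minimal sketch under those assumptions. `DocxTables`, `readDocx` and `parseTables` are hypothetical names introduced for illustration, and the code uses only the Java standard library, not any Aspose API. It has not been run against the files attached in this thread.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of Aspose.PDF): reads table cells from DOCX XML.
final class DocxTables {
    // WordprocessingML namespace used by w:tbl, w:tr, w:tc and w:t.
    private static final String W =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Opens a .docx (a ZIP archive) and parses its main part, word/document.xml. */
    static List<List<List<String>>> readDocx(String docxPath) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + docxPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return parse(new InputSource(in));
            }
        }
    }

    /** Parses document XML given as a string; result is tables -> rows -> cell texts. */
    static List<List<List<String>>> parseTables(String documentXml) {
        return parse(new InputSource(new StringReader(documentXml)));
    }

    private static List<List<List<String>>> parse(InputSource source) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace-based lookups
            DocumentBuilder builder = factory.newDocumentBuilder();
            org.w3c.dom.Document doc = builder.parse(source);

            List<List<List<String>>> tables = new ArrayList<>();
            // Every w:tbl in document order; a nested table shows up as its own entry.
            NodeList tbls = doc.getElementsByTagNameNS(W, "tbl");
            for (int i = 0; i < tbls.getLength(); i++) {
                List<List<String>> rows = new ArrayList<>();
                // Simplification: only rows that are direct children of the table.
                for (Element tr : children((Element) tbls.item(i), "tr")) {
                    List<String> cells = new ArrayList<>();
                    for (Element tc : children(tr, "tc")) {
                        cells.add(text(tc));
                    }
                    rows.add(cells);
                }
                tables.add(rows);
            }
            return tables;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }

    // Direct child elements of parent with the given local name in the W namespace.
    private static List<Element> children(Element parent, String localName) {
        List<Element> out = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && W.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                out.add((Element) n);
            }
        }
        return out;
    }

    // Concatenates all w:t text inside a cell; paragraphs are joined without a separator.
    private static String text(Element cell) {
        StringBuilder sb = new StringBuilder();
        NodeList runs = cell.getElementsByTagNameNS(W, "t");
        for (int i = 0; i < runs.getLength(); i++) {
            sb.append(runs.item(i).getTextContent());
        }
        return sb.toString();
    }
}
```

Calling `DocxTables.readDocx("bkos_1_out.docx")` would then return one list per table, each holding its rows as lists of cell strings. Whether the cells line up with the intended columns still depends on how well the conversion recognized the tables in the first place.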