Extract file from PDF

michael.burgersiag.i · December 16, 2013, 8:26am

Hi,

I use Aspose.Pdf.Kit 4.2 for Java.

I run this simple code:

InputStream input = new FileInputStream("/home/mburger/tmp/mfm/207_2008_1_T4_2.pdf");
OutputStream output = new FileOutputStream("/home/mburger/tmp/mfm/xxxx.pdf");

PdfFileEditor editor = new PdfFileEditor();
editor.extract(input, 1, 1, output);

output.flush();
output.close();
input.close();

The output file ALWAYS has the same size as the input file. If my document has 10 pages and is 5 MB, and I extract only the first page, the resulting file contains the desired page but is still 5 MB!!! If I extract 2 pages, the file has the same size as the one with a single extracted page!

Can you help me?

michael.burgersiag.i · December 16, 2013, 8:43am

The same thing happens if I try splitToPages …
Every single page has the same size as the WHOLE PDF document:

PdfFileEditor editor = new PdfFileEditor();

int i = 0;
for (ByteArrayOutputStream page : editor.splitToPages("/home/mburger/tmp/mfm/207_2008_1_T4_2.pdf")) {
    i++;
    System.out.println(i);
    // write each split page to its own file
    OutputStream outputStream = new FileOutputStream("/home/mburger/tmp/mfm/x" + i + ".pdf");
    outputStream.write(page.toByteArray());
    outputStream.flush();
    outputStream.close();
}
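
As a side note on the loop above, here is a minimal sketch (plain JDK only, no Aspose calls) of writing each split page with try-with-resources and collecting the resulting file sizes, which makes the "every page is as large as the whole document" symptom easy to confirm. The class `PageWriter` and the method `writePages` are hypothetical names, and the sketch assumes the split result is available as a `ByteArrayOutputStream[]`, as in the loop above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any Aspose API.
final class PageWriter {
    private PageWriter() {}

    /** Writes buffer N to dir/x<N>.pdf (N starts at 1) and returns each file's size in bytes. */
    static List<Long> writePages(ByteArrayOutputStream[] pages, Path dir) throws IOException {
        List<Long> sizes = new ArrayList<>();
        for (int i = 0; i < pages.length; i++) {
            Path target = dir.resolve("x" + (i + 1) + ".pdf");
            // try-with-resources closes the stream even if the write fails
            try (OutputStream out = Files.newOutputStream(target)) {
                pages[i].writeTo(out);
            }
            sizes.add(Files.size(target));
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream fake = new ByteArrayOutputStream();
        fake.write(new byte[] {37, 80, 68, 70}, 0, 4); // stand-in bytes, not a real PDF
        Path dir = Files.createTempDirectory("split");
        System.out.println(writePages(new ByteArrayOutputStream[] {fake}, dir));
    }
}
```

Comparing the returned sizes with the size of the source file shows at a glance whether the split pages still carry the whole document's data.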

codewarior · December 16, 2013, 8:44am

Hi Michael,

Thanks for contacting support.

Could you please share the source PDF file so that we can test the scenario on our end? We are sorry for the inconvenience.

michael.burgersiag.i · December 16, 2013, 8:46am

Same for editor.splitFromFirst!

michael.burgersiag.i · December 16, 2013, 9:12am

I opened a new thread because I wasn’t able to set this thread to private!!

See here:

http://www.aspose.com/community/forums/514596/extract-file-from-pdf-private/showthread.aspx#514596

michael.burgersiag.i · December 16, 2013, 9:43am

Incredible!

On Version 4.4 it doesn’t work at all!!!

Extract creates only an empty file!

That’s not the first time that nothing works after moving from version 4.2 to 4.4 of Aspose.Pdf.Kit…

I love Aspose.Words!!!
But Aspose.Pdf.Kit is very, very unstable, and basic functions like working with attachments don’t work …

michael.burgersiag.i · December 17, 2013, 1:43am

I think I know the problem.

Inserted images are not stored on the PAGE but somewhere else in the PDF file (in a header or something like that, like attachments).
So extracting a page … also extracts all the included images.

Is there a way to tell the PDF file to delete anything it doesn’t use? Or to delete images in the file?

thx
Michael

michael.burgersiag.i · December 17, 2013, 3:07am

OK, I found many ways to extract the first page with the image on the first page!
But no solution works, there are many errors … !!!

For example, I can still use PdfFileEditor to extract the first page and delete the images with PdfContentEditor!

BUT!

pdfContentEditor.deleteImages(1, new int[] {2,3,4,5,6});

doesn’t delete the 2nd, 3rd, 4th, 5th and 6th image! Because your 2nd image is my 1st image!!! But in Aspose.Words I’m inserting the images sequentially … and when I open the PDF file converted by Aspose.Words, I see my first image on the first page!
So when deleting with deleteImages, my first image is the 2nd one in your function!!!

Another bug …
I can do this:
pdfContentEditor.deleteImages(1, new int[] {2, 3});
but this raises an exception:
pdfContentEditor.deleteImages(1, new int[] {2});
pdfContentEditor.deleteImages(1, new int[] {3});
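
One possible explanation for the difference between the two variants, offered purely as an assumption and not confirmed anywhere in this thread: if image positions are 1-based and get renumbered after each deletion, a second call can point at a position that no longer exists. The toy sketch below models a page as a plain list of names; `IndexShiftDemo` and `deleteAt` are hypothetical and do not call or describe the actual Aspose implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: a "page" is a list of image names with 1-based positions.
final class IndexShiftDemo {
    private IndexShiftDemo() {}

    /** Removes the given 1-based positions in one call, highest first, so earlier removals do not shift later ones. */
    static List<String> deleteAt(List<String> images, int[] positions) {
        List<String> copy = new ArrayList<>(images);
        int[] sorted = positions.clone();
        Arrays.sort(sorted);
        for (int i = sorted.length - 1; i >= 0; i--) {
            copy.remove(sorted[i] - 1); // throws IndexOutOfBoundsException if the position is gone
        }
        return copy;
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList("img1", "img2", "img3");
        // One call with both positions works on the original numbering.
        System.out.println(deleteAt(page, new int[] {2, 3}));
        try {
            // Two calls: after removing position 2, only two images remain.
            deleteAt(deleteAt(page, new int[] {2}), new int[] {3});
        } catch (IndexOutOfBoundsException e) {
            System.out.println("second call failed: position 3 no longer exists");
        }
    }
}
```

If renumbering really is the cause, deleting from the highest index down, or recomputing the index after each call, would avoid the exception in this model.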

Then there are other exceptions when using your libs … I don’t have the time to explain them all …

I think the way I can resolve my problem is by using
PdfExtractor …

If I do this:
// Working solution
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf(fileName);
extractor.extractImage();
extractor.getNextImage("/home/mburger/tmp/mfm/image1.pdf");
extractor.close();
// Working solution

It creates a PDF file with the extracted image, and it seems it is always the right image (the first one, not a random one).

But the problem I have … The created PDF file is bigger than the extracted image … but I need an A4 PDF file … so is there a way to create a new A4 PDF file with the extracted image?

I can’t find classes like com.aspose.pdf.Document anymore!

thx
Michael

michael.burgersiag.i · December 17, 2013, 3:14am

OMG

// Working solution
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf(fileName);
extractor.extractImage();
extractor.getNextImage("/home/mburger/tmp/mfm/image1.pdf");
extractor.close();
// Working solution

On version 4.2, with my demo file, I get the first page.
On version 4.4, with my demo file, I get THE LAST PAGE!!!

It is random!

codewarior · December 17, 2013, 10:36am

Hi Michael,

Thanks for contacting support.

In order to split the PDF file into single-page documents, I would recommend that you follow the instructions in Split PDF File to Individual Pages.

codewarior · December 17, 2013, 10:39am

michael.burger@siag.it:

Same for editor.splitFromFirst!

Hi Michael,

I would recommend that you try the approach described in 514933.

codewarior · December 17, 2013, 10:44am

michael.burger@siag.it: Incredible!

On Version 4.4 it doesn’t work at all!!! Extract creates only an empty file!

That’s not the first time that nothing works after moving from version 4.2 to 4.4 of Aspose.Pdf.Kit…

I love Aspose.Words!!! But Aspose.Pdf.Kit is very, very unstable, and basic functions like working with attachments don’t work …

Hi Michael,

Aspose.Pdf.Kit for Java has been discontinued as a separate product, and all its classes and enumerations are now present under the com.aspose.pdf.facades package of the autoported Aspose.Pdf for Java. We recommend that you try the latest release, Aspose.Pdf for Java 4.4.0, and if you still face the same issue, please share some details with a code snippet. We apologize for the inconvenience.

Now, concerning your point about attachments: I used the following code snippet to add an attachment to a PDF file and, as per my observations, the resultant file is generated properly.

[Java]

//open first document
com.aspose.pdf.Document pdfDocument1 = new com.aspose.pdf.Document("c:/pdftest/source.PDF");

//setup new file to be added as attachment
com.aspose.pdf.FileSpecification fileSpecification
= new com.aspose.pdf.FileSpecification("c:/pdftest/Formatted_Test1.pdf", "Sample PDF file");

//add attachment to document's attachment collection
pdfDocument1.getEmbeddedFiles().add(fileSpecification);

// save the updated document containing the attachment
pdfDocument1.save("c:/pdftest/Attachment_output.pdf");

codewarior · December 17, 2013, 11:08am

michael.burger@siag.it: I think I know the problem. Inserted images are not stored on the PAGE but somewhere else in the PDF file (in a header or something like that, like attachments). So extracting a page also extracts all the included images.

Is there a way to tell the PDF file to delete anything it doesn’t use? Or to delete images in the file?

Hi Michael,

Aspose.Pdf for Java supports the feature to optimize the size of the PDF file but, unfortunately, it does not currently support the feature to remove unused objects from the PDF document. For the sake of implementation I have logged this as PDFNEWJAVA-33908 in our issue tracking system. We will further look into the details of this problem and will keep you updated on the status of correction. Please be patient and spare us a little time. We are sorry for this inconvenience.

[Java]

// open first document
com.aspose.pdf.Document pdfDocument1 = new com.aspose.pdf.Document("c:/pdftest/demo.pdf");
// optimize the PDF file
pdfDocument1.optimizeResources();
// save updated document
pdfDocument1.save("c:/pdftest/Optimized.pdf");

codewarior · December 17, 2013, 12:00pm

michael.burger@siag.it:

pdfContentEditor.deleteImages(1, new int[] {2,3,4,5,6});

doesn’t delete the 2., 3., 4., 5. and 6. image! Because your 2. image is my 1. image!!! But on Aspose Words I’m inserting the images sequentially … and on opening the AsposeWord converted PDF File I see my first image on the first page!

Then on deleting it with deleteImages my first image is the 2. in your function!!!

one other bug …

I can do this:

pdfContentEditor.deleteImages(1, new int[] {2, 3});

but this raises an exception:

pdfContentEditor.deleteImages(1, new int[] {2});

pdfContentEditor.deleteImages(1, new int[] {3});

Hi,
Thanks for sharing the details.

I tested the scenario with Aspose.Pdf for Java 4.4.0, using the following code snippet with demo.pdf, and I was unable to reproduce any problem. I ran the component from Eclipse Juno on Windows 7 (x64) with JDK 1.7.

[Java]

com.aspose.pdf.facades.PdfContentEditor editor = new com.aspose.pdf.facades.PdfContentEditor();
editor.bindPdf("c:/pdftest/demo.pdf");

// editor.deleteImage(1, new int[] { 1, 2 });
editor.deleteImage(1, new int[] { 2 });
editor.deleteImage(1, new int[] { 3 });
editor.save("c:/pdftest/ImagesRemoved.pdf");

michael.burger@siag.it:

I think the way I can resolve my problem is using PdfExtractor …

If I do this
    // Working solution
    PdfExtractor extractor = new PdfExtractor();
    extractor.bindPdf(fileName);
    extractor.extractImage();
    extractor.getNextImage("/home/mburger/tmp/mfm/image1.pdf");
    extractor.close();
    // Working solution
It creates a PDF file with the extracted image, and it seems it is always the right image (the first one and not randomized).

But the problem I have … The created PDF file is bigger than the extracted image … but I need an A4 PDF file … so is there a way to create a new A4 PDF file with the extracted image?

In order to set the page size, please try using the following code snippet.

[Java]

// Instantiate PageEditor object
com.aspose.pdf.facades.PdfPageEditor page_editor = new com.aspose.pdf.facades.PdfPageEditor();
// Bind the source PDF file
page_editor.bindPdf("c:/pdftest/ImagesRemoved.pdf");
// Set the page size as A4
page_editor.setPageSize(com.aspose.pdf.facades.PageSize.getA4());
// Save updated document
page_editor.save("c:/pdftest/A4PageSize.pdf");
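
For reference when placing an extracted image on an A4 page, here is a small arithmetic sketch in plain Java. It relies only on two known facts: A4 measures 210 mm x 297 mm, and PDF user space defaults to 72 points per inch. The class `A4Fit` and its methods are hypothetical helpers, not Aspose API, and the "never enlarge" rule is just one reasonable choice.

```java
// Hypothetical helper for illustration; not part of any Aspose API.
final class A4Fit {
    static final double POINTS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    private A4Fit() {}

    /** Converts millimetres to PDF points (1 point = 1/72 inch). */
    static double mmToPoints(double mm) {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    /** Largest uniform scale that fits a width x height box (in points) inside A4, never enlarging. */
    static double fitScale(double width, double height) {
        double a4Width = mmToPoints(210);
        double a4Height = mmToPoints(297);
        return Math.min(1.0, Math.min(a4Width / width, a4Height / height));
    }

    public static void main(String[] args) {
        // A4 in points, rounded to whole points
        System.out.println(Math.round(mmToPoints(210)) + " x " + Math.round(mmToPoints(297)));
        // an image twice as wide as A4 has to be scaled down
        System.out.println(fitScale(2 * mmToPoints(210), 100));
    }
}
```

Multiplying the image's width and height by the returned scale gives a size that fits on the page while keeping the aspect ratio.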

michael.burger@siag.it:

I can’t find classes like com.aspose.pdf.Document anymore

The Document class was introduced in API release 4.0.0. Please try the latest release, Aspose.Pdf for Java 4.4.0, and if you still face any problem, please feel free to contact us.

codewarior · December 17, 2013, 12:07pm

michael.burger@siag.it:

// Working solution
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf(fileName);
extractor.extractImage();
extractor.getNextImage("/home/mburger/tmp/mfm/image1.pdf");
extractor.close();
// Working solution

On version 4.2, with my demo file, I get the first page.
On version 4.4, with my demo file, I get THE LAST PAGE!!!

It is random!

Hi Michael,

The com.aspose.pdf.facades.PdfExtractor class provides the feature to extract text, images and attachments from a PDF document. If you need to get a particular page from a PDF file, please try the following code snippet.

[Java]

// open the source document
com.aspose.pdf.Document pdfDocument1 = new com.aspose.pdf.Document("c:/pdftest/demo.pdf");
// get the page at a particular index of the page collection
com.aspose.pdf.Page pdfPage = pdfDocument1.getPages().get_Item(6);
// create a new Document object
com.aspose.pdf.Document newDocument = new com.aspose.pdf.Document();
// add the page to the pages collection of the new document
newDocument.getPages().add(pdfPage);
// save the newly generated PDF file
newDocument.save("c:/pdftest/page_" + pdfPage.getNumber() + ".pdf");

michael.burgersiag.i · December 27, 2013, 7:21am

I found a workaround (a combination of specific versions of Words + PDF); please see:

Re: Extract file from PDF Private - #13 by michael.burgersiag.i - Free Support Forum - aspose.com

codewarior · December 29, 2013, 12:35pm

Hi Michael,

I am glad to hear that your problem is resolved. Please continue using our products, and in the event of any further query, please feel free to contact us.

codewarior · February 21, 2014, 11:13pm

Hi Michael,

Thanks for your patience.

I am pleased to share that the feature to remove unused objects from a PDF file is now supported and will be included in the next release, Aspose.Pdf for Java 4.6.0 (planned for March 2014). In order to accomplish this requirement, please try the following code snippet.

[Java]

com.aspose.pdf.Document doc = new com.aspose.pdf.Document("source.pdf");
com.aspose.pdf.Document.OptimizationOptions opt = new com.aspose.pdf.Document.OptimizationOptions();
opt.setRemoveUnusedObjects(true);
doc.optimizeResources(opt);
doc.save("optimized.pdf");

codewarior · March 2, 2014, 11:34pm

Hi Michael,

Thanks for your patience. We have further investigated the issue reported earlier and, as per our observations, the pages of the document use shared resources. That’s why all resources are included in the resultant files. To decrease the size, you should use the optimizeResources() method.

[Java]

String myDir = "D:\\";
Document pdfDocument1 = new Document(myDir + "36197.pdf");

// loop through all the pages
for (int pdfPage = 1; pdfPage <= 4 /* or pdfDocument1.getPages().size() */; pdfPage++) {
    // create a new Document object
    Document newDocument = new Document();

    // get the page at a particular index of the Page Collection
    newDocument.getPages().add(pdfDocument1.getPages().get_Item(pdfPage));

    // Optimize the newly created Document
    Document.OptimizationOptions opt = new Document.OptimizationOptions();
    opt.setRemoveUnusedObjects(true);
    opt.setRemoveUnusedStreams(true);
    newDocument.optimizeResources(opt); // try to test with this line commented out

    // save the newly generated PDF file
    newDocument.save(myDir + pdfPage + "_test1.pdf");
}
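
To make the shared-resources explanation above more concrete, here is a toy model in plain Java. It is a sketch under stated assumptions, not a description of how Aspose.Pdf works internally: the class `SharedResourceModel`, the resource names and the byte counts are all hypothetical. It only illustrates why a one-page file that still carries the whole shared pool is as large as the source, and why dropping unreferenced resources shrinks it.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy model only; it does not read or write real PDF files.
final class SharedResourceModel {
    private SharedResourceModel() {}

    /** A page records only the names of the resources its content actually uses. */
    static final class Page {
        final Set<String> usedResources;

        Page(String... used) {
            usedResources = new LinkedHashSet<>(Arrays.asList(used));
        }
    }

    /** Size of a one-page file that copies the whole shared resource pool. */
    static long sizeWithSharedPool(Map<String, Long> pool) {
        long total = 0;
        for (long size : pool.values()) {
            total += size;
        }
        return total;
    }

    /** Size after dropping every resource the page never references. */
    static long sizeAfterRemovingUnused(Page page, Map<String, Long> pool) {
        long total = 0;
        for (String name : page.usedResources) {
            total += pool.getOrDefault(name, 0L);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> pool = new LinkedHashMap<>();
        for (int i = 1; i <= 5; i++) {
            pool.put("img" + i, 1_000_000L); // five hypothetical 1 MB images
        }
        Page first = new Page("img1"); // page 1 draws only the first image
        System.out.println(sizeWithSharedPool(pool));
        System.out.println(sizeAfterRemovingUnused(first, pool));
    }
}
```

In this model the extracted page stays at the full pool size until the unused entries are removed, which mirrors the behaviour reported at the start of the thread.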

aspose.notifier · March 10, 2014, 10:35pm

The issues you have found earlier (filed as PDFNEWJAVA-33908) have been fixed in Aspose.Pdf for Java 4.6.0.