Hey, I have a requirement where I need to stream multiple large PDFs from S3 and merge them one page at a time. Can you guide me on how this can be achieved? Does Aspose.PDF for Java offer this functionality?
I used the logic below, but it fails when large PDFs come into the picture because of high JVM memory usage.
try (Document mergedDoc = new Document()) {
    pdfFileNames.stream()
        .forEach(pdfName -> {
            Path pdfPath = RESOURCE_PDF_DIR.resolve(pdfName);
            try (Document srcDoc = new Document(pdfPath.toString())) {
                // Append the whole page collection of the source document in one call
                mergedDoc.getPages().add(srcDoc.getPages());
                // Bookmark pointing at the first page of the document that was just appended
                OutlineItemCollection bookmark = new OutlineItemCollection(mergedDoc.getOutlines());
                bookmark.setTitle(Strings.CI.removeEnd(pdfName, ".pdf"));
                bookmark.setBold(true);
                bookmark.setItalic(false);
                int firstPageIndex = mergedDoc.getPages().size() - srcDoc.getPages().size() + 1;
                bookmark.setAction(new GoToAction(mergedDoc.getPages().get_Item(firstPageIndex)));
                mergedDoc.getOutlines().add(bookmark);
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
}
@chetanepiq
To merge multiple large PDF files from S3 without running into high JVM memory usage, you can load each document from a stream and process it page by page instead of keeping every file in memory at once. Here is a step-by-step way to achieve this:
- Stream the PDF files from S3: Use an input stream to read the PDF files directly from S3.
- Merge them page by page: Create a new document and add pages one at a time from each source document.
Here’s a sample code snippet to illustrate this:
import com.aspose.pdf.Document;
import com.aspose.pdf.Page;

import java.io.InputStream;
import java.util.List;

public class PdfMerger {

    public static void mergePdfs(List<InputStream> pdfStreams, String outputFilePath) {
        try (Document mergedDoc = new Document()) {
            for (InputStream pdfStream : pdfStreams) {
                // Open each source PDF from its stream and close it as soon as its pages are copied
                try (Document srcDoc = new Document(pdfStream)) {
                    // Aspose.PDF page indexes are 1-based
                    for (int i = 1; i <= srcDoc.getPages().size(); i++) {
                        Page page = srcDoc.getPages().get_Item(i);
                        mergedDoc.getPages().add(page);
                    }
                }
            }
            mergedDoc.save(outputFilePath);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Key Points:
- Replace the List<InputStream> pdfStreams with your own logic to obtain input streams from S3; a sketch of that part follows below.
- This method processes each page individually and keeps only one source document open at a time, which reduces peak memory usage.
This approach should help you merge large PDFs without running into memory issues.
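For the S3 part, a minimal sketch assuming the AWS SDK for Java v2 could look like the following; the S3Client, bucket name, and object keys are placeholders you would supply from your own setup:
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class S3PdfStreams {

    // Opens one stream per object key; each stream reads the PDF body directly from S3.
    public static List<InputStream> openStreams(S3Client s3, String bucket, List<String> keys) {
        List<InputStream> pdfStreams = new ArrayList<>();
        for (String key : keys) {
            GetObjectRequest request = GetObjectRequest.builder()
                    .bucket(bucket) // placeholder bucket name supplied by the caller
                    .key(key)
                    .build();
            pdfStreams.add(s3.getObject(request));
        }
        return pdfStreams;
    }
}
Each returned stream should be closed once the corresponding document has been merged; the mergePdfs method above can consume the returned list directly.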
Any idea how this can be achieved? I don't think we can stream a PDF page by page, so is there any way to avoid the JVM heap memory issue? I am using this code:
pdfFileNames.stream()
    .forEach(pdfName -> {
        Path pdfPath = RESOURCE_PDF_DIR.resolve(pdfName);
        try (Document srcDoc = new Document(pdfPath.toString())) {
            mergedDoc.getPages().add(srcDoc.getPages());
The merging here is what causes the memory issue. Can we do something at this point so the memory spike can be avoided?
@chetanepiq
Once a document has been initialized, either from a stream or from a file path, you can iterate over its pages in a loop to extract them and add them to another PDF document (mergedDoc in your case). Splitting the concatenation at page level lowers memory usage and avoids the spikes. Modifications like the ones below should help you resolve the issue:
pdfFileNames.stream()
    .forEach(pdfName -> {
        Path pdfPath = RESOURCE_PDF_DIR.resolve(pdfName);
        try (Document srcDoc = new Document(pdfPath.toString())) {
            // Copy pages one at a time instead of appending the whole collection in a single call
            for (Page page : srcDoc.getPages()) {
                mergedDoc.getPages().add(page);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    });
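If you also want to keep the per-file bookmarks from your first snippet, the same page-level loop can be extended as follows. This is only a sketch reusing the classes and variables from your earlier code (RESOURCE_PDF_DIR, pdfFileNames, Strings.CI.removeEnd), with a placeholder output path:
// Sketch only: page-level merge combined with the per-file bookmarks from the first snippet.
try (Document mergedDoc = new Document()) {
    for (String pdfName : pdfFileNames) {
        Path pdfPath = RESOURCE_PDF_DIR.resolve(pdfName);
        try (Document srcDoc = new Document(pdfPath.toString())) {
            // Remember where this file's pages will start in the merged document
            int firstPageIndex = mergedDoc.getPages().size() + 1;

            // Add pages one at a time to avoid a single large concatenation step
            for (Page page : srcDoc.getPages()) {
                mergedDoc.getPages().add(page);
            }

            // Bookmark pointing at the first page of the file that was just appended
            OutlineItemCollection bookmark = new OutlineItemCollection(mergedDoc.getOutlines());
            bookmark.setTitle(Strings.CI.removeEnd(pdfName, ".pdf"));
            bookmark.setBold(true);
            bookmark.setItalic(false);
            bookmark.setAction(new GoToAction(mergedDoc.getPages().get_Item(firstPageIndex)));
            mergedDoc.getOutlines().add(bookmark);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    mergedDoc.save("merged.pdf"); // placeholder output path
}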