Create a new page from text file using Form Feed

dschak · May 14, 2018, 11:56pm

Hi,

I want to convert text reports to PDF. I have most of the conversion done, such as switching to landscape and setting a fixed-width font (Courier New).

However, during the conversion, tabs and form feeds (new page) are being ignored.

Without having to read each line and decide whether to create a new PDF page, is it possible to tell either the TextFragment or the Pdf.Page to create a new page on Form Feed (ASCII 12)?

I suspect that the missing tabs might be a more difficult issue.

Code from your documentation:
TextFragment text = new TextFragment(tr.ReadToEnd());
page.Paragraphs.Add(text);

Thank you.

imran.rafique · May 15, 2018, 8:01am

@dschak,

The Rectangle member of a TextFragment instance returns the position of the text on the PDF page. You can also retrieve the page's rectangle coordinates through the Rect member of the Page instance, compare the two, and then decide whether to add a new page. To add an empty page, call the Add method of the PageCollection class as follows:
C#

Document pdfDocument = new Document(dataDir + "input.pdf");
pdfDocument.Pages.Add();

If this does not help, please send the complete code along with the form feed data and an expected output PDF. We will investigate your scenario in our environment and share our findings with you.

dschak · May 15, 2018, 8:03pm

I understand how to add a new page.

What I want is for a new page to be created automatically when the new page character (Form Feed, ASCII Char(12)) is found in the text file.

The text files in question are old reports that I want to convert to PDF.

Thanks.

imran.rafique · May 16, 2018, 1:30am

@dschak,

We are afraid that this is not supported. An enhancement has been logged under the ticket ID PDFNET-44710 in our issue tracking system. We have linked your post to this ticket and will keep you informed of any updates. As a workaround, you can insert the text into the PDF document and find the rectangle position of each ASCII Char(12). The rectangle position of each such character can then be used to add page breaks as follows:
C#

Document doc = new Document("source.pdf");
Document dest = new Document();
PdfFileEditor fileEditor = new PdfFileEditor();
fileEditor.AddPageBreak(doc, dest, new PdfFileEditor.PageBreak[] { new PdfFileEditor.PageBreak(1, 450) });
dest.Save("dest.pdf");

The Rectangle member of a TextFragment instance returns the rectangle position of the text. Please refer to this help topic: Search and Get Text from Pages of a PDF Document

dschak · May 16, 2018, 2:23am

Okay, I need a bit more detail here.

I think that I can find the form feeds with
Aspose.Pdf.Text.TextFragmentAbsorber tfa = new Aspose.Pdf.Text.TextFragmentAbsorber('\12');
assuming that the initial text ingestion did not do anything with the character.

Then how do I use that in the PdfFileEditor?

imran.rafique · May 16, 2018, 11:41am

@dschak,

Please try the source code as follows:
C#

string dataDir = @"C:\Pdf\test830\";
Document doc = new Document(dataDir + "input.pdf");
// Create a TextFragmentAbsorber to find all instances of the search phrase
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Split");
// Accept the absorber for all the pages
doc.Pages.Accept(textFragmentAbsorber);
// Get the first extracted text fragment by index
TextFragment textFragment = textFragmentAbsorber.TextFragments[1];
Document dest = new Document();
PdfFileEditor fileEditor = new PdfFileEditor();
fileEditor.AddPageBreak(doc, dest, new PdfFileEditor.PageBreak[] { new PdfFileEditor.PageBreak(textFragment.Page.Number, textFragment.Rectangle.URY) });
dest.Save(dataDir + "output.pdf");

This is the ZIP of input and output PDF documents: files.zip (172.4 KB)

dschak · May 17, 2018, 4:49am

Hi Imran,

That code does work to a limited extent; however, I still have 2 issues.

The first is that, to process all of the page breaks in a report (think 100+), the PdfFileEditor is going to be VERY slow and messy, reprocessing the document from source to destination for every page break. Is there a way to do them all at once?

The second is related to the form feed character. I don’t seem to be able to find it using the absorber. Attached are 2 sample files, the original text and the PDF generated from it using the code below:

Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document();
Aspose.Pdf.Page pdfPage = pdfDocument.Pages.Add();
pdfPage.PageInfo.IsLandscape = true;
StreamReader sr = new StreamReader(reportFilename);
Aspose.Pdf.Text.TextFragment pdfText = new Aspose.Pdf.Text.TextFragment(sr.ReadToEnd());
pdfPage.Paragraphs.Add(pdfText);

If I use

Aspose.Pdf.Text.TextFragmentAbsorber tfa = new Aspose.Pdf.Text.TextFragmentAbsorber("Date");
pdfDocument.Pages.Accept(tfa);

I get 4 fragments (as expected); however, if I change the "Date" to '\12', I get zero fragments. So did the FF get loaded from the txt file into the PDF, and if so, how do I locate it with the TFA?

Thanks.

Report.zip (32.7 KB)
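
A side note on the search string used above: C# has no octal character escapes, so '\12' is not a form feed there (and in C-style octal notation, 12 would be character 10, the line feed). The form feed (ASCII 12) can be written as '\f', '\u000C', or (char)12. The sketch below only inspects the source text, not the generated PDF, to confirm that the character is present before conversion. The class name, method name, and sample string are hypothetical, and it makes no claim about whether TextFragmentAbsorber can match the character.

```csharp
using System;
using System.Linq;

public static class FormFeedCheck
{
    // Counts form feed characters (ASCII 12) in the given text.
    public static int CountFormFeeds(string text)
    {
        return text.Count(c => c == '\f');
    }

    public static void Main()
    {
        // Hypothetical two-page report body; '\f' marks the page boundary.
        string sample = "Page 1 line\r\n\fPage 2 line\r\n";

        // All three spellings denote the same character.
        Console.WriteLine('\f' == '\u000C' && '\f' == (char)12);

        // Number of page boundaries found in the raw text.
        Console.WriteLine(CountFormFeeds(sample));
    }
}
```

If the count is greater than zero for the report file, the character is present in the input, and the open question is only what happens to it during conversion.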

imran.rafique · May 17, 2018, 10:40am

@dschak,

We have recorded both issues under the same ticket ID PDFNET-44710. To process 100+ page splits, you can maintain a hash table. If the process is slow, please share all details of the scenario so that we can replicate the performance issue in our environment; this will help us find the root cause and fix the problem. Once ticket PDFNET-44710 is fixed, you will be able to split pages at every ASCII page break character at once.
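
Until that ticket is resolved, one possible approach is to split the report text at each form feed before it reaches the PDF library, so that every page is created in a single pass. The following is a minimal sketch under stated assumptions, not an Aspose feature: ReportText, SplitPages, and ExpandTabs are hypothetical helper names, and the tab width of 8 columns is an assumption about how the old reports were laid out.

```csharp
using System;
using System.Text;

public static class ReportText
{
    // Splits report text into page chunks at each form feed (ASCII 12).
    public static string[] SplitPages(string text)
    {
        return text.Split('\f');
    }

    // Replaces each tab with spaces up to the next tab stop.
    // The column counter resets at every line break.
    public static string ExpandTabs(string text, int tabWidth = 8)
    {
        var sb = new StringBuilder();
        int column = 0;
        foreach (char c in text)
        {
            if (c == '\t')
            {
                int pad = tabWidth - (column % tabWidth);
                sb.Append(' ', pad);
                column += pad;
            }
            else
            {
                sb.Append(c);
                column = (c == '\n' || c == '\r') ? 0 : column + 1;
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Hypothetical report: two pages separated by a form feed.
        string report = "A\tB\r\n\fC\tD\r\n";
        foreach (string pageText in SplitPages(report))
        {
            // In the conversion code, each chunk would go onto its own page here.
            Console.WriteLine("[" + ExpandTabs(pageText).TrimEnd() + "]");
        }
    }
}
```

Each chunk could then be placed on a fresh page with the calls already shown in this thread (Pages.Add() followed by Paragraphs.Add with a new TextFragment), instead of running PdfFileEditor once per break. A report that ends with a form feed yields an empty final chunk, which would need to be skipped. How Aspose.PDF lays out the resulting pages has not been verified here.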