How to access a stream inside a PDF?

Hi,

We have a PDF that has an internal RTF stream. I can’t find how to access it with Aspose; no matter what object I look at, I can’t find the RTF data.

I have attached a sample PDF: PDF.pdf (5.4 KB)

When I inspect the PDF online with pdfux (Inspect PDF file - Free online tool - pdfux), I can see there is RTF data in a stream:

image.png (18.0 KB)

Aspose.PDF doesn’t seem to recognise it at all. Is there a way to see this information?

@GaryO
There are no public API methods for this.
I have been considering whether to create an enhancement task for this case.
How did you find out that the document contains a stream whose content is an RTF file?
Do you need batch, automated processing, or is this needed only for this one file?
I am attaching the resulting RTF file: Trn9B49.zip (899 Bytes)

Hi @sergei.shibanov ,

Yes, the extracted RTF is what I expected and matches what I can see on pdfux.com.

I think it would be useful to expose this in the public API (both get and set if possible). We need to process PDF documents from a variety of sources, and in this case we need to detect and remove any extra data in the PDF that raises potential security concerns. The sample I sent is very basic, but as you can see, the RTF stream actually contains tracking information (username and dates). The PDF in this case was created with Nuance PowerPDF, which stores this data when Advanced Editing and Track Changes are used.
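
For illustration only, the kind of detection described here can be approximated outside Aspose.PDF by scanning the raw file bytes. The Python sketch below is not Aspose code, and the names `find_rtf_streams` and `STREAM_RE` are made up for this example. It assumes the PDF is not encrypted, that each stream is either uncompressed or Flate-compressed, and that a naive search for the `stream`/`endstream` keywords is good enough; streams using other filters or stored inside object streams would be missed.

```python
import re
import zlib

# Naive match of the bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)


def find_rtf_streams(pdf_bytes: bytes) -> list[bytes]:
    """Return the payload of every stream that starts with the RTF signature."""
    found = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # Try Flate (zlib) first; decompressobj tolerates a clipped trailer.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not zlib data: assume the stream is stored uncompressed.
            data = raw
        if data.lstrip().startswith(b"{\\rtf"):
            found.append(data)
    return found
```

Calling `find_rtf_streams(open("PDF.pdf", "rb").read())` would return the RTF payloads found, if any. Removing them safely is a different matter, since deleting bytes invalidates the cross-reference table unless the file is rewritten by a PDF library.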

@GaryO
What would you like the additional API to look like?
Something like:
bool Page… CheckPotentiallyDangerousContent()
void Page… DeletePotentiallyDangerousContent()
Or do you also want the ability to save this data?

The closest existing features are Pages.Resources and Pages.Artifacts. What you want could even be added there, in the same way those were implemented.

@sergei.shibanov

I haven’t used Pages.Artifacts or Pages.Resources, but since they are collections, I’d say something similar would be good.

I think ideally exposing the streams in a collection would give the greatest flexibility. Something like Pages.UnsupportedStreams (or UnknownStreams or NonStandardStreams) with the ability to read the stream somehow. Then add the usual Add/Update/Clear functions.
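
To make the suggestion concrete, here is a minimal Python sketch of what such a collection could look like. Everything in it is hypothetical: `UnknownStreams`, `RawStream`, and their members are illustrative names only, not part of Aspose.PDF, and the real design would be up to the development team.

```python
from dataclasses import dataclass, field


@dataclass
class RawStream:
    object_id: int  # hypothetical: PDF object number that owns the stream
    data: bytes     # hypothetical: decoded stream payload


@dataclass
class UnknownStreams:
    """Hypothetical collection of streams the library does not interpret."""
    _items: dict[int, RawStream] = field(default_factory=dict)

    def add(self, stream: RawStream) -> None:
        self._items[stream.object_id] = stream

    def update(self, object_id: int, data: bytes) -> None:
        self._items[object_id].data = data

    def clear(self) -> None:
        self._items.clear()

    def __iter__(self):
        return iter(self._items.values())

    def __len__(self) -> int:
        return len(self._items)
```

With something of this shape, caller code could iterate over the collection, inspect each payload (for example, check for an RTF signature), and then update or clear entries as needed.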

I think leaving it generic is the best way forward (i.e. not specific to our case). That way, if anyone comes across any other unusual streams, at least they can develop their own logic to handle them.

The only reason this came up is that the customer request was to remove the track changes info. After spending time trying, with no luck, to determine programmatically how PowerPDF stores that info, I tried pdfux and was able to see that it’s a separate stream. But I don’t think it needs to be specific to this case, as I’m sure we’d like to know whenever there’s data stored in a PDF that we can’t normally see.

@GaryO
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-55141

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

@GaryO
Thank you for not only describing your need but also summarizing your vision.
I agree with it and have created a task for the development team to implement this feature.

@sergei.shibanov Is there any ETA on PDFNET-55141 being completed? Thanks.

@dnewt
There is nothing new on this task yet. Tasks are handled in the order they are received, taking priorities into account.
The highest priority goes to tasks with paid support, followed by tasks from users who have purchased a license.
The time needed to resolve an issue also varies, so unfortunately we cannot give an ETA.