How to remove a watermark from PDF input

DougT · May 12, 2022, 4:01am

Hi,

I am having an issue removing a watermark from a PDF that was generated by Aspose.Words.

The code below removes the watermark from a DOCX input file, but when the same file is saved as PDF using Aspose.Words and then loaded again, the watermark is not removed.

Here is my current code, which works for DOCX input:

public static void RemoveWatermark(Document doc)
{
    // This code doesn't work for my DOCX inputs, and throws a null reference error when dealing with PDF input
    //var watermark = doc.Watermark;

    //if (watermark.Type != WatermarkType.None)
    //{
    //    watermark.Remove();
    //}

    foreach (Section sec in doc.Sections)
    {
        foreach (HeaderFooter headerFooter in sec.HeadersFooters)
        {
            foreach (Shape shape in headerFooter.GetChildNodes(NodeType.Shape, true))
            {
                if (shape.IsImage && (shape.AllowOverlap && shape.BehindText))
                {
                    shape.Remove();
                }
            }
        }
    }
}

I am using Aspose.Words 22.5.

BuildDemo.docx (64.7 KB)
BuildDemo.pdf (66.1 KB)

Cheers

alexey.noskov · May 12, 2022, 5:44am

@DougT MS Word documents and PDF documents differ in structure. MS Word documents are flow documents, and their structure is very similar to the Aspose.Words Document Object Model. PDF documents, however, are fixed-page format documents. In PDF there is no such element as a header or footer; a watermark is simply rendered under the content on each page. So there is no way to preserve a watermark as a watermark after an Aspose.Words DOM -> PDF -> Aspose.Words DOM round trip.

DougT · May 12, 2022, 6:12am

Thanks Alexey. Is there any way, such as going through the document body or metadata, to remove any shape object with a particular property? In my case, the watermarks are quite large in size.

alexey.noskov · May 12, 2022, 6:44am

@DougT You can try removing shapes that are placed behind text. For example see the following code:

Document doc = new Document(@"C:\Temp\in.pdf");

NodeCollection shapes = doc.GetChildNodes(NodeType.Shape, true);
foreach (Shape s in shapes)
{
    if (s.BehindText)
        s.Remove();
}

doc.Save(@"C:\Temp\out.docx");

DougT · May 12, 2022, 8:29am

Thanks Alexey, but it's still not working. I’ve also tried:

NodeCollection shapes = doc.GetChildNodes(NodeType.Shape, true);
foreach (Shape s in shapes)
{
    //if (s.BehindText)
    s.Remove();
}

Same result, so apparently it’s not a shape, which doesn’t make sense to me.

alexey.noskov · May 12, 2022, 8:52am

@DougT Are you using the same PDF document you attached earlier as the input file? If not, could you please attach the real source document?

DougT · May 12, 2022, 9:29pm

Thanks Alexey.

Yes, it’s the same PDF document. I did a bit of experimenting, and it turns out that if I save the output as a Word document, the watermark is removed. But I need to save it to an image file (JPEG in my case), and that output still has the watermark.

I’ve uploaded a demo project file: ConsoleApp1.zip (8.6 MB)

alexey.noskov · May 13, 2022, 5:48am

@DougT The problem occurs because the page layout is cached when you load a document from PDF. So changes made to the model are not reflected when you save the document in a fixed-page format (an image in your case). You can fix this by updating the page layout before saving:

private static void RemoveWatermark(string docName)
{
    var doc = new Document(docName);

    // Copy the live collection to an array so nodes can be removed safely while iterating.
    Node[] shapes = doc.GetChildNodes(NodeType.Shape, true).ToArray();
    foreach (Shape s in shapes)
    {
        if (s.BehindText)
            s.Remove();
    }

    doc.UpdatePageLayout();
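    // Path.ChangeExtension only touches the extension, unlike Replace("pdf", ...),
    // which would also alter "pdf" anywhere else in the path.
    doc.Save(System.IO.Path.ChangeExtension(docName, ".jpg"));
    doc.Save(System.IO.Path.ChangeExtension(docName, ".docx"));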
}

DougT · May 13, 2022, 8:15am

Thanks Alexey, that fixed the issue.

So to rephrase: if the input document is a PDF, I should call UpdatePageLayout; if I don’t, my changes might not show up in some cases (fixed-page output formats).

So is it generally a good idea to call it for PDF input regardless of the output format, whenever the document object is modified, to make sure the page layout is updated?

alexey.noskov · May 13, 2022, 8:33am

@DougT Updating the document page layout is quite a resource-consuming operation, so if you save to a flow format, calling it is not required. If you save the document to a fixed-page format, calling Document.UpdatePageLayout will not affect performance, because Aspose.Words either calls this method internally or uses the cached layout built when you called Document.UpdatePageLayout.
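
A minimal sketch of this rule, assuming the output format can be judged from the file extension. The class name, the method names, and the extension list are hypothetical and not exhaustive; only Document.UpdatePageLayout and Document.Save are Aspose.Words calls:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Aspose.Words;

public static class LayoutAwareSave
{
    // Hypothetical, non-exhaustive list of fixed-page output extensions.
    private static readonly HashSet<string> FixedPageExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { ".pdf", ".xps", ".svg", ".jpg", ".jpeg", ".png", ".tiff", ".bmp" };

    // True when the target path looks like a fixed-page format.
    public static bool IsFixedPageOutput(string path) =>
        FixedPageExtensions.Contains(Path.GetExtension(path));

    // Saves a document that was modified after loading.
    public static void SaveAfterEdit(Document doc, string path)
    {
        // Rebuild the layout only for fixed-page output, where a stale
        // cached layout would otherwise be rendered. Flow formats skip
        // this step because they do not use the layout.
        if (IsFixedPageOutput(path))
            doc.UpdatePageLayout();

        doc.Save(path);
    }
}
```

With this helper, LayoutAwareSave.SaveAfterEdit(doc, "out.jpg") would rebuild the layout before rendering, while LayoutAwareSave.SaveAfterEdit(doc, "out.docx") would save directly.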

DougT · May 13, 2022, 8:43am

Hi Alexey, I'm not sure I understand the part about it not affecting performance. Are you saying that the internal call made for fixed-page formats is different from an explicit call to update the page layout (and therefore not as resource-consuming), and likewise that using the cached layout is different from the case that is considered resource-consuming?

alexey.noskov · May 13, 2022, 8:56am

@DougT If you load a document from a flow format (DOC, DOCX, RTF, HTML, etc.), the document does not have a layout. To convert the document to a fixed-page format (PDF, image, XPS, SVG, etc.), Aspose.Words builds the document layout. So the following code will internally call Document.UpdatePageLayout:

Document doc = new Document("in.docx");
doc.Save("out.pdf"); // Here Aspose.Words will call Document.UpdatePageLayout internally.

If you call Document.UpdatePageLayout before saving, Aspose.Words will use the cached layout:

Document doc = new Document("in.docx");
doc.UpdatePageLayout();
doc.Save("out.pdf"); // Document.UpdatePageLayout is NOT called internally. Cached layout is used.

In your case, you load the document from PDF, so the layout is cached, and changes made to the document are not reflected in the cached layout. For example, see the following simple code:

Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);

builder.Writeln("This is test paragraph");
builder.Writeln("This is test paragraph");
builder.Writeln("This is test paragraph");

// Layout is cached here.
doc.UpdatePageLayout();

// Changes made to the model will not be reflected in the cached layout.
doc.FirstSection.Body.FirstParagraph.Remove();

doc.Save(@"C:\Temp\out.pdf"); // 3 paragraphs will be rendered.

doc.UpdatePageLayout();

doc.Save(@"C:\Temp\out_updated.pdf"); // 2 paragraphs will be rendered.

DougT · May 13, 2022, 9:03am

Think I’ve got it Alexey, thanks for the explanation.