TextAbsorber - How to parse \r\n?

marcoaraujolsys · July 19, 2014, 9:10am

Hello,

I'm using Aspose.Pdf (version 9.3.0.0) to read a PDF file and examine its contents. But when I get the text through TextAbsorber, it is unclear whether a '\r\n' is a single line break, whether '\r\n\r\n' is a break of several lines, or, worse, whether a '\r\n' is a line break caused by a column reaching its character limit (in which case I should replace it with a space).

Is there a pattern I can use to replace these '\r\n'?

Sample text follows:

"Tribunal Regional \r\n\r\n\r\n\r\n Severino Rodrigues dos Santos\r\n\r\n Presidente \r\n\r\n João Leite de Arruda Alencar \r\n Vice-Presidente \r\n\r\n Avenida da Paz, 2076 \r\n\r\n Centro\r\n Maceió/AL \r\n \r\n CEP: 57020440 \r\n \r\n Telefone(s) : (82) 2121 8299 \r\n\r\n\r\n\r\n \r\n Secretaria Judiciária \r\n \r\n Acórdão \r\n Publicação de Acórdão \r\n Processo Nº MS-0000300-89.2012.5.19.0000 \r\n Processo Nº MS-00300/2012-000-19-00.7 \r\n\r\n Relator ANNE INOJOSA\r\n Revisor ANTÔNIO CATÃO \r\n Redator ANNE INOJOSA \r\n Impetrante(s) J. A. J. MOTEIS LTDA. - EPP \r\n Advogado FABRICIO SIQUEIRA DE \r\n MIRANDA(OAB: 8278AL) \r\n Impetrado(s) JUIZ DAS EXECUCOES DAS VARAS \r\n DO TRABALHO DE MACEIO\r\n\r\n Litisconsorte(s) ANTONIO JOSE BARROS\r\n\r\n Litisconsorte(s) MARCOS ANTONIO DE CARVALHO\r\n RIBEIRO \r\n Litisconsorte(s) MARIA NAILCE TENORIO RIBEIRO \r\n\r\n Procedência: Trt 19ª Região - Maceió/Al\r\n\r\n EMENTA:\r\n\r\n PEDIDO DE SUBSTITUIÇÃO DE BENS. INDEFERIMENTO. Tendo\r\n\r\n em vista que o impetrante expandiu seu negócio, utilizando-se das\r\n\r\n dependências em que antes se encontrava estabelecido o Motel\r\n\r\n L´amore (Hotéis Rotativos de Maceió Ltda.), atuando no mesmo\r\n\r\n ramo de atividade deste, não há como fazer cair por terra o\r\n\r\n entendimento abraçado pelo Juízo das execuções, no sentido de\r\n\r\n que houve sucessão de empregadores, cabendo-se inclusive\r\n\r\n registrar que tal entendimento não foi modificado por esta Corte,\r\n\r\n quando do pronunciamento referente ao agravo de petição outrora\r\n\r\n interposto pelo ora impertrante. Registre-se ainda que cabe ao\r\n\r\n Judiciário ingressar no patrimônio do sucessor, a fim de conseguir\r\n\r\n recursos para findar a execução e, desta forma, entregar a\r\n\r\n prestação jurisdicional em sua integralidade, em obediência ao\r\n\r\n disposto no art. 5º, inc. LXXVIII, da CRFB/88. Logo, restando\r\n\r\n correta a constrição judicial sobre bens do impetrante, através da\r\n\r\nCódigo para aferir autenticidade deste caderno: 73552"
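
To make the ambiguity in the sample concrete, here is a minimal sketch in plain .NET (no Aspose.Pdf calls). The class and method names (`BreakRuns.Split`) are hypothetical, introduced only for illustration. It lists each text segment together with the number of consecutive '\r\n' sequences that follow it. In the sample above, wrapped lines and genuine breaks both show a count of 2, so the count alone may not be enough to tell them apart.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BreakRuns
{
    // Hypothetical helper, not part of Aspose.Pdf.
    // Returns (segment, breakCount) pairs, where breakCount is the number of
    // consecutive "\r\n" sequences that follow the segment. Spaces or tabs
    // between breaks (as in "\r\n \r\n") are treated as part of the same run.
    public static List<KeyValuePair<string, int>> Split(string text)
    {
        var result = new List<KeyValuePair<string, int>>();
        var run = new Regex(@"(?:[ \t]*\r\n)+[ \t]*");
        int pos = 0;

        foreach (Match m in run.Matches(text))
        {
            string segment = text.Substring(pos, m.Index - pos).Trim();
            int count = Regex.Matches(m.Value, @"\r\n").Count;
            if (segment.Length > 0)
            {
                result.Add(new KeyValuePair<string, int>(segment, count));
            }
            pos = m.Index + m.Length;
        }

        // Text after the last break run, if any, has no trailing break.
        string tail = text.Substring(pos).Trim();
        if (tail.Length > 0)
        {
            result.Add(new KeyValuePair<string, int>(tail, 0));
        }
        return result;
    }
}
```

Calling `BreakRuns.Split(textAbsorber.Text)` and printing each pair is one way to inspect where the long runs fall before deciding on a replacement rule.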

codewarior · July 21, 2014, 7:18am

Hi Marco,

Thanks for contacting support.

As I understand it, you are trying to extract text from a PDF file and are having trouble determining where the line breaks fall in the extracted contents, i.e. where the \r\n characters are placed. Please share some further details about your requirement, along with the source PDF file and the code snippet you are using, so that we can test the scenario on our end. We are sorry for the inconvenience.

cadoria · July 21, 2014, 11:47am

Hello,

I just need to differentiate a real line break (the end of a line, such as a header line) from the line break produced when the text reaches the edge of the page and the text starts on the next page (which is NOT really the end of the line). Both produce the same line-break sequence, '\r\n\r\n'!

Code:

using System.IO;
using Aspose.Pdf.Text;

string dataDir = Path.GetFullPath(@"C:\Projetos\CriarWord\Word\Word\");
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(Path.Combine(dataDir, "TRT_MColumn.pdf"));

// Pages are 1-indexed. A new TextAbsorber is created for each page so that
// each iteration reads only that page's text.
for (int page = 1; page <= pdfDocument.Pages.Count; page++)
{
    TextAbsorber textAbsorber = new TextAbsorber();
    pdfDocument.Pages[page].Accept(textAbsorber);
    string linha = textAbsorber.Text;
    //....
}

codewarior · July 22, 2014, 7:19am

Hi Marco,

Thanks for sharing the details.

As I understand it, you are trying to find the location where a page break occurs and the page contents continue on the subsequent page, rather than to identify line breaks. Can you please confirm that I have understood your requirement correctly, so that we can reply accordingly?

cadoria · July 23, 2014, 7:48am

No,

I'm trying to identify two situations (for PDFs with 2 or more columns):

1) A deliberate line break (for example after a short header row), i.e. an explicit ENTER.

2) A line that reaches the column limit, which generates an "ENTER" to continue on the next line.

In case 2 I should replace this "ENTER" with a space, since it is only a continuation of the text on another line (not a real newline!).

How can I differentiate the "ENTER" in items 1 and 2?

Thanks

codewarior · July 24, 2014, 5:04am

Hi Marco,

Thanks for sharing the details. Let me rephrase what I have understood.

You need to identify the positions where an Enter was pressed to create a line break, and distinguish them from the points where the contents of a line moved to the subsequent line because they reached the right edge of the page (automatic word wrapping).
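
As a rough illustration of that distinction, the following is a minimal text-only sketch and not an Aspose.Pdf feature. The names `LineJoiner`, `JoinWrappedLines` and `LooksLikeWrap` are hypothetical. It assumes that a break is automatic wrapping when the text before it does not end in sentence-ending punctuation and the text after it starts with a lowercase letter. That assumption will miss some cases, for example a wrapped line that continues with a capitalized name.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class LineJoiner
{
    // Assumed heuristic (not from Aspose.Pdf): a break is a wrap if the
    // previous piece does not end in . : ; ! ? and the next piece starts
    // with a lowercase letter.
    private static bool LooksLikeWrap(string prev, string next)
    {
        if (prev.Length == 0 || next.Length == 0)
        {
            return false;
        }
        bool endsSentence = ".:;!?".IndexOf(prev[prev.Length - 1]) >= 0;
        return !endsSentence && char.IsLower(next[0]);
    }

    // Replaces break runs that look like wrapping with a single space and
    // leaves all other break runs exactly as they were.
    public static string JoinWrappedLines(string text)
    {
        // The capturing group makes Regex.Split return the separators too,
        // so parts alternates: text, break run, text, break run, ...
        string[] parts = Regex.Split(text, @"((?:[ \t]*\r\n)+[ \t]*)");
        var sb = new StringBuilder(parts[0]);

        for (int i = 1; i + 1 < parts.Length; i += 2)
        {
            string prev = parts[i - 1];
            string separator = parts[i];
            string next = parts[i + 1];

            sb.Append(LooksLikeWrap(prev, next) ? " " : separator);
            sb.Append(next);
        }
        return sb.ToString();
    }
}
```

Applied to a fragment such as "Tendo\r\n\r\n em vista", the sketch joins the two pieces with a space, while a break before an upper-case heading such as "EMENTA:" is left untouched. A geometry-based approach that compares fragment positions on the page would likely be more reliable than this punctuation-based guess.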

cadoria · July 24, 2014, 3:00pm

Exactly.

With "automatic word wrapping" I need to replace the break with a space to join the lines.

A "pressed" ENTER does not need this.

How can I distinguish these two situations so that I change the correct line breaks?

Regards,

codewarior · July 25, 2014, 7:15am

Hi Marco,

Thanks for sharing the details.

I have logged the above requirement as PDFNEWNET-37251 in our issue tracking system. We will look further into the details of this requirement and keep you posted on its status. Please be patient and allow us a little time.

aspose.notifier · April 13, 2018, 6:53pm

The issues you have found earlier (filed as PDFNET-37251) have been fixed in Aspose.PDF for .NET 18.4. This message was posted using BugNotificationTool from Downloads module by asad.ali