Content transformer for PDF

Hello,
I am trying to build a transformer that converts a scanned PDF into a text PDF. The transformer should run automatically whenever a PDF is uploaded to Alfresco.
I am using Tesseract and Alfresco 5.0.

After some research I found a post on the Seedim forum that explains how to do this: http://www.seedim.com.au/content/alfresco-search-pdf-images-using-transformations-and-tesseract-ocr

I first added a transformer in /opt/alfresco-community/tomcat/shared/classes/alfresco/extension/,
in a file named PDFimage-transform-context.xml:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
 
<beans>
<bean id="transformer.worker.pdfimg2ocrtxt" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
<property name="mimetypeService">
<ref bean="mimetypeService"/>
</property>
<property name="checkCommand">
<bean class="org.alfresco.util.exec.RuntimeExec">
<property name="commandsAndArguments">
<map>
<entry key=".*">
<list>
<value>ls</value>
<value>/opt/alfresco-community/pdf.sh</value>
</list>
</entry>
</map>
</property>
</bean>
</property>
<property name="transformCommand">
<bean class="org.alfresco.util.exec.RuntimeExec">
<property name="commandsAndArguments">
<map>
<entry key=".*">
<list>
<value>/opt/alfresco-community/pdf.sh</value>
<value>${source}</value>
<value>${target}</value>
</list>
</entry>
</map>
</property>
<property name="errorCodes">
<value>1,2,3</value>
</property>
</bean>
</property>
</bean>
 
<bean id="transformer.pdfimg2ocrtxt" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
<property name="worker">
<ref bean="transformer.worker.pdfimg2ocrtxt"/>
</property>
</bean>
</beans>
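
The checkCommand above only lists /opt/alfresco-community/pdf.sh, which confirms that the file exists but not that it can be executed. A stricter check might look like the sketch below; `script_is_runnable` is a hypothetical helper name, not part of Alfresco.

```shell
#!/bin/bash
# Sketch only: script_is_runnable is a hypothetical helper, not an Alfresco API.

# Succeeds only if the path is a regular file with the execute bit set
# for the user running this check.
script_is_runnable() {
    local script=$1
    [ -f "$script" ] && [ -x "$script" ]
}
```

Running `script_is_runnable /opt/alfresco-community/pdf.sh && echo ok` as the same user that runs Alfresco would show whether that user is allowed to execute the script.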

I wrote a script and placed it in /opt/alfresco-community.
The script works fine when I launch it from the terminal:

#!/bin/bash
 

SOURCE=$1

TARGET=$2

TMPDIR=/home/yosri/tmp

name=yosri

TEMP_PDFTXT_FILE=$TMPDIR/pdftext.txt

echo running command "pdftotext -nopgbrk $SOURCE $TEMP_PDFTXT_FILE"

pdftotext -nopgbrk $SOURCE $TEMP_PDFTXT_FILE

FILESIZE=$(stat -c%s "$TEMP_PDFTXT_FILE")

echo "Size of $TEMP_PDFTXT_FILE = $FILESIZE bytes.">>/home/yosri/logfile.txt
 

# If the file exists and has a size bigger than 0, use the extracted text as the result of the transformation and exit.
if [ -s "$TEMP_PDFTXT_FILE" ]; then

    echo "Found wordlist in $TEMP_PDFTXT_FILE" >> /home/yosri/logfile.txt

    cat "$TEMP_PDFTXT_FILE" >> "$TARGET"

    rm -rf "$TMPDIR/$name"

    exit 0

fi

# Split the PDF into individual page images

gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r300 -dTextAlphaBits=4 -o out_%04d.jpg -f "$SOURCE"

# Process each page
for f in $(ls *.jpg); do

  # Extract text

  tesseract "$f" "$TMPDIR/${f%.*}" -l eng

  cat "$TMPDIR/${f%.*}.txt" >> "$TMPDIR/res.txt"

  rm -f "$TMPDIR/${f%.*}.txt"

  rm -f "$f"

done


# Combine all pages back into ${TARGET}

cat "$TMPDIR/res.txt" >> "$TARGET"
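
Because the `-o out_%04d.jpg` argument is a relative path, Ghostscript writes the page images into whatever directory the script is started from, and the loop reads them from there as well. When Alfresco launches the script, that directory may not be the one used in the terminal test and may not be writable. A minimal sketch of preparing a private scratch directory for that step, assuming `mktemp` is available; `make_workdir` and `page_images` are hypothetical helper names:

```shell
#!/bin/bash
# Sketch only: make_workdir and page_images are hypothetical helpers.

# Create a fresh scratch directory and print its path.
make_workdir() {
    mktemp -d "${TMPDIR:-/tmp}/pdfocr.XXXXXX"
}

# Print the page images found in a directory, one per line.
# The zero-padded names (out_0001.jpg, out_0002.jpg, ...) make the
# glob's lexical order match the page order.
page_images() {
    local dir=$1 f
    for f in "$dir"/out_*.jpg; do
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

In the script, one could `cd` into the directory returned by `make_workdir` before calling gs, loop over `page_images`, and remove the directory once the text has been written to the target.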

Finally, I added the priority settings to alfresco-global.properties:

content.transformer.pdfimg2ocrtxt.priority=30
content.transformer.pdfimg2ocrtxt.extensions.pdf.txt.supported=true
content.transformer.pdfimg2ocrtxt.extensions.pdf.txt.priority=30
content.transformer.pdfimg2ocrtxt.extensions.pdf.txt.maxSourceSizeKBytes.use.index=9999

But when I upload a PDF that contains images, it is still not indexed.
To verify that the transformer file is loaded, I added some extra code to it, and Alfresco would not start, so the file is being loaded.
Can anyone help me, please? Did I miss something?
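
One way to narrow this down would be to record every invocation at the very top of the script, which shows whether Alfresco calls it at all, and under which user and working directory. A minimal sketch; `log_invocation` is a hypothetical helper name and the log path is up to you:

```shell
#!/bin/bash
# Sketch only: log_invocation is a hypothetical helper.

# Append one line per call with a timestamp, the calling user,
# the current working directory, and the arguments received.
log_invocation() {
    local logfile=$1
    shift
    printf '%s user=%s cwd=%s args=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$(id -un)" "$PWD" "$*" >> "$logfile"
}
```

Calling `log_invocation /home/yosri/logfile.txt "$@"` as the first command of pdf.sh and then uploading a PDF would show whether a new line appears in the log.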

Alfresco Version: 5.0

Tags: OCR
