Channel: Alfresco Forums - Configuration

Content transformer for PDF

Hello,
I am trying to build a transformer that converts a scanned PDF into a text PDF. The transformer should run automatically whenever a PDF is uploaded to Alfresco.
I am using Tesseract and Alfresco 5.0.

After some research I found a post on the Seedim site that explains how to do this: http://www.seedim.com.au/content/alfresco-search-pdf-images-using-transformations-and-tesseract-ocr

First I added a transformer definition in /opt/alfresco-community/tomcat/shared/classes/alfresco/extension/,
in a file named PDFimage-transform-context.xml:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
 
<beans>
<bean id="transformer.worker.pdfimg2ocrtxt" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
<property name="mimetypeService">
<ref bean="mimetypeService"/>
</property>
<property name="checkCommand">
<bean class="org.alfresco.util.exec.RuntimeExec">
<property name="commandsAndArguments">
<map>
<entry key=".*">
<list>
<value>ls</value>
<value>/opt/alfresco-community/pdf.sh</value>
</list>
</entry>
</map>
</property>
</bean>
</property>
<property name="transformCommand">
<bean class="org.alfresco.util.exec.RuntimeExec">
<property name="commandsAndArguments">
<map>
<entry key=".*">
<list>
<value>/opt/alfresco-community/pdf.sh</value>
<value>${source}</value>
<value>${target}</value>
</list>
</entry>
</map>
</property>
<property name="errorCodes">
<value>1,2,3</value>
</property>
</bean>
</property>
</bean>
 
<bean id="transformer.pdfimg2ocrtxt" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
<property name="worker">
<ref bean="transformer.worker.pdfimg2ocrtxt"/>
</property>
</bean>
</beans>
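
The checkCommand above only lists the script file, so it would succeed even if the script were not executable or the tools it calls were missing from the PATH of the account that runs Tomcat. A minimal sketch of a stricter manual check, assuming the script needs pdftotext, gs and tesseract (the commands used in the script in this post); the function name missing_tools is hypothetical:

```shell
#!/bin/bash
# Print every command from the argument list that cannot be found on PATH.
# Returns 0 when all commands are present, 1 otherwise.
missing_tools() {
    local rc=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example, to be run as the account that owns the Alfresco process:
# missing_tools pdftotext gs tesseract
# [ -x /opt/alfresco-community/pdf.sh ] || echo "pdf.sh is not executable"
```

If the function prints anything, the named tools are not visible to that account, which would be one possible reason for a transformation that works in an interactive terminal but not inside Alfresco.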

I wrote a script and placed it in /opt/alfresco-community.
The script works fine when I launch it from the terminal:
#!/bin/bash

SOURCE=$1
TARGET=$2
TMPDIR=/home/yosri/tmp
name=yosri
TEMP_PDFTXT_FILE=$TMPDIR/pdftext.txt
echo running command "pdftotext -nopgbrk $SOURCE $TEMP_PDFTXT_FILE"
pdftotext -nopgbrk "$SOURCE" "$TEMP_PDFTXT_FILE"
FILESIZE=$(stat -c%s "$TEMP_PDFTXT_FILE")
echo "Size of $TEMP_PDFTXT_FILE = $FILESIZE bytes." >> /home/yosri/logfile.txt

# If the file exists and has a size bigger than 0, use the extracted text as the result of the transformation and exit.
if [ -s "$TEMP_PDFTXT_FILE" ]; then
    echo "Found wordlist in $TEMP_PDFTXT_FILE" >> /home/yosri/logfile.txt
    cat "$TEMP_PDFTXT_FILE" >> "$TARGET"
    rm -rf "$TMPDIR/$name"
    exit 0
fi
# Split the PDF into individual page images (written to the current directory)
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r300 -dTextAlphaBits=4 -o out_%04d.jpg -f "$SOURCE"
# Process each page
for f in $(ls *.jpg); do
    # Extract text with Tesseract
    tesseract "$f" "$TMPDIR/${f%.*}" -l eng
    cat "$TMPDIR/${f%.*}.txt" >> "$TMPDIR/res.txt"
    rm -f "$TMPDIR/${f%.*}.txt"
    rm -f "$f"
done

# Combine all pages back into ${TARGET}
cat "$TMPDIR/res.txt" >> "$TARGET"
Finally, I added the priority settings to alfresco-global.properties:

content.transformer.pdfimg2ocrtxt.priority=30
content.transformer.pdfimg2ocrtxt.extensions.pdf.txt.supported=true
content.transformer.pdfimg2ocrtxt.extensions.pdf.txt.priority=30
content.transformer.pdfimg2ocrtxt.extensions.pdf.txt.maxSourceSizeKBytes.use.index=9999

But when I upload a PDF containing images, it is still not indexed.
I added some extra code to the transformer file to check whether it is loaded; Alfresco then failed to start, so the file is being loaded.
Can anyone help me, please? Did I miss something?

5.0
OCR