How to turn a pdf into a text searchable pdf?

I have a number of scanned documents in pdf and I want to be able to search them. How can I do that?
Essentially I have to OCR the PDF and then blend the extracted text back into a new PDF. I have unsuccessfully tried a number of different solutions (including the ones found in Adding OCR info to a PDF):
- pdfocr (which gives me this issue: https://github.com/gkovacs/pdfocr/issues/7)
- pdfsandwich (the software center says it is a poor package and that I should not install it)
- OCRfeeder (in the software center) exports to odt nicely, but does not react when exporting to PDF.
- Gscan2pdf exports an all-black (but searchable) image, as reported in this discussion.
- I don't think PDF-XChange Viewer can handle doing OCR on the fly on files over 500 pages.
Is there a software package I am unaware of? Or a script that does this?
software-recommendation pdf ocr
edited Apr 13 '17 at 12:24 by Community♦
asked May 29 '14 at 9:37 by don.joey
I haven't tried it out myself, yet, but I've seen this project get recommended in the past.
– Glutanimate
May 29 '14 at 21:22
4 Answers
Ubuntu < 16.04
Following the comment of Glutanimate I have found a working solution. It is the OCRmyPDF script.
git clone https://github.com/jbarlow83/OCRmyPDF
cd OCRmyPDF
sh ./OCRmyPDF.sh -h # to see the usage
If you get a message saying you should install GNU parallel, you can do so (following https://askubuntu.com/a/298598/115155) with the commands below (the second line is optional and depends on your flavor and version):
sudo apt-get install parallel
sudo rm /etc/parallel/config
Finally you can OCR your pdf with the command:
sh ./OCRmyPDF.sh input.pdf output.pdf # change input and output to the files you want
If it seems the command is unresponsive, you can increase the verbosity using the -v flag (which can be used incrementally as -vv or -vvv). It might be best to test the results first on a shorter PDF. You can shorten a PDF as follows:
pdftk A=input.pdf cat A1-5 output output.pdf
Ubuntu >= 16.04
As of Ubuntu 16.04, OCRmyPDF is available through apt. Just run:
sudo apt install ocrmypdf
ocrmypdf -h # to see the usage
Finally you can OCR your pdf with the command:
ocrmypdf input.pdf output.pdf # change input and output to the files you want
If it seems the command is unresponsive, you can increase the verbosity using the -v flag (which can be used incrementally as -vv or -vvv). It might be best to test the results first on a shorter PDF. You can shorten a PDF as follows:
pdftk A=input.pdf cat A1-5 output output.pdf
If you have any questions, have a look at the new GitHub repo.
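Since the question mentions a number of scanned documents, here is a minimal Python sketch of how the single-file command above might be applied to a whole folder. The helper names (output_path, plan_batch, run_batch), the "_ocr" output suffix and the example file names are assumptions made for illustration; only the plain "ocrmypdf input.pdf output.pdf" invocation comes from the answer.

```python
from pathlib import Path
import subprocess


def output_path(src, suffix="_ocr"):
    """Return the output name for a scan, e.g. a.pdf -> a_ocr.pdf (naming is an assumption)."""
    src = Path(src)
    return src.with_name(src.stem + suffix + src.suffix)


def plan_batch(paths, suffix="_ocr"):
    """Pair every input PDF with its output path, skipping non-PDFs and earlier outputs."""
    plan = []
    for p in sorted(Path(x) for x in paths):
        if p.suffix.lower() != ".pdf" or p.stem.endswith(suffix):
            continue
        plan.append((p, output_path(p, suffix)))
    return plan


def run_batch(directory):
    """Run `ocrmypdf input.pdf output.pdf` for each planned pair (requires ocrmypdf on PATH)."""
    for src, dst in plan_batch(Path(directory).iterdir()):
        subprocess.run(["ocrmypdf", str(src), str(dst)], check=True)


# Dry run on hypothetical file names: only prints what would be processed.
for src, dst in plan_batch(["scans/b.pdf", "scans/a.pdf", "scans/a_ocr.pdf", "scans/notes.txt"]):
    print(src, "->", dst)
```

Calling run_batch("scans") would then process the folder for real; trying it on a shortened PDF first, as suggested above, keeps mistakes cheap.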
edited Nov 30 at 16:23 by airdas
answered May 30 '14 at 8:20 by don.joey (accepted answer)
Would you accept your answer, to resolve it? (So that it does not show up in the unanswered list.)
– Registered User
Jun 19 '14 at 13:37
Just sudo -H pip install git+https://github.com/jbarlow83/OCRmyPDF for Ubuntu 16.04
– Martin Thoma
Aug 14 '17 at 20:39
For Ubuntu 16.10 and later, you can just do sudo apt install ocrmypdf.
– endolith
Feb 26 at 16:46
pdfsandwich performs exactly this job. I wasn't aware that there is a package provided in the software center, but I'm providing Ubuntu deb packages for it on the project website (see http://www.tobias-elze.de/pdfsandwich/ for details), including the currently most recent version (0.1.2), which is unlikely to be in any software center yet.
If you have a scanned file scanned_file.pdf, simply call
pdfsandwich scanned_file.pdf
which generates the file scanned_file_ocr.pdf with the recognized text added to the scanned pages.
Compared to most existing solutions, it autodetects the installed tesseract version and adapts its behavior accordingly. In addition, it preprocesses the scanned images prior to the OCR process, for example by de-skewing them or removing dark edges, which can considerably improve optical character recognition.
DISCLAIMER: I'm the developer of pdfsandwich and therefore heavily biased.
edited Oct 10 '15 at 12:44 by Nephente
answered Jul 24 '14 at 14:29 by Tobias Elze
It sounds great, but why does pdfsandwich version 0.1.4 installed using apt-get convert each character into a black rectangle for me on Ubuntu 16.04?
– Valentas
Dec 16 '16 at 16:04
That's hard to answer without further details. First of all, I recommend using a more recent version of the tool. The current version is 0.1.6. You can find deb packages for Ubuntu on the website. Second, if that does not help, you may want to use the option -verbose to get further details and use these details to file a bug report.
– Tobias Elze
Jan 17 '17 at 1:39
@don.joey answered with the ocrmypdf script. However, it can be installed directly now (from 16.10 onwards).
sudo apt install ocrmypdf
Then you have to install the tesseract languages you need.
To list which languages are already on your system, type:
tesseract --list-langs
If one is missing, install it. For instance, for Spanish:
sudo apt install tesseract-ocr-spa
Now you can produce a searchable PDF (whose quality will vary, depending on the scanned document) with the following command:
ocrmypdf -l 'spa' old.pdf new.pdf
You can, of course, check its man page for some additional options.
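The language check and the final command can also be scripted. The following is a rough Python sketch, not part of ocrmypdf itself: the function names are made up for illustration, and the parser assumes that tesseract --list-langs prints a header line ending in a colon followed by one language code per line. Older tesseract releases write this list to stderr, so the sketch captures both streams. Joining several codes with "+" for the -l option follows tesseract's usual convention.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line, which is assumed to end with a colon.
        if not line or line.endswith(":"):
            continue
        langs.add(line)
    return langs


def installed_langs():
    """Ask tesseract which languages are installed (requires tesseract on PATH)."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # older versions print the list on stderr
        text=True,
    )
    return parse_list_langs(result.stdout)


def ocr_command(src, dst, langs):
    """Build the ocrmypdf argument list, joining several languages with '+'."""
    return ["ocrmypdf", "-l", "+".join(langs), src, dst]


# Offline demonstration on a hypothetical sample of the tesseract output.
sample = "List of available languages (3):\neng\nosd\nspa\n"
missing = {"spa", "deu"} - parse_list_langs(sample)
print("missing:", sorted(missing))
print(ocr_command("old.pdf", "new.pdf", ["spa"]))
```

In real use, installed_langs() would replace the sample text, and the list returned by ocr_command could be passed to subprocess.run once no language is missing.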
answered Feb 11 '17 at 21:05 by Ludenticus
Have my upvote sir!
– don.joey
Feb 13 '17 at 8:36
OCRfeeder has a bug in
/usr/lib/python2.7/dist-packages/reportlab/pdfgen/textobject.py
Line 436 should read:
lines = asUnicode(stuff).strip().split('\n')
# bug here, was:
# lines = '\n'.split(asUnicode(stuff).strip())
I changed this and it worked for me.
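To see why the order matters, here is a small standalone Python illustration. It assumes the separator in the snippet above is the newline escape, and it uses a made-up sample string rather than reportlab's asUnicode helper.

```python
text = "first line\nsecond line\n"

# Buggy order: this splits the one-character string "\n", using the whole text
# as the delimiter. The delimiter never occurs in "\n", so the text is lost.
buggy = "\n".split(text.strip())

# Fixed order: split the text itself on newlines.
fixed = text.strip().split("\n")

print(buggy)  # ['\n']
print(fixed)  # ['first line', 'second line']
```

With the buggy order the result no longer contains any of the original text, which would be consistent with the PDF export producing nothing useful.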
answered Jan 9 '17 at 22:24 by AndreR