How to extract text from a PDF with embedded subset fonts












pdftotext (from Xpdf) works fine on PDF files with ordinary embedded fonts, but fails on files with embedded subset fonts. Is there any workaround for this issue?










pdf embedded-fonts xpdf

asked Oct 8 '13 at 9:20 by Nishanth Lawrence
edited Dec 16 '15 at 7:01 by Der Hochstapler






















2 Answers
The issue is probably that the characters rendered with the subset font use a custom encoding: the numeric character codes do not correspond to ASCII, Latin-1, or any other common encoding.

See

• PDF Font encoding
• Unsearchable, uncopiable PDF document
• How do I know if the fonts in a PDF file are embedded or not?

This means there isn't an easy workaround.






answered Oct 8 '13 at 9:23 by RedGrittyBrick
edited May 23 '17 at 12:41 by Community
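To make the encoding problem concrete, here is a minimal Python sketch. It assumes a hypothetical subset font that contains only the glyphs for "H", "e", "l" and "o" and numbers them 1 to 4 in order of first use. The byte string and the `to_unicode` table are invented for illustration and are not taken from any real PDF. An extractor that assumes a standard encoding gets control characters, while one that has a code-to-Unicode table (in a real PDF this role is played by the font's optional ToUnicode CMap) recovers the text.

```python
from typing import Mapping

# Character codes as they might appear in a page's content stream for a
# hypothetical subset font (codes assigned in order of first use).
raw_codes = bytes([1, 2, 3, 3, 4])

# Hypothetical code -> Unicode table, standing in for a ToUnicode CMap.
to_unicode = {1: "H", 2: "e", 3: "l", 4: "o"}


def decode_naive(codes: bytes) -> str:
    """Interpret each code as Latin-1, as if the font used a standard encoding."""
    return codes.decode("latin-1")


def decode_with_map(codes: bytes, table: Mapping[int, str]) -> str:
    """Translate each code through the table; unknown codes become U+FFFD."""
    return "".join(table.get(code, "\ufffd") for code in codes)


naive = decode_naive(raw_codes)
mapped = decode_with_map(raw_codes, to_unicode)

print(repr(naive))  # control characters, not readable text
print(mapped)       # Hello
```

When a PDF carries no usable table of this kind, the codes alone do not say which characters they stand for, which is consistent with the conclusion that there is no easy workaround.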

























In this situation, I have printed the PDFs with the Adobe PDF printer at high resolution (1200 dpi or more) and high image quality (turn up any settings you can). Then I OCR the resulting image PDF, which leaves me with a searchable, workable PDF.

When I have many PDFs totaling thousands of pages, I open multiple PDF windows at once so that several PDFs are processed simultaneously on multiple cores. It is a PITA, but it works.

Hopefully your files are small! I once did this for upwards of 10,000 pages (building code books). Not fun.






answered Oct 8 '13 at 9:45 by Damon
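The print-then-OCR approach can also be scripted. The sketch below is not the Adobe PDF printer workflow described in this answer. As an assumption, it swaps in two command-line tools, Poppler's `pdftoppm` to rasterize pages and `tesseract` to OCR them, and it assumes both are installed and on `PATH`. The 300 dpi default, the file names, and the helper names are illustrative choices. A thread pool launches the external tools for several PDFs at once so they can occupy multiple cores, much like opening several windows by hand.

```python
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def rasterize_cmd(pdf: Path, prefix: Path, dpi: int = 300) -> list:
    """Build the pdftoppm command that renders each page to <prefix>-N.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image: Path, out_base: Path) -> list:
    """Build the tesseract command that writes recognized text to <out_base>.txt."""
    return ["tesseract", str(image), str(out_base)]


def page_number(image: Path) -> int:
    """Extract N from a file named page-N.png so pages sort numerically."""
    return int(image.stem.rsplit("-", 1)[1])


def ocr_pdf(pdf: Path, dpi: int = 300) -> str:
    """Rasterize one PDF, OCR every page image, and return the combined text."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_cmd(pdf, prefix, dpi), check=True)
        pages = []
        for image in sorted(Path(tmp).glob("page-*.png"), key=page_number):
            out_base = image.with_suffix("")
            subprocess.run(ocr_cmd(image, out_base), check=True)
            pages.append(out_base.with_suffix(".txt").read_text(encoding="utf-8"))
        return "\n".join(pages)


def ocr_many(pdfs: list, workers: int = 4) -> dict:
    """OCR several PDFs concurrently; the external tools run in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(pdfs, pool.map(ocr_pdf, pdfs)))
```

Calling `ocr_many([Path("a.pdf"), Path("b.pdf")])` would return a dict mapping each path to its recognized text. Tesseract can also emit a searchable PDF instead of plain text if `pdf` is appended to its command line.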













• Thanks for the answer. But how come the PDF viewer is able to interpret it correctly?
  – Nishanth Lawrence, Oct 8 '13 at 12:52

• Probably because the encoding is embedded in the PDF, not the program.
  – Damon, Oct 8 '13 at 16:53
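One way to picture this exchange: a viewer only has to go from character code to glyph shape, and the embedded font carries that mapping, whereas a text extractor has to go from character code to a Unicode character, which the PDF may not describe. The sketch below models this with a hypothetical two-glyph font. The glyph "shapes" are placeholder strings and all names are invented for illustration.

```python
from typing import Dict, Optional

# Hypothetical embedded subset font: character code -> glyph outline.
# Real fonts store vector outlines; short strings stand in for them here.
embedded_glyphs: Dict[int, str] = {
    1: "<outline of a capital A>",
    2: "<outline of a capital B>",
}

# Code -> Unicode information is optional in a PDF; None models its absence.
code_to_unicode: Optional[Dict[int, str]] = None


def render(codes: bytes) -> list:
    """What a viewer needs: look up the shape to draw for each code."""
    return [embedded_glyphs[code] for code in codes]


def extract_text(codes: bytes) -> Optional[str]:
    """What an extractor needs: a character for each code, if one is known."""
    if code_to_unicode is None:
        return None  # shapes are available, character identities are not
    return "".join(code_to_unicode[code] for code in codes)


shapes = render(bytes([1, 2]))
text = extract_text(bytes([1, 2]))
print(shapes)  # both outlines are found, so the page displays correctly
print(text)    # None: nothing says code 1 means "A"
```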


















