How do I train tesseract to ignore the wavy lines added from spelling and grammar error detection?



























I am using Tesseract to detect text in a variety of image types, including screenshots. It gets confused by the wavy red and blue underlines that spelling and grammar checkers add, as in the example below, and I end up with either no text or a garbled mess.



enter image description here



I have tried eliminating these lines with ImageMagick pre-processing, with some success, but those methods also wipe out any text that is red or blue, which is undesirable. They also take a long time to run, and I need to process over 100k images per day. I am thinking there may be a way to train Tesseract to recognize and discard these lines, but I'm not sure how that would work.
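To make the trade-off concrete, here is a minimal sketch of one way a colour filter could be narrowed so that it only erases underline-shaped marks. It is not the ImageMagick command I used, and it is untested on real screenshots: the function names, the colour threshold `min_gap`, and the shape limits `max_height` and `min_aspect` are all assumptions chosen for illustration. The idea is to group red or blue pixels into connected components and whiten only those that are short and wide, so that coloured glyphs (which are usually taller) are left alone.

```python
from collections import deque

WHITE = (255, 255, 255)


def is_marker_color(pixel, min_gap=80):
    """True for strongly red or strongly blue pixels (threshold is a guess)."""
    r, g, b = pixel
    return (r - max(g, b) >= min_gap) or (b - max(r, g) >= min_gap)


def remove_squiggles(image, max_height=4, min_aspect=4.0):
    """Whiten red/blue connected components that are short and wide.

    image: list of rows, each row a list of (r, g, b) tuples.
    Returns a new image; the input is left unchanged.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    out = [list(row) for row in image]
    seen = [[False] * width for _ in range(height)]

    for y0 in range(height):
        for x0 in range(width):
            if seen[y0][x0] or not is_marker_color(image[y0][x0]):
                continue
            # Flood-fill one 8-connected component of marker-coloured pixels.
            component = []
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and not seen[ny][nx]
                                and is_marker_color(image[ny][nx])):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            ys = [p[0] for p in component]
            xs = [p[1] for p in component]
            box_h = max(ys) - min(ys) + 1
            box_w = max(xs) - min(xs) + 1
            # An underline is much wider than it is tall; a glyph usually is not.
            if box_h <= max_height and box_w >= min_aspect * box_h:
                for y, x in component:
                    out[y][x] = WHITE
    return out
```

This has obvious limits: a red hyphen or underscore would be erased too, and a squiggle that touches a red descender would merge with that glyph and survive. Pure Python is also far too slow for 100k images per day, so a real pipeline would need the same logic in a vectorised image library.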



I have seen tutorials on how to train Tesseract to recognize text, but nothing on how to train it to recognize something that isn't text. Is there a way I can train Tesseract, or do something with the Leptonica setup it uses, to ignore these lines?



If anyone has successfully dealt with this, please let me know; otherwise, what would the recommended approach be?










      imagemagick tesseract-ocr
















      asked Jan 13 '17 at 9:45









      GdD























          1 Answer
































          I am currently trying to learn how to train Tesseract (I'm stuck on how to create the LSTM files for training), but I know that you can fine-tune existing trained data. I use jTessBoxEditor to correct the mistakes Tesseract makes during OCR. I just haven't found a way to turn those corrections into training, but I think that tool is what you need.



          Using jTessBoxEditor you can see how the OCR was done on your picture, and you can also edit the result. I'm still stuck on how to implement the training (still waiting on a response on the forum and also here), so I can't really help more; that's as far as I got. I also wouldn't expect anyone else to answer your question, as it is 2 years old, so your setup is probably already outdated. I'm trying tesseract-ocr 4.*, and training changed a lot in the new version, but the tools have evolved too. Your problem should be doable with jTessBoxEditor, but I don't know how to implement it, so this is only a partial answer.
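For readers unfamiliar with it: the box files that jTessBoxEditor edits are plain text, with one symbol per line followed by its left, bottom, right, and top pixel coordinates (origin at the bottom-left of the image) and a page number. Below is a minimal sketch of reading one such line, assuming that classic layout. The `Box` type, the `parse_box_line` function, and the sample values are hypothetical and not part of Tesseract or jTessBoxEditor.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box-file line of the form 'symbol left bottom right top page'."""
    # Split from the right so a symbol that is itself a space stays intact.
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(symbol, left, bottom, right, top, page)


# Hypothetical sample line, not taken from a real box file.
box = parse_box_line("T 36 92 53 115 0")
print(box.symbol, box.right - box.left, box.top - box.bottom)
```

Inspecting the boxes this way at least shows where Tesseract thinks the symbols are. Whether editing them is enough to teach it to skip the underlines is exactly the part I have not worked out.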



          I hope this helps, even if just a little bit.



























          • Sorry about that. The link for the said tool has the exact page where it describes how to set the box tesseract recognises.

            – Kristóf Horváth
            Jan 30 at 9:50











          • Link also includes pictures and how to download

            – Kristóf Horváth
            Jan 30 at 10:03











          edited Jan 30 at 9:55

























          answered Jan 30 at 8:35









          Kristóf Horváth
