How do I train tesseract to ignore the wavy lines added from spelling and grammar error detection?

I am using tesseract to detect text in a variety of image types, including screenshots. It gets confused by the wavy red and blue underlines for spelling and grammar warnings, like the example below, and I end up with either no text or a garbled mess.

enter image description here

I have looked at ways to eliminate these lines in imagemagick pre-processing with some success, but these methods wipe out any text which is red or blue, which is undesirable - plus they take a long time to run and I need to process over 100k images per day. I am thinking that maybe there is a way to train tesseract to recognize and discard these lines, but I'm not sure how that would work.
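One way to keep red or blue text while dropping the underlines is to filter by shape as well as colour: a squiggle is a long, very short connected run of coloured pixels, whereas coloured letters are taller. Below is a minimal pure-Python sketch of that idea. The colour thresholds, the `max_height` and `min_width` limits, and the function names are all assumptions chosen for illustration, and an underline that touches a descender of same-coloured text would be merged with it and kept. Pure Python is far too slow for 100k images per day, so this only shows the logic.

```python
from collections import deque


def is_underline_colour(pixel):
    """Rough test for saturated red or blue; the thresholds are assumptions."""
    r, g, b = pixel
    return (r > 180 and g < 100 and b < 100) or (b > 180 and r < 100 and g < 100)


def remove_squiggles(image, max_height=4, min_width=10, background=(255, 255, 255)):
    """Blank out red/blue connected components that are short and wide.

    image is a list of rows, each row a list of (r, g, b) tuples.
    A new image is returned; the input is left untouched.
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    result = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if seen[y][x] or not is_underline_colour(image[y][x]):
                continue
            # Breadth-first search collects one 8-connected coloured component.
            seen[y][x] = True
            queue = deque([(y, x)])
            component = []
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and not seen[ny][nx]
                                and is_underline_colour(image[ny][nx])):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            ys = [p[0] for p in component]
            xs = [p[1] for p in component]
            box_h = max(ys) - min(ys) + 1
            box_w = max(xs) - min(xs) + 1
            # Only short, wide components are treated as underlines.
            if box_h <= max_height and box_w >= min_width:
                for cy, cx in component:
                    result[cy][cx] = background
    return result


# Synthetic demo: a tall red stroke (stands in for red text) and a zigzag underline.
WHITE, RED = (255, 255, 255), (255, 0, 0)
demo = [[WHITE] * 30 for _ in range(20)]
for y in range(2, 12):
    demo[y][5] = RED
for x in range(3, 27):
    offset = (0, 1, 2, 1)[x % 4]  # 3-pixel-high zigzag
    demo[15 + offset][x] = RED
cleaned = remove_squiggles(demo)
```

In the demo the zigzag is erased because its bounding box is 3 pixels high and 24 wide, while the 10-pixel-tall stroke is kept even though it is the same colour.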

I have seen tutorials on how to train tesseract to recognize text, but I haven't seen anything on how to train it to recognize something that isn't text. Is there a way I can train tesseract, or do something with the Leptonica setup it uses, to ignore these lines?

If anyone has successfully dealt with this please let me know, otherwise what would the recommended approach be?

asked Jan 13 '17 at 9:45

GdD

imagemagick tesseract-ocr

1 Answer

I am currently trying to learn how to train tesseract (I'm stuck on how to create the LSTM files for training), but I know that you can fine-tune existing trained data. I use jTessBoxEditor to correct the mistakes tesseract makes during OCR. I just haven't found a way to feed those corrections back in as training, but I think that tool is just what you need.

With jTessBoxEditor you can see how the OCR was done on your picture and edit the result. I'm still stuck on how to implement the training (still waiting for a response on the forum and here), so I can't really help more than that. I also wouldn't expect anyone else to answer, since your question is 2 years old and your setup is probably outdated by now. I'm trying tesseract-ocr 4.*; training changed a lot in the new version, but the tools have evolved too. So your problem should be doable with jTessBoxEditor, but I don't know how to implement it, which makes this a partial answer rather than a full one.
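For context, the classic box files that jTessBoxEditor edits are plain text with one line per symbol in the form `symbol left bottom right top page`, with pixel coordinates measured from the bottom-left corner of the image. A minimal sketch, assuming that per-character layout, that reads the boxes and flags any shaped like an underline. The `looks_like_underline` heuristic, its thresholds, and the sample lines are hypothetical and not part of tesseract or jTessBoxEditor.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one 'symbol left bottom right top page' line into a Box."""
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(symbol, left, bottom, right, top, page)


def looks_like_underline(box, max_height=4, min_aspect=5.0):
    """Hypothetical heuristic: very short and much wider than tall."""
    width = box.right - box.left
    height = box.top - box.bottom
    return 0 < height <= max_height and width / height >= min_aspect


# Made-up sample: two letters and one wide, flat box under them.
sample = ["T 10 40 22 60 0", "e 24 40 34 52 0", "~ 10 30 60 33 0"]
boxes = [parse_box_line(line) for line in sample]
suspects = [box for box in boxes if looks_like_underline(box)]
```

Here only the third box ends up in `suspects`, since it is 50 pixels wide and 3 high. Whether removing such boxes before training actually teaches tesseract to ignore the squiggles is untested.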

I hope this helps, even if just a little bit.

edited Jan 30 at 9:55

answered Jan 30 at 8:35

Kristóf Horváth


Sorry about that. The link for the said tool has the exact page where it describes how to set the box tesseract recognises.

– Kristóf Horváth
Jan 30 at 9:50

Link also includes pictures and how to download

– Kristóf Horváth
Jan 30 at 10:03
