Machine learning testing data
I am new to machine learning, so this might be a bit of a stupid question.
I have implemented my model and it's working. I have a question about running it on the test data. It's a binary classification problem. If I know the proportions of the classes in the test data, how could I use that to improve the performance of my model or the predictions it makes? Let's say, for example, that 75% of the test samples belong to class 1 and 25% to class 0.
Any help is greatly appreciated.
machine-learning python
asked 2 hours ago
Jack
61
2 Answers
No, your model isn't supposed to know anything about your test data. If you include clues in your training about what's in your test data, you introduce something called data leakage.
Data leakage leads to overfitting, which will give you good results on that particular test set but won't generalize to other data.
Let's say you deploy this model in production and feed it real-life data it has never encountered before: the predictions will fall below the expectations you had in the training/testing phases, because of the two phenomena I mentioned.
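To make the leakage point concrete, here is a minimal sketch of the discipline to follow: every fitting step, including preprocessing, sees only the training split. The dataset here is a synthetic placeholder standing in for your own features and binary labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data with roughly a 25% / 75% class split.
X, y = make_classification(n_samples=1000, weights=[0.25, 0.75], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Fit the scaler on the training split only; fitting it on the full
# dataset would leak test-set statistics into training.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

# The test set is touched exactly once, for the final evaluation.
print(model.score(scaler.transform(X_test), y_test))
```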
Instead, I suggest you tweak your model a bit more during the training phase: clean your data further, apply oversampling or undersampling if the target classes are imbalanced (for example, 90%/10% proportions in your training dataset), pick better features, and so on.
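If you want to try resampling, a minimal sketch using the imbalanced-learn package might look like the following; the package choice and the synthetic data are assumptions, and the resampling is applied to the training split only, never to the test set.

```python
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Synthetic placeholder training data with roughly a 90% / 10% imbalance.
X_train, y_train = make_classification(
    n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Oversampling duplicates minority-class rows until the classes match...
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)
# ...while undersampling discards majority-class rows instead.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)

print(Counter(y_train), Counter(y_over), Counter(y_under))
```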
In conclusion: adjusting your model to make good predictions on your test data in particular is not good practice, and it will lead to a model that performs badly on unseen data.
edited 2 hours ago
answered 2 hours ago
Blenzus
If your results differ from your expectations, you should look at the individual errors and use those to correct the learning. If you use the percentage instead, the algorithm will potentially learn something completely different from what it's supposed to learn.
However, if you really do need to go by such statistics (for instance, because your model is supposed to learn about its own mistakes and correct them autonomously), I suggest adding another dimension to its ability to learn, such as a 'confidence' value that increases in the nodes involved in a better result and decreases in the nodes involved when things went worse. Nodes with low confidence might change faster or stop their activity completely; nodes with high confidence would change less easily.
Since you didn't say which learning algorithm you use, a 'node' here can be anything from a data point in a table to a simulated neuron or its individual connections.
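Purely as an illustration of the confidence idea above (it is not a standard named algorithm), here is one hypothetical sketch in which each node's confidence scales how fast it changes; every name and the exact update rule are assumptions.

```python
import numpy as np

# Hypothetical sketch, not a standard algorithm: each "node" carries a
# confidence score that scales its effective learning rate.
rng = np.random.default_rng(0)
n_nodes = 8
weights = rng.normal(size=n_nodes)
confidence = np.full(n_nodes, 0.5)  # every node starts at middling confidence
base_lr = 0.1

def update(gradient, involved, result_improved):
    """Adjust the involved nodes; low-confidence nodes move faster."""
    # The effective step shrinks as a node's confidence grows.
    weights[involved] -= base_lr * (1.0 - confidence[involved]) * gradient[involved]
    # Reward or punish the confidence of the nodes that took part.
    confidence[involved] += 0.05 if result_improved else -0.05
    np.clip(confidence, 0.0, 1.0, out=confidence)

# Toy usage: pretend nodes 0-3 contributed to an improved result.
update(rng.normal(size=n_nodes), involved=np.arange(4), result_improved=True)
print(confidence)  # nodes 0-3 are now slightly more confident
```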
One level higher, your model may include a module that tracks the connections between changes in confidence. This would help you avoid cycles in which one increase always comes with a decrease elsewhere, and vice versa, or an actually wrong circuit developing false confidence and damaging the learning in the rest of your model. So if five increases in confidence in one area lead to six decreases in confidence in another area in the next round (i.e., worse long-term results), the confidence changes might be applied differently. This module would also learn, in the normal way, to choose better what to influence and when.
It will require some fine-tuning before that module makes learning faster than the usual approach alone. Prepare for lots of test sets, or a game in which your different models play against one another on similar data and are fine-tuned using an evolutionary approach.
You should also make sure you find a way to test the AI with atypical data, and a way for it to check whether the strange results were correct, much as humans take tests and then get the solutions.
answered 1 min ago
Carl Dombrowski
New contributor