Data Structures for Counting Duplicates and using std::vector::erase











up vote
4
down vote

favorite












Problem



Dupe detection for a vector of ints. I simply want a count of the unique input characters that occurred at least twice. The goal is to count a dupe only once and ignore that input character if another dupe of it is seen in the future. A test input could look something like this vector<int> test = { 4,5,9,6,9,9,6,3,4 };



Looking for Feedback on



Looking for basic feedback on the data structures I'm using and the possibility of using the vector erase method to iterate and take advantage of the space allocated to my numbers vector instead of using a map to not count dups more than once. Any C++ 11 or 17 features I can take advantage of here too?



int countDuplicates(vector<int> numbers) {
int dups = 0;
set<int> s;
map<int, int> m;
for (int n : numbers) {
if (s.insert(n).second == false && m.find(n) == m.end()) {
dups++;
m.insert(pair<int, int>(n,0));
// better to remove from vector than increase space with the map?
// numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
} else {
s.insert(n);
}
}

return dups;
}









share|improve this question
























  • Not really enough for a full answer, but m.insert(pair<int, int>(n, 0)) can be replaced with simply m.emplace(n, 0) saving you from writing out the pair constructor.
    – Kyle
    Nov 29 at 8:04















up vote
4
down vote

favorite












Problem



Dupe detection for a vector of ints. I simply want a count of the unique input characters that occurred at least twice. The goal is to count a dupe only once and ignore that input character if another dupe of it is seen in the future. A test input could look something like this vector<int> test = { 4,5,9,6,9,9,6,3,4 };



Looking for Feedback on



Looking for basic feedback on the data structures I'm using and the possibility of using the vector erase method to iterate and take advantage of the space allocated to my numbers vector instead of using a map to not count dups more than once. Any C++ 11 or 17 features I can take advantage of here too?



int countDuplicates(vector<int> numbers) {
int dups = 0;
set<int> s;
map<int, int> m;
for (int n : numbers) {
if (s.insert(n).second == false && m.find(n) == m.end()) {
dups++;
m.insert(pair<int, int>(n,0));
// better to remove from vector than increase space with the map?
// numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
} else {
s.insert(n);
}
}

return dups;
}









share|improve this question
























  • Not really enough for a full answer, but m.insert(pair<int, int>(n, 0)) can be replaced with simply m.emplace(n, 0) saving you from writing out the pair constructor.
    – Kyle
    Nov 29 at 8:04













up vote
4
down vote

favorite









up vote
4
down vote

favorite











Problem



Dupe detection for a vector of ints. I simply want a count of the unique input characters that occurred at least twice. The goal is to count a dupe only once and ignore that input character if another dupe of it is seen in the future. A test input could look something like this vector<int> test = { 4,5,9,6,9,9,6,3,4 };



Looking for Feedback on



Looking for basic feedback on the data structures I'm using and the possibility of using the vector erase method to iterate and take advantage of the space allocated to my numbers vector instead of using a map to not count dups more than once. Any C++ 11 or 17 features I can take advantage of here too?



int countDuplicates(vector<int> numbers) {
int dups = 0;
set<int> s;
map<int, int> m;
for (int n : numbers) {
if (s.insert(n).second == false && m.find(n) == m.end()) {
dups++;
m.insert(pair<int, int>(n,0));
// better to remove from vector than increase space with the map?
// numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
} else {
s.insert(n);
}
}

return dups;
}









share|improve this question















Problem



Dupe detection for a vector of ints. I simply want a count of the unique input characters that occurred at least twice. The goal is to count a dupe only once and ignore that input character if another dupe of it is seen in the future. A test input could look something like this vector<int> test = { 4,5,9,6,9,9,6,3,4 };



Looking for Feedback on



Looking for basic feedback on the data structures I'm using and the possibility of using the vector erase method to iterate and take advantage of the space allocated to my numbers vector instead of using a map to not count dups more than once. Any C++ 11 or 17 features I can take advantage of here too?



int countDuplicates(vector<int> numbers) {
int dups = 0;
set<int> s;
map<int, int> m;
for (int n : numbers) {
if (s.insert(n).second == false && m.find(n) == m.end()) {
dups++;
m.insert(pair<int, int>(n,0));
// better to remove from vector than increase space with the map?
// numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
} else {
s.insert(n);
}
}

return dups;
}






c++ vectors hash-map set






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 29 at 0:05

























asked Nov 28 at 21:05









greg

29017




29017












  • Not really enough for a full answer, but m.insert(pair<int, int>(n, 0)) can be replaced with simply m.emplace(n, 0) saving you from writing out the pair constructor.
    – Kyle
    Nov 29 at 8:04


















  • Not really enough for a full answer, but m.insert(pair<int, int>(n, 0)) can be replaced with simply m.emplace(n, 0) saving you from writing out the pair constructor.
    – Kyle
    Nov 29 at 8:04
















Not really enough for a full answer, but m.insert(pair<int, int>(n, 0)) can be replaced with simply m.emplace(n, 0) saving you from writing out the pair constructor.
– Kyle
Nov 29 at 8:04




Not really enough for a full answer, but m.insert(pair<int, int>(n, 0)) can be replaced with simply m.emplace(n, 0) saving you from writing out the pair constructor.
– Kyle
Nov 29 at 8:04










2 Answers
2






active

oldest

votes

















up vote
7
down vote



accepted










Basic Algorithm



At least if I understand the intent correctly, you simply want a count of the unique input characters that occurred at least twice.



In that case, I think I'd do something like this:



int count_dupes(std::vector<int> const &inputs) { 
std::map<int, int> counts;

for (auto i : inputs)
++counts[i];

return std::count_if(counts.begin(), counts.end(),
(auto const &p) { return p.second >= 2; });
}


I'd also consider using an array instead of a map, as outlined in an answer to an earlier question: https://codereview.stackexchange.com/a/208502/489 --but this can depend on the range of values you're dealing with. With a 16-bit int, it's no problem at all on most machines. With a 32-bit int (and no other constraints on values) it's still possible on many machines, but probably impractical. For arbitrary 64-bit int, an array won't be practical.



Parameter Passing



Right now, you're passing the input by value. This means when you call the function with some vector, a copy of the original vector will normally be made and passed to the function. As a general rule, something like a vector that's potentially large and slow to copy should be passed by reference to const, as shown in the code above.



Logical Comparisons



Comparing a Boolean value to true or false is generally a poor idea. if (x==true) is equivalent to if (x) and if (x == false) is equivalent to if (!x). Normally, if it's Boolean in nature, a variable should be given a name that reflects that nature, and should be used directly rather than being compared to true or false. For example, s.insert(n).second == false wold be better written as: if (!s.insert(n).second).



Some people (understandably, I guess) prefer to use the written form: if if (not s.insert(n).second). I've written C and C++ long enough that I have no difficulty with reading ! as meaning "not", but especially if it may be read by people less accustomed to programming, it may make more sense to use the words instead of symbols.



Formatting/Indentation



At least to me, this indentation looks a bit odd:



  if (s.insert(n).second == false && m.find(n) == m.end()) {
dups++;
m.insert(pair<int, int>(n,0));
// better to remove from vector than increase space with the map?
// numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
} else {
s.insert(n);
}


If you use indentation like that consistently, I guess it's not necessarily terrible, but I think more people are accustomed to something more like this:



  if (s.insert(n).second == false && m.find(n) == m.end()) {
dups++;
m.insert(pair<int, int>(n,0));
// better to remove from vector than increase space with the map?
// numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
} else {
s.insert(n);
}


...where each closing brace is vertically aligned with the beginning of the block it closes. As a side-note, there are almost endless debates about the efficacy of various bracing styles. I'm not going to advocate for or against any of the well known styles, but I think there's a fair amount to be gained from using a style that's well known, and then using it consistently. I don't see much to gain from style that's different from what almost anybody else uses.






share|improve this answer























  • If my understanding is correct, count_if iterates over counts and increments p when it finds a unique input that occurred at least twice. ++counts[i] is a very clean map update, I've not seen this before and took me a moment to understand, but both the key and value are updated. Agree on logical comparison feedback. I've gone back and fort with false and !, I've used false more often for readability, but I'm at the point in my life we're brevity and speed are becoming more important. Brace indent was a paste error, but good catch and raised my awareness of this detail.
    – greg
    Nov 29 at 0:04






  • 1




    Yes, ++counts[i] will insert a new record for i if it hasn't been inserted yet. That newly inserted record will have its count at 0. The ++ will then increment the current count. count_if just counts the number of elements in a collection that meet the specified criteria, so it basically just counts and returns the number of items for which your predicate returned true.
    – Jerry Coffin
    Nov 29 at 1:39






  • 2




    It would also be quite easy to do with the original version, as it makes a copy. One could then sort copied vector, apply std::unique and get std::distance between begin and returned iterator. It is not really certain which version is better though. I've written this comment with relation to the one above, but it seems to be deleted now.
    – Incomputable
    Nov 29 at 11:37












  • You might want to attach a caveat to the suggestion to use an array, since int has a much wider range of values than char (so we're not in quite the same context as that other question).
    – Toby Speight
    Nov 29 at 11:49






  • 1




    @Incomputable: yes, it was my comment, I had some doubts about its validity after reading the original question again. It might have been justified. / sadly std::unique doesn't do the job if what we need to count is the number of elements appearing at least twice
    – papagaga
    Nov 29 at 11:57


















up vote
5
down vote













I don't agree with @JerryCoffin on two accounts: algorithm and paramater passing, the latter being a consequence of the former. That's why I submit this extra review, even if @JerryCoffin's has already been accepted, and even if I agree with the other points he made.



When you design an algorithm, especially in C++, you want it to be as efficient as possible, in as many situations as possible. It's a good idea to take a look at existing algorithms in the standard library to see how it can be achieved, all the more when there is an algorithm there that is closely related to the one you're designing: std::unique, that removes all but the first of consecutive equivalent elements. What's interesting is 1) that it operates on a sorted range and 2) that it modifies the input sequence: thus it makes it optimal when the input sequence is already sorted, and also when it's disposable. Can we benefit from std::uniques interface in our largely similar problem? I would say so:



#include <algorithm>

template <typename Iterator>
int count_duplicates(Iterator first, Iterator last) {
// requires a sorted range
int count = 0;
while (true) {
first = std::adjacent_find(first, last);
if (first == last) return count;
first = std::adjacent_find(++first, last, std::not_equal_to<>());
++count;
}
}


Let's now compare with @JerryCoffin's proposed solution, which allocates memory for a std::map and then has in all cases a complexity of O(n*log(n)) for populating it + O(n) for counting elements with a frequency higher than 1:




  • if the input range is already sorted, this algorithm has O(n) complexity, which is better


  • if the input range is disposable but not sorted, this algorithm has the same complexity (O(n*log(n)) for prior sorting and O(n) for counting), but doesn't allocate memory and has better cache locality


  • if the input is neither sorted nor disposable, we have the same complexity and memory requirements (we need to copy the input range) but we keep the better cache locality



On the other hand it lacks the possibility of relying on a more efficient structure to count the occurrences of each element, such as an array or a hash table. We could then theoretically go from O(n*log(n)) to O(n) when looking for duplicates. But I'm still unconvinced because those data structures would be oversized if the input range has a small alphabet.



EDIT: I think I've read the submitted code and the question a bit too fast. If what we need is not only to count elements appearing at least twice, but erasing other elements of the vector, then the solution is different, even if most building blocks remain:



#include <vector>
#include <algorithm>
#include <iostream>

template <typename Iterator>
Iterator one_of_duplicates(Iterator first, Iterator last) {
// requires a sorted input
auto current = first;
while (true) {
// find a duplicated element, move it behind 'first'
// and find the next different element
current = std::adjacent_find(current, last);
if (current == last) return first;
*first++ = std::move(*current);
std::cerr << *current << std::endl;
current = std::adjacent_find(current, last, std::not_equal_to<>());
}
}


int main() {

std::vector<int> data = { 0, 1, 2, 3, 4, 5, 1, 2, 2, 3, 5, 5, 5 };
std::sort(data.begin(), data.end());
data.erase(one_of_duplicates(data.begin(), data.end()), data.end());
for (auto i : data) std::cout << i << ',';

}





share|improve this answer























    Your Answer





    StackExchange.ifUsing("editor", function () {
    return StackExchange.using("mathjaxEditing", function () {
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
    });
    });
    }, "mathjax-editing");

    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "196"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f208648%2fdata-structures-for-counting-duplicates-and-using-stdvectorerase%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    7
    down vote



    accepted










    Basic Algorithm



    At least if I understand the intent correctly, you simply want a count of the unique input characters that occurred at least twice.



    In that case, I think I'd do something like this:



    int count_dupes(std::vector<int> const &inputs) { 
    std::map<int, int> counts;

    for (auto i : inputs)
    ++counts[i];

    return std::count_if(counts.begin(), counts.end(),
    (auto const &p) { return p.second >= 2; });
    }


    I'd also consider using an array instead of a map, as outlined in an answer to an earlier question: https://codereview.stackexchange.com/a/208502/489 --but this can depend on the range of values you're dealing with. With a 16-bit int, it's no problem at all on most machines. With a 32-bit int (and no other constraints on values) it's still possible on many machines, but probably impractical. For arbitrary 64-bit int, an array won't be practical.



    Parameter Passing



    Right now, you're passing the input by value. This means when you call the function with some vector, a copy of the original vector will normally be made and passed to the function. As a general rule, something like a vector that's potentially large and slow to copy should be passed by reference to const, as shown in the code above.



    Logical Comparisons



    Comparing a Boolean value to true or false is generally a poor idea. if (x==true) is equivalent to if (x) and if (x == false) is equivalent to if (!x). Normally, if it's Boolean in nature, a variable should be given a name that reflects that nature, and should be used directly rather than being compared to true or false. For example, s.insert(n).second == false wold be better written as: if (!s.insert(n).second).



    Some people (understandably, I guess) prefer to use the written form: if if (not s.insert(n).second). I've written C and C++ long enough that I have no difficulty with reading ! as meaning "not", but especially if it may be read by people less accustomed to programming, it may make more sense to use the words instead of symbols.



    Formatting/Indentation



    At least to me, this indentation looks a bit odd:



      if (s.insert(n).second == false && m.find(n) == m.end()) {
    dups++;
    m.insert(pair<int, int>(n,0));
    // better to remove from vector than increase space with the map?
    // numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
    } else {
    s.insert(n);
    }


    If you use indentation like that consistently, I guess it's not necessarily terrible, but I think more people are accustomed to something more like this:



      if (s.insert(n).second == false && m.find(n) == m.end()) {
    dups++;
    m.insert(pair<int, int>(n,0));
    // better to remove from vector than increase space with the map?
    // numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
    } else {
    s.insert(n);
    }


    ...where each closing brace is vertically aligned with the beginning of the block it closes. As a side-note, there are almost endless debates about the efficacy of various bracing styles. I'm not going to advocate for or against any of the well known styles, but I think there's a fair amount to be gained from using a style that's well known, and then using it consistently. I don't see much to gain from style that's different from what almost anybody else uses.






    share|improve this answer























    • If my understanding is correct, count_if iterates over counts and increments p when it finds a unique input that occurred at least twice. ++counts[i] is a very clean map update, I've not seen this before and took me a moment to understand, but both the key and value are updated. Agree on logical comparison feedback. I've gone back and fort with false and !, I've used false more often for readability, but I'm at the point in my life we're brevity and speed are becoming more important. Brace indent was a paste error, but good catch and raised my awareness of this detail.
      – greg
      Nov 29 at 0:04






    • 1




      Yes, ++counts[i] will insert a new record for i if it hasn't been inserted yet. That newly inserted record will have its count at 0. The ++ will then increment the current count. count_if just counts the number of elements in a collection that meet the specified criteria, so it basically just counts and returns the number of items for which your predicate returned true.
      – Jerry Coffin
      Nov 29 at 1:39






    • 2




      It would also be quite easy to do with the original version, as it makes a copy. One could then sort copied vector, apply std::unique and get std::distance between begin and returned iterator. It is not really certain which version is better though. I've written this comment with relation to the one above, but it seems to be deleted now.
      – Incomputable
      Nov 29 at 11:37












    • You might want to attach a caveat to the suggestion to use an array, since int has a much wider range of values than char (so we're not in quite the same context as that other question).
      – Toby Speight
      Nov 29 at 11:49






    • 1




      @Incomputable: yes, it was my comment, I had some doubts about its validity after reading the original question again. It might have been justified. / sadly std::unique doesn't do the job if what we need to count is the number of elements appearing at least twice
      – papagaga
      Nov 29 at 11:57















    up vote
    7
    down vote



    accepted










    Basic Algorithm



    At least if I understand the intent correctly, you simply want a count of the unique input characters that occurred at least twice.



    In that case, I think I'd do something like this:



    int count_dupes(std::vector<int> const &inputs) { 
    std::map<int, int> counts;

    for (auto i : inputs)
    ++counts[i];

    return std::count_if(counts.begin(), counts.end(),
    (auto const &p) { return p.second >= 2; });
    }


    I'd also consider using an array instead of a map, as outlined in an answer to an earlier question: https://codereview.stackexchange.com/a/208502/489 --but this can depend on the range of values you're dealing with. With a 16-bit int, it's no problem at all on most machines. With a 32-bit int (and no other constraints on values) it's still possible on many machines, but probably impractical. For arbitrary 64-bit int, an array won't be practical.



    Parameter Passing



    Right now, you're passing the input by value. This means when you call the function with some vector, a copy of the original vector will normally be made and passed to the function. As a general rule, something like a vector that's potentially large and slow to copy should be passed by reference to const, as shown in the code above.



    Logical Comparisons



    Comparing a Boolean value to true or false is generally a poor idea. if (x==true) is equivalent to if (x) and if (x == false) is equivalent to if (!x). Normally, if it's Boolean in nature, a variable should be given a name that reflects that nature, and should be used directly rather than being compared to true or false. For example, s.insert(n).second == false wold be better written as: if (!s.insert(n).second).



    Some people (understandably, I guess) prefer to use the written form: if if (not s.insert(n).second). I've written C and C++ long enough that I have no difficulty with reading ! as meaning "not", but especially if it may be read by people less accustomed to programming, it may make more sense to use the words instead of symbols.



    Formatting/Indentation



    At least to me, this indentation looks a bit odd:



      if (s.insert(n).second == false && m.find(n) == m.end()) {
    dups++;
    m.insert(pair<int, int>(n,0));
    // better to remove from vector than increase space with the map?
    // numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
    } else {
    s.insert(n);
    }


    If you use indentation like that consistently, I guess it's not necessarily terrible, but I think more people are accustomed to something more like this:



      if (s.insert(n).second == false && m.find(n) == m.end()) {
    dups++;
    m.insert(pair<int, int>(n,0));
    // better to remove from vector than increase space with the map?
    // numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
    } else {
    s.insert(n);
    }


    ...where each closing brace is vertically aligned with the beginning of the block it closes. As a side-note, there are almost endless debates about the efficacy of various bracing styles. I'm not going to advocate for or against any of the well known styles, but I think there's a fair amount to be gained from using a style that's well known, and then using it consistently. I don't see much to gain from style that's different from what almost anybody else uses.






    share|improve this answer























    • If my understanding is correct, count_if iterates over counts and increments p when it finds a unique input that occurred at least twice. ++counts[i] is a very clean map update, I've not seen this before and took me a moment to understand, but both the key and value are updated. Agree on logical comparison feedback. I've gone back and fort with false and !, I've used false more often for readability, but I'm at the point in my life we're brevity and speed are becoming more important. Brace indent was a paste error, but good catch and raised my awareness of this detail.
      – greg
      Nov 29 at 0:04






    • 1




      Yes, ++counts[i] will insert a new record for i if it hasn't been inserted yet. That newly inserted record will have its count at 0. The ++ will then increment the current count. count_if just counts the number of elements in a collection that meet the specified criteria, so it basically just counts and returns the number of items for which your predicate returned true.
      – Jerry Coffin
      Nov 29 at 1:39






    • 2




      It would also be quite easy to do with the original version, as it makes a copy. One could then sort copied vector, apply std::unique and get std::distance between begin and returned iterator. It is not really certain which version is better though. I've written this comment with relation to the one above, but it seems to be deleted now.
      – Incomputable
      Nov 29 at 11:37












    • You might want to attach a caveat to the suggestion to use an array, since int has a much wider range of values than char (so we're not in quite the same context as that other question).
      – Toby Speight
      Nov 29 at 11:49






    • 1




      @Incomputable: yes, it was my comment, I had some doubts about its validity after reading the original question again. It might have been justified. / sadly std::unique doesn't do the job if what we need to count is the number of elements appearing at least twice
      – papagaga
      Nov 29 at 11:57













    up vote
    7
    down vote



    accepted







    up vote
    7
    down vote



    accepted






    Basic Algorithm



    At least if I understand the intent correctly, you simply want a count of the unique input characters that occurred at least twice.



    In that case, I think I'd do something like this:



    int count_dupes(std::vector<int> const &inputs) { 
    std::map<int, int> counts;

    for (auto i : inputs)
    ++counts[i];

    return std::count_if(counts.begin(), counts.end(),
    (auto const &p) { return p.second >= 2; });
    }


    I'd also consider using an array instead of a map, as outlined in an answer to an earlier question: https://codereview.stackexchange.com/a/208502/489 --but this can depend on the range of values you're dealing with. With a 16-bit int, it's no problem at all on most machines. With a 32-bit int (and no other constraints on values) it's still possible on many machines, but probably impractical. For arbitrary 64-bit int, an array won't be practical.



    Parameter Passing



    Right now, you're passing the input by value. This means when you call the function with some vector, a copy of the original vector will normally be made and passed to the function. As a general rule, something like a vector that's potentially large and slow to copy should be passed by reference to const, as shown in the code above.



    Logical Comparisons



    Comparing a Boolean value to true or false is generally a poor idea. if (x==true) is equivalent to if (x) and if (x == false) is equivalent to if (!x). Normally, if it's Boolean in nature, a variable should be given a name that reflects that nature, and should be used directly rather than being compared to true or false. For example, s.insert(n).second == false wold be better written as: if (!s.insert(n).second).



    Some people (understandably, I guess) prefer to use the written form: if if (not s.insert(n).second). I've written C and C++ long enough that I have no difficulty with reading ! as meaning "not", but especially if it may be read by people less accustomed to programming, it may make more sense to use the words instead of symbols.



    Formatting/Indentation



    At least to me, this indentation looks a bit odd:



      if (s.insert(n).second == false && m.find(n) == m.end()) {
    dups++;
    m.insert(pair<int, int>(n,0));
    // better to remove from vector than increase space with the map?
    // numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
    } else {
    s.insert(n);
    }


    If you use indentation like that consistently, I guess it's not necessarily terrible, but I think more people are accustomed to something more like this:



      if (s.insert(n).second == false && m.find(n) == m.end()) {
    dups++;
    m.insert(pair<int, int>(n,0));
    // better to remove from vector than increase space with the map?
    // numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
    } else {
    s.insert(n);
    }


    ...where each closing brace is vertically aligned with the beginning of the block it closes. As a side-note, there are almost endless debates about the efficacy of various bracing styles. I'm not going to advocate for or against any of the well known styles, but I think there's a fair amount to be gained from using a style that's well known, and then using it consistently. I don't see much to gain from style that's different from what almost anybody else uses.






    share|improve this answer














    Basic Algorithm



    At least if I understand the intent correctly, you simply want a count of the unique input characters that occurred at least twice.



    In that case, I think I'd do something like this:



    int count_dupes(std::vector<int> const &inputs) { 
    std::map<int, int> counts;

    for (auto i : inputs)
    ++counts[i];

    return std::count_if(counts.begin(), counts.end(),
    (auto const &p) { return p.second >= 2; });
    }


    I'd also consider using an array instead of a map, as outlined in an answer to an earlier question: https://codereview.stackexchange.com/a/208502/489 --but this can depend on the range of values you're dealing with. With a 16-bit int, it's no problem at all on most machines. With a 32-bit int (and no other constraints on values) it's still possible on many machines, but probably impractical. For arbitrary 64-bit int, an array won't be practical.



    Parameter Passing



    Right now, you're passing the input by value. This means when you call the function with some vector, a copy of the original vector will normally be made and passed to the function. As a general rule, something like a vector that's potentially large and slow to copy should be passed by reference to const, as shown in the code above.



    Logical Comparisons



    Comparing a Boolean value to true or false is generally a poor idea. if (x==true) is equivalent to if (x) and if (x == false) is equivalent to if (!x). Normally, if it's Boolean in nature, a variable should be given a name that reflects that nature, and should be used directly rather than being compared to true or false. For example, s.insert(n).second == false wold be better written as: if (!s.insert(n).second).



    Some people (understandably, I guess) prefer to use the written form: if if (not s.insert(n).second). I've written C and C++ long enough that I have no difficulty with reading ! as meaning "not", but especially if it may be read by people less accustomed to programming, it may make more sense to use the words instead of symbols.



    Formatting/Indentation



    At least to me, this indentation looks a bit odd:



      if (s.insert(n).second == false && m.find(n) == m.end()) {
    dups++;
    m.insert(pair<int, int>(n,0));
    // better to remove from vector than increase space with the map?
    // numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
    } else {
    s.insert(n);
    }


    If you use indentation like that consistently, I guess it's not necessarily terrible, but I think more people are accustomed to something more like this:



      if (s.insert(n).second == false && m.find(n) == m.end()) {
    dups++;
    m.insert(pair<int, int>(n,0));
    // better to remove from vector than increase space with the map?
    // numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
    } else {
    s.insert(n);
    }


    ...where each closing brace is vertically aligned with the beginning of the block it closes. As a side-note, there are almost endless debates about the efficacy of various bracing styles. I'm not going to advocate for or against any of the well known styles, but I think there's a fair amount to be gained from using a style that's well known, and then using it consistently. I don't see much to gain from style that's different from what almost anybody else uses.







    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Nov 29 at 15:16

























    answered Nov 28 at 23:24









    Jerry Coffin

    27.8k460125




    27.8k460125












    • If my understanding is correct, count_if iterates over counts and increments p when it finds a unique input that occurred at least twice. ++counts[i] is a very clean map update, I've not seen this before and took me a moment to understand, but both the key and value are updated. Agree on logical comparison feedback. I've gone back and fort with false and !, I've used false more often for readability, but I'm at the point in my life we're brevity and speed are becoming more important. Brace indent was a paste error, but good catch and raised my awareness of this detail.
      – greg
      Nov 29 at 0:04






    • 1




      Yes, ++counts[i] will insert a new record for i if it hasn't been inserted yet. That newly inserted record will have its count at 0. The ++ will then increment the current count. count_if just counts the number of elements in a collection that meet the specified criteria, so it basically just counts and returns the number of items for which your predicate returned true.
      – Jerry Coffin
      Nov 29 at 1:39






    • 2




      It would also be quite easy to do with the original version, as it makes a copy. One could then sort copied vector, apply std::unique and get std::distance between begin and returned iterator. It is not really certain which version is better though. I've written this comment with relation to the one above, but it seems to be deleted now.
      – Incomputable
      Nov 29 at 11:37












    • You might want to attach a caveat to the suggestion to use an array, since int has a much wider range of values than char (so we're not in quite the same context as that other question).
      – Toby Speight
      Nov 29 at 11:49






    • 1




      @Incomputable: yes, it was my comment, I had some doubts about its validity after reading the original question again. It might have been justified. / sadly std::unique doesn't do the job if what we need to count is the number of elements appearing at least twice
      – papagaga
      Nov 29 at 11:57


















    • If my understanding is correct, count_if iterates over counts and increments p when it finds a unique input that occurred at least twice. ++counts[i] is a very clean map update, I've not seen this before and took me a moment to understand, but both the key and value are updated. Agree on logical comparison feedback. I've gone back and fort with false and !, I've used false more often for readability, but I'm at the point in my life we're brevity and speed are becoming more important. Brace indent was a paste error, but good catch and raised my awareness of this detail.
      – greg
      Nov 29 at 0:04






    • 1




      Yes, ++counts[i] will insert a new record for i if it hasn't been inserted yet. That newly inserted record will have its count at 0. The ++ will then increment the current count. count_if just counts the number of elements in a collection that meet the specified criteria, so it basically just counts and returns the number of items for which your predicate returned true.
      – Jerry Coffin
      Nov 29 at 1:39






    • 2




      It would also be quite easy to do with the original version, as it makes a copy. One could then sort copied vector, apply std::unique and get std::distance between begin and returned iterator. It is not really certain which version is better though. I've written this comment with relation to the one above, but it seems to be deleted now.
      – Incomputable
      Nov 29 at 11:37












    • You might want to attach a caveat to the suggestion to use an array, since int has a much wider range of values than char (so we're not in quite the same context as that other question).
      – Toby Speight
      Nov 29 at 11:49






    • 1




      @Incomputable: yes, it was my comment, I had some doubts about its validity after reading the original question again. It might have been justified. / sadly std::unique doesn't do the job if what we need to count is the number of elements appearing at least twice
      – papagaga
      Nov 29 at 11:57
















    If my understanding is correct, count_if iterates over counts and increments p when it finds a unique input that occurred at least twice. ++counts[i] is a very clean map update, I've not seen this before and took me a moment to understand, but both the key and value are updated. Agree on logical comparison feedback. I've gone back and fort with false and !, I've used false more often for readability, but I'm at the point in my life we're brevity and speed are becoming more important. Brace indent was a paste error, but good catch and raised my awareness of this detail.
    – greg
    Nov 29 at 0:04




    If my understanding is correct, count_if iterates over counts and increments p when it finds a unique input that occurred at least twice. ++counts[i] is a very clean map update, I've not seen this before and took me a moment to understand, but both the key and value are updated. Agree on logical comparison feedback. I've gone back and fort with false and !, I've used false more often for readability, but I'm at the point in my life we're brevity and speed are becoming more important. Brace indent was a paste error, but good catch and raised my awareness of this detail.
    – greg
    Nov 29 at 0:04




    1




    1




    Yes, ++counts[i] will insert a new record for i if it hasn't been inserted yet. That newly inserted record will have its count at 0. The ++ will then increment the current count. count_if just counts the number of elements in a collection that meet the specified criteria, so it basically just counts and returns the number of items for which your predicate returned true.
    – Jerry Coffin
    Nov 29 at 1:39




    Yes, ++counts[i] will insert a new record for i if it hasn't been inserted yet. That newly inserted record will have its count at 0. The ++ will then increment the current count. count_if just counts the number of elements in a collection that meet the specified criteria, so it basically just counts and returns the number of items for which your predicate returned true.
    – Jerry Coffin
    Nov 29 at 1:39




    2




    2




    It would also be quite easy to do with the original version, as it makes a copy. One could then sort copied vector, apply std::unique and get std::distance between begin and returned iterator. It is not really certain which version is better though. I've written this comment with relation to the one above, but it seems to be deleted now.
    – Incomputable
    Nov 29 at 11:37






    It would also be quite easy to do with the original version, as it makes a copy. One could then sort copied vector, apply std::unique and get std::distance between begin and returned iterator. It is not really certain which version is better though. I've written this comment with relation to the one above, but it seems to be deleted now.
    – Incomputable
    Nov 29 at 11:37














    You might want to attach a caveat to the suggestion to use an array, since int has a much wider range of values than char (so we're not in quite the same context as that other question).
    – Toby Speight
    Nov 29 at 11:49




    You might want to attach a caveat to the suggestion to use an array, since int has a much wider range of values than char (so we're not in quite the same context as that other question).
    – Toby Speight
    Nov 29 at 11:49




    1




    1




    @Incomputable: yes, it was my comment, I had some doubts about its validity after reading the original question again. It might have been justified. / sadly std::unique doesn't do the job if what we need to count is the number of elements appearing at least twice
    – papagaga
    Nov 29 at 11:57




    @Incomputable: yes, it was my comment, I had some doubts about its validity after reading the original question again. It might have been justified. / sadly std::unique doesn't do the job if what we need to count is the number of elements appearing at least twice
    – papagaga
    Nov 29 at 11:57












    up vote
    5
    down vote













    I don't agree with @JerryCoffin on two accounts: algorithm and paramater passing, the latter being a consequence of the former. That's why I submit this extra review, even if @JerryCoffin's has already been accepted, and even if I agree with the other points he made.



    When you design an algorithm, especially in C++, you want it to be as efficient as possible, in as many situations as possible. It's a good idea to take a look at existing algorithms in the standard library to see how it can be achieved, all the more when there is an algorithm there that is closely related to the one you're designing: std::unique, that removes all but the first of consecutive equivalent elements. What's interesting is 1) that it operates on a sorted range and 2) that it modifies the input sequence: thus it makes it optimal when the input sequence is already sorted, and also when it's disposable. Can we benefit from std::uniques interface in our largely similar problem? I would say so:



    #include <algorithm>

    template <typename Iterator>
    int count_duplicates(Iterator first, Iterator last) {
    // requires a sorted range
    int count = 0;
    while (true) {
    first = std::adjacent_find(first, last);
    if (first == last) return count;
    first = std::adjacent_find(++first, last, std::not_equal_to<>());
    ++count;
    }
    }


    Let's now compare with @JerryCoffin's proposed solution, which allocates memory for a std::map and then has in all cases a complexity of O(n*log(n)) for populating it + O(n) for counting elements with a frequency higher than 1:




    • if the input range is already sorted, this algorithm has O(n) complexity, which is better


    • if the input range is disposable but not sorted, this algorithm has the same complexity (O(n*log(n)) for prior sorting and O(n) for counting), but doesn't allocate memory and has better cache locality


    • if the input is neither sorted nor disposable, we have the same complexity and memory requirements (we need to copy the input range) but we keep the better cache locality



    On the other hand it lacks the possibility of relying on a more efficient structure to count the occurrences of each element, such as an array or a hash table. We could then theoretically go from O(n*log(n)) to O(n) when looking for duplicates. But I'm still unconvinced because those data structures would be oversized if the input range has a small alphabet.



    EDIT: I think I've read the submitted code and the question a bit too fast. If what we need is not only to count elements appearing at least twice, but erasing other elements of the vector, then the solution is different, even if most building blocks remain:



    #include <vector>
    #include <algorithm>
    #include <iostream>

    template <typename Iterator>
    Iterator one_of_duplicates(Iterator first, Iterator last) {
    // requires a sorted input
    auto current = first;
    while (true) {
    // find a duplicated element, move it behind 'first'
    // and find the next different element
    current = std::adjacent_find(current, last);
    if (current == last) return first;
    *first++ = std::move(*current);
    std::cerr << *current << std::endl;
    current = std::adjacent_find(current, last, std::not_equal_to<>());
    }
    }


    int main() {

    std::vector<int> data = { 0, 1, 2, 3, 4, 5, 1, 2, 2, 3, 5, 5, 5 };
    std::sort(data.begin(), data.end());
    data.erase(one_of_duplicates(data.begin(), data.end()), data.end());
    for (auto i : data) std::cout << i << ',';

    }





    share|improve this answer



























      up vote
      5
      down vote













      I don't agree with @JerryCoffin on two accounts: algorithm and paramater passing, the latter being a consequence of the former. That's why I submit this extra review, even if @JerryCoffin's has already been accepted, and even if I agree with the other points he made.



      When you design an algorithm, especially in C++, you want it to be as efficient as possible, in as many situations as possible. It's a good idea to take a look at existing algorithms in the standard library to see how it can be achieved, all the more when there is an algorithm there that is closely related to the one you're designing: std::unique, that removes all but the first of consecutive equivalent elements. What's interesting is 1) that it operates on a sorted range and 2) that it modifies the input sequence: thus it makes it optimal when the input sequence is already sorted, and also when it's disposable. Can we benefit from std::uniques interface in our largely similar problem? I would say so:



      #include <algorithm>

      template <typename Iterator>
      int count_duplicates(Iterator first, Iterator last) {
      // requires a sorted range
      int count = 0;
      while (true) {
      first = std::adjacent_find(first, last);
      if (first == last) return count;
      first = std::adjacent_find(++first, last, std::not_equal_to<>());
      ++count;
      }
      }


      Let's now compare with @JerryCoffin's proposed solution, which allocates memory for a std::map and then has in all cases a complexity of O(n*log(n)) for populating it + O(n) for counting elements with a frequency higher than 1:




      • if the input range is already sorted, this algorithm has O(n) complexity, which is better


      • if the input range is disposable but not sorted, this algorithm has the same complexity (O(n*log(n)) for prior sorting and O(n) for counting), but doesn't allocate memory and has better cache locality


      • if the input is neither sorted nor disposable, we have the same complexity and memory requirements (we need to copy the input range) but we keep the better cache locality



      On the other hand it lacks the possibility of relying on a more efficient structure to count the occurrences of each element, such as an array or a hash table. We could then theoretically go from O(n*log(n)) to O(n) when looking for duplicates. But I'm still unconvinced because those data structures would be oversized if the input range has a small alphabet.



      EDIT: I think I've read the submitted code and the question a bit too fast. If what we need is not only to count elements appearing at least twice, but erasing other elements of the vector, then the solution is different, even if most building blocks remain:



      #include <vector>
      #include <algorithm>
      #include <iostream>

      template <typename Iterator>
      Iterator one_of_duplicates(Iterator first, Iterator last) {
      // requires a sorted input
      auto current = first;
      while (true) {
      // find a duplicated element, move it behind 'first'
      // and find the next different element
      current = std::adjacent_find(current, last);
      if (current == last) return first;
      *first++ = std::move(*current);
      std::cerr << *current << std::endl;
      current = std::adjacent_find(current, last, std::not_equal_to<>());
      }
      }


      int main() {

      std::vector<int> data = { 0, 1, 2, 3, 4, 5, 1, 2, 2, 3, 5, 5, 5 };
      std::sort(data.begin(), data.end());
      data.erase(one_of_duplicates(data.begin(), data.end()), data.end());
      for (auto i : data) std::cout << i << ',';

      }





      share|improve this answer

























        up vote
        5
        down vote










        up vote
        5
        down vote









        I don't agree with @JerryCoffin on two accounts: algorithm and paramater passing, the latter being a consequence of the former. That's why I submit this extra review, even if @JerryCoffin's has already been accepted, and even if I agree with the other points he made.



        When you design an algorithm, especially in C++, you want it to be as efficient as possible, in as many situations as possible. It's a good idea to take a look at existing algorithms in the standard library to see how it can be achieved, all the more when there is an algorithm there that is closely related to the one you're designing: std::unique, that removes all but the first of consecutive equivalent elements. What's interesting is 1) that it operates on a sorted range and 2) that it modifies the input sequence: thus it makes it optimal when the input sequence is already sorted, and also when it's disposable. Can we benefit from std::uniques interface in our largely similar problem? I would say so:



        #include <algorithm>

        template <typename Iterator>
        int count_duplicates(Iterator first, Iterator last) {
        // requires a sorted range
        int count = 0;
        while (true) {
        first = std::adjacent_find(first, last);
        if (first == last) return count;
        first = std::adjacent_find(++first, last, std::not_equal_to<>());
        ++count;
        }
        }


        Let's now compare with @JerryCoffin's proposed solution, which allocates memory for a std::map and then has in all cases a complexity of O(n*log(n)) for populating it + O(n) for counting elements with a frequency higher than 1:




        • if the input range is already sorted, this algorithm has O(n) complexity, which is better


        • if the input range is disposable but not sorted, this algorithm has the same complexity (O(n*log(n)) for prior sorting and O(n) for counting), but doesn't allocate memory and has better cache locality


        • if the input is neither sorted nor disposable, we have the same complexity and memory requirements (we need to copy the input range) but we keep the better cache locality



        On the other hand it lacks the possibility of relying on a more efficient structure to count the occurrences of each element, such as an array or a hash table. We could then theoretically go from O(n*log(n)) to O(n) when looking for duplicates. But I'm still unconvinced because those data structures would be oversized if the input range has a small alphabet.



        EDIT: I think I've read the submitted code and the question a bit too fast. If what we need is not only to count elements appearing at least twice, but erasing other elements of the vector, then the solution is different, even if most building blocks remain:



        #include <vector>
        #include <algorithm>
        #include <iostream>

        template <typename Iterator>
        Iterator one_of_duplicates(Iterator first, Iterator last) {
        // requires a sorted input
        auto current = first;
        while (true) {
        // find a duplicated element, move it behind 'first'
        // and find the next different element
        current = std::adjacent_find(current, last);
        if (current == last) return first;
        *first++ = std::move(*current);
        std::cerr << *current << std::endl;
        current = std::adjacent_find(current, last, std::not_equal_to<>());
        }
        }


        int main() {

        std::vector<int> data = { 0, 1, 2, 3, 4, 5, 1, 2, 2, 3, 5, 5, 5 };
        std::sort(data.begin(), data.end());
        data.erase(one_of_duplicates(data.begin(), data.end()), data.end());
        for (auto i : data) std::cout << i << ',';

        }





        share|improve this answer














        I don't agree with @JerryCoffin on two accounts: algorithm and paramater passing, the latter being a consequence of the former. That's why I submit this extra review, even if @JerryCoffin's has already been accepted, and even if I agree with the other points he made.



        When you design an algorithm, especially in C++, you want it to be as efficient as possible, in as many situations as possible. It's a good idea to take a look at existing algorithms in the standard library to see how it can be achieved, all the more when there is an algorithm there that is closely related to the one you're designing: std::unique, that removes all but the first of consecutive equivalent elements. What's interesting is 1) that it operates on a sorted range and 2) that it modifies the input sequence: thus it makes it optimal when the input sequence is already sorted, and also when it's disposable. Can we benefit from std::uniques interface in our largely similar problem? I would say so:



        #include <algorithm>

        template <typename Iterator>
        int count_duplicates(Iterator first, Iterator last) {
        // requires a sorted range
        int count = 0;
        while (true) {
        first = std::adjacent_find(first, last);
        if (first == last) return count;
        first = std::adjacent_find(++first, last, std::not_equal_to<>());
        ++count;
        }
        }


        Let's now compare with @JerryCoffin's proposed solution, which allocates memory for a std::map and then has in all cases a complexity of O(n*log(n)) for populating it + O(n) for counting elements with a frequency higher than 1:




        • if the input range is already sorted, this algorithm has O(n) complexity, which is better


        • if the input range is disposable but not sorted, this algorithm has the same complexity (O(n*log(n)) for prior sorting and O(n) for counting), but doesn't allocate memory and has better cache locality


        • if the input is neither sorted nor disposable, we have the same complexity and memory requirements (we need to copy the input range) but we keep the better cache locality



        On the other hand it lacks the possibility of relying on a more efficient structure to count the occurrences of each element, such as an array or a hash table. We could then theoretically go from O(n*log(n)) to O(n) when looking for duplicates. But I'm still unconvinced because those data structures would be oversized if the input range has a small alphabet.



        EDIT: I think I've read the submitted code and the question a bit too fast. If what we need is not only to count elements appearing at least twice, but erasing other elements of the vector, then the solution is different, even if most building blocks remain:



        #include <vector>
        #include <algorithm>
        #include <iostream>

        template <typename Iterator>
        Iterator one_of_duplicates(Iterator first, Iterator last) {
        // requires a sorted input
        auto current = first;
        while (true) {
        // find a duplicated element, move it behind 'first'
        // and find the next different element
        current = std::adjacent_find(current, last);
        if (current == last) return first;
        *first++ = std::move(*current);
        std::cerr << *current << std::endl;
        current = std::adjacent_find(current, last, std::not_equal_to<>());
        }
        }


        int main() {

        std::vector<int> data = { 0, 1, 2, 3, 4, 5, 1, 2, 2, 3, 5, 5, 5 };
        std::sort(data.begin(), data.end());
        data.erase(one_of_duplicates(data.begin(), data.end()), data.end());
        for (auto i : data) std::cout << i << ',';

        }






        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited Nov 29 at 15:40

























        answered Nov 29 at 11:26









        papagaga

        3,989221




        3,989221






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Code Review Stack Exchange!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            Use MathJax to format equations. MathJax reference.


            To learn more, see our tips on writing great answers.





            Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


            Please pay close attention to the following guidance:


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f208648%2fdata-structures-for-counting-duplicates-and-using-stdvectorerase%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            flock() on closed filehandle LOCK_FILE at /usr/bin/apt-mirror

            Mangá

            Eduardo VII do Reino Unido