Placing many (10 million) files in one folder












Asked Feb 11 '15 at 17:04 by Kohjah Breese · edited Apr 15 '16 at 21:14 by Mocking · 14 votes














I've just added a predictive search (see example below) feature to my site, which runs on an Ubuntu server. The search currently runs directly from a database. I want to cache the result for each search and use that if it exists, otherwise create it.



Would there be any problem with me saving the potential circa 10 million results in separate files in one directory? Or is it advisable to split them down into folders?



Example: [not preserved in this copy]
































Comments:

  • It would be better to split. Any command which tries to list the contents of that directory will likely decide to shoot itself. – muru, Feb 11 '15 at 17:12

  • So if you already have a database, why not use that? I'm sure the DBMS will be better able to handle millions of records than the filesystem. If you're dead-set on using the filesystem you need to come up with a splitting scheme using some sort of hash; at this point, IMHO, it sounds like using the DB will be less work. – roadmr, Feb 11 '15 at 18:54

  • Another option for caching that would fit your model better might be memcached or redis. They are key-value stores (so they act like a single directory and you access items just by name). Redis is persistent (it won't lose data when it is restarted), whereas memcached is for more temporary items. (A quick sketch of this approach follows these comments.) – Stephen Ostermiller, Feb 11 '15 at 20:45

  • There's a chicken-and-egg problem here. Tool developers don't handle directories with large numbers of files because people don't do that, and people don't make directories with large numbers of files because tools don't support it well. For example, I understand that at one time (and I believe this is still true) a feature request to make a generator version of os.listdir in Python was flatly denied for this reason. – Hurkyl, Feb 12 '15 at 7:39

  • From my own experience I've seen breakage when going over 32k files in a single directory on Linux 2.6. It's possible to tune beyond this point, of course, but I wouldn't recommend it. Just split into a few layers of subdirectories and it will be much better. Personally I'd limit it to around 10,000 per directory, which would give you two layers. – Wolph, Feb 12 '15 at 11:17
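As a rough illustration of the Redis option mentioned in the comments (not part of the original post): the key name, cached markup and one-day TTL below are made-up examples.

    # store a rendered result under a key derived from the search term
    redis-cli SET "search:london" "<ul><li>London</li><li>Londonderry</li></ul>"
    # optionally let cached entries age out after a day
    redis-cli EXPIRE "search:london" 86400
    # a later lookup returns the cached markup, or (nil) on a cache miss
    redis-cli GET "search:london"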
















3 Answers


















Answer by Rinzwind (24 votes; answered Feb 11 '15 at 19:32, edited Feb 11 '15 at 19:51 by muru):















Would there be any problem with me saving the potential circa 10 million results in separate files in one directory?




Yes. There are probably more reasons, but these are the ones I can post off the top of my head:

  • tune2fs has an option called dir_index that tends to be turned on by default (on Ubuntu it is) and that lets you store roughly 100k files in a directory before you see a performance hit. That is not even close to the 10m files you are thinking about.

  • ext filesystems have a fixed maximum number of inodes, and every file and directory uses one inode. Use df -i for a view of your partitions and the inodes free. When you run out of inodes you cannot make new files or folders.

  • Commands like rm and ls, when using wildcards, expand the command line and will end up with an "argument list too long" error. You will have to use find to delete or list files, and find tends to be slow. (The commands sketched after this list illustrate these points.)
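As a quick illustration of how to check these limits (not from the original answer; the device name and cache path are placeholders):

    # is the dir_index feature enabled on the filesystem holding the cache?
    sudo tune2fs -l /dev/sda1 | grep dir_index
    # how many inodes are in use / still free on each mounted filesystem?
    df -i
    # removing a huge cache: `rm /var/cache/searches/*` can fail with
    # "argument list too long", while find streams the names one by one
    find /var/cache/searches -type f -name '*.html' -delete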




Or is it advisable to split them down into folders?




Yes. Most definitely. Basically you cannot even store 10m files in 1 directory.

I would use the database. If you want to cache it for a website, have a look at Solr ("providing distributed indexing, replication and load-balanced querying").
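If a file-based cache is kept anyway, here is a minimal sketch of the two-level split suggested in the comments: hash the search term and use the first characters of the hash as subdirectory names. The cache root, the .html extension and the generate_result helper are hypothetical, purely for illustration.

    #!/bin/bash
    # layout: cache/<first 2 hash chars>/<next 2 hash chars>/<full hash>.html
    term="$1"                                      # the search string
    hash=$(printf '%s' "$term" | md5sum | cut -d' ' -f1)
    dir="cache/${hash:0:2}/${hash:2:2}"
    file="$dir/$hash.html"

    if [ -f "$file" ]; then
        cat "$file"                                # cache hit: serve the stored result
    else
        mkdir -p "$dir"
        generate_result "$term" > "$file"          # cache miss: hypothetical generator
        cat "$file"
    fi

With two hex characters per level there are 65,536 leaf directories, so 10 million cache entries works out to roughly 150 files per directory, comfortably below the limits discussed above.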





































Answer by Hartator (1 vote; answered Dec 22 '18 at 3:46):














I ended up with the same issue and ran my own benchmarks to find out whether you can place everything in the same folder versus having multiple folders. It appears you can, and it's faster!

[Benchmark chart omitted]

Ref: https://medium.com/@hartator/benchmark-deep-directory-structure-vs-flat-directory-structure-to-store-millions-of-files-on-ext4-cac1000ca28























  • Thanks, this is very useful. I am using this on a site and it's been a real problem to re-program this part to have /abc/efg.html directory structures. So I will move back to a flat directory in future. – Kohjah Breese, Dec 22 '18 at 4:01



















Answer by Ashok Chand Mathur (0 votes; answered Feb 18 '15 at 1:57):














A binary search can easily handle millions of records, so searching the single directory would not be a problem, and it will do so very fast.

Basically, if you are using a 32-bit system, a binary search over up to 2G (roughly two billion) records is easy and works well.

Berkeley DB, an open-source piece of software, would readily allow you to store the full result under one entry and would have the search built in.


























