Placing many (10 million) files in one folder
I've just added a predictive search feature (see example below) to my site, which runs on an Ubuntu Server. The search runs directly from a database. I want to cache the result for each search and use that if it exists, else create it.

Would there be any problem with me saving the potential circa 10 million results as separate files in one directory? Or is it advisable to split them down into folders?

Example: files directory

asked Feb 11 '15 at 17:04 by Kohjah Breese, edited Apr 15 '16 at 21:14 by Mocking
Comments:

- It would be better to split. Any command which tries to list the contents of that directory will likely decide to shoot itself. – muru, Feb 11 '15 at 17:12
- So if you already have a database, why not use that? I'm sure the DBMS will be better able to handle millions of records than the filesystem. If you're dead set on using the filesystem, you need to come up with a splitting scheme using some sort of hash; at this point, IMHO, it sounds like using the DB will be less work. – roadmr, Feb 11 '15 at 18:54
- Another option for caching that might fit your model better is memcached or Redis. They are key-value stores (so they act like a single directory, and you access items just by name). Redis is persistent (it won't lose data when it is restarted), whereas memcached is for more temporary items. – Stephen Ostermiller, Feb 11 '15 at 20:45
- There's a chicken-and-egg problem here. Tool developers don't handle directories with large numbers of files because people don't do that. And people don't make directories with large numbers of files because tools don't support it well. For example, I understand that at one time (and I believe this is still true), a feature request to make a generator version of os.listdir in Python was flatly denied for this reason. – Hurkyl, Feb 12 '15 at 7:39
- From my own experience, I've seen breakage when going over 32k files in a single directory on Linux 2.6. It's possible to tune beyond this point, of course, but I wouldn't recommend it. Just split into a few layers of subdirectories and it will be much better. Personally, I'd limit it to around 10,000 per directory, which would give you two layers. – Wolph, Feb 12 '15 at 11:17
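For concreteness, here is a minimal sketch of the hash-based splitting scheme roadmr and Wolph suggest above: hash the search term and use the first characters of the digest as two levels of subdirectories, so no single directory grows too large. This is my own illustration, not code from the thread; Python is assumed, and CACHE_ROOT, cache_path and get_or_create are made-up names.

```python
import hashlib
from pathlib import Path

CACHE_ROOT = Path("/var/cache/search")  # illustrative location only

def cache_path(term: str) -> Path:
    """Map a search term to a file path two directory levels deep.

    Two hex characters per level gives 256 * 256 = 65,536 buckets,
    so ~10 million files average out to roughly 150 files per directory.
    """
    digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return CACHE_ROOT / digest[:2] / digest[2:4] / f"{digest}.json"

def get_or_create(term: str, compute) -> str:
    """Return the cached result for term, computing and storing it if missing."""
    path = cache_path(term)
    if path.exists():
        return path.read_text(encoding="utf-8")      # cache hit
    result = compute(term)                            # e.g. run the database query
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(result, encoding="utf-8")         # cache miss: store it
    return result

if __name__ == "__main__":
    print(get_or_create("ubuntu", lambda t: '{"suggestions": ["ubuntu server"]}'))
```

Because the fan-out comes from a hash rather than from the search terms themselves, the buckets stay roughly balanced no matter how skewed the queries are.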
3 Answers
Would there be any problem with me saving the potential circa 10 million results in separate files in one directory?

Yes. There are probably more reasons, but these are the ones I can post off the top of my head:

- tune2fs has an option called dir_index that tends to be turned on by default (on Ubuntu it is). It lets you store roughly 100k files in a directory before you see a performance hit; that is not even close to the 10m files you are thinking about.
- ext filesystems have a fixed maximum number of inodes, and every file and directory uses one inode. Use df -i for a view of your partitions and the inodes free. When you run out of inodes, you cannot make new files or folders.
- When you use wildcards with commands like rm and ls, the shell expands them onto the command line and you end up with an "Argument list too long" error. You will have to use find to delete or list the files, and find tends to be slow.

Or is it advisable to split them down into folders?

Yes. Most definitely. Basically, you cannot even store 10m files in one directory.

I would use the database. If you want to cache it for a website, have a look at Solr ("providing distributed indexing, replication and load-balanced querying").

answered Feb 11 '15 at 19:32 by Rinzwind, edited Feb 11 '15 at 19:51 by muru
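To illustrate the df -i and find points, here is a small sketch of my own (not part of the answer), assuming Python 3 on Linux: os.statvfs reports inode totals much like df -i, and os.scandir streams directory entries lazily instead of building one huge list the way a shell wildcard expansion or os.listdir would.

```python
import os

def inode_usage(mount_point="/"):
    """Roughly what `df -i` reports: total, used and free inodes for a filesystem."""
    st = os.statvfs(mount_point)
    total, free = st.f_files, st.f_ffree
    return total, total - free, free

def count_entries(directory):
    """Count entries in a directory without loading all names into memory."""
    count = 0
    with os.scandir(directory) as entries:   # lazy iterator, unlike ls * or os.listdir
        for _ in entries:
            count += 1
    return count

if __name__ == "__main__":
    total, used, free = inode_usage("/")
    print(f"inodes on /: {used}/{total} used, {free} free")
    print("entries in /tmp:", count_entries("/tmp"))
```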
I ended up with the same issue. I ran my own benchmarks to find out whether you can place everything in the same folder versus having multiple folders. It appears you can, and it's faster!

Ref: https://medium.com/@hartator/benchmark-deep-directory-structure-vs-flat-directory-structure-to-store-millions-of-files-on-ext4-cac1000ca28

answered Dec 22 '18 at 3:46 by Hartator
- Thanks, this is very useful. I am using this on a site, and it's been a real problem to re-program this part to have /abc/efg.html directory structures. So I will move back to a flat directory in the future. – Kohjah Breese, Dec 22 '18 at 4:01
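For anyone wanting to reproduce this kind of comparison, a rough sketch of such a benchmark follows. It is my own addition, not the linked article's code, and only a sketch: results depend heavily on the filesystem, mount options and page cache, and N here is kept tiny compared to the millions of files in the linked benchmark.

```python
import hashlib
import tempfile
import time
from pathlib import Path

N = 10_000  # keep small for a quick run; the linked benchmark used millions of files

def flat_path(root: Path, key: str) -> Path:
    # every file directly in one directory
    return root / f"{key}.txt"

def nested_path(root: Path, key: str) -> Path:
    # two hash-derived directory levels, as discussed in the comments above
    h = hashlib.md5(key.encode()).hexdigest()
    return root / h[:2] / h[2:4] / f"{key}.txt"

def bench(layout):
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        keys = [f"term{i}" for i in range(N)]
        start = time.perf_counter()
        for k in keys:
            p = layout(root, k)
            p.parent.mkdir(parents=True, exist_ok=True)
            p.write_text("cached result")
        write_s = time.perf_counter() - start
        start = time.perf_counter()
        for k in keys:
            layout(root, k).read_text()
        read_s = time.perf_counter() - start
    return write_s, read_s

for name, layout in [("flat", flat_path), ("nested", nested_path)]:
    w, r = bench(layout)
    print(f"{name:>6}: write {w:.2f}s  read {r:.2f}s")
```

Whether flat or nested wins will vary with the setup; an ext4 filesystem with dir_index handles large flat directories far better than the older systems described in the earlier comments, which is consistent with the linked result.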
A binary search can easily handle millions of records, so searching the single directory would not be a problem; it will do so very fast.

Basically, if you are using a 32-bit system, a binary search over up to 2G (roughly two billion) records is easy and works well.

Berkeley DB, an open-source database library, would readily allow you to store the full result under one entry and has the search built in.

answered Feb 18 '15 at 1:57 by Ashok Chand Mathur
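As an illustration of that key-value idea, here is a minimal sketch using Python's standard-library dbm module rather than Berkeley DB itself (my substitution; Berkeley DB needs its own bindings, but the cache-hit/cache-miss shape of the code is the same). The file name and sample result are made up.

```python
import dbm

# One file on disk holds every cached result, keyed by search term;
# no directory ever sees millions of entries.
with dbm.open("search_cache", "c") as db:              # "c": create the file if missing
    term = "ubuntu"
    key = term.encode("utf-8")
    if key in db:
        result = db[key].decode("utf-8")                # cache hit
    else:
        result = '{"suggestions": ["ubuntu server"]}'   # would come from the database query
        db[key] = result.encode("utf-8")                # cache miss: store it
    print(result)
```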