Placing many (10 million) files in one folder

I've just added a predictive search feature (see example below) to my site, which runs on an Ubuntu server. The search currently runs directly from a database. I want to cache the result for each search and use that if it exists, else create it.



Would there be any problem with me saving the potential circa 10 million results in separate files in one directory? Or is it advisable to split them down into folders?



Example:

[example image]

Tags: files, directory

asked Feb 11 '15 at 17:04 by Kohjah Breese
edited Apr 15 '16 at 21:14 by Mocking








  • It would be better to split. Any command which tries to list the contents of that directory will likely decide to shoot itself.
    – muru, Feb 11 '15 at 17:12

  • So if you already have a database, why not use that? I'm sure the DBMS will be better able to handle millions of records than the filesystem. If you're dead-set on using the filesystem, you need to come up with a splitting scheme using some sort of hash; at that point, IMHO, it sounds like using the DB will be less work.
    – roadmr, Feb 11 '15 at 18:54

  • Another option for caching that would fit your model better might be memcached or redis. They are key-value stores (so they act like a single directory and you access items just by name). Redis is persistent (it won't lose data when it is restarted), whereas memcached is for more temporary items. (A short sketch of this approach follows these comments.)
    – Stephen Ostermiller, Feb 11 '15 at 20:45

  • There's a chicken-and-egg problem here. Tool developers don't handle directories with large numbers of files because people don't do that. And people don't make directories with large numbers of files because tools don't support it well. For example, I understand that at one time (and I believe this is still true) a feature request to make a generator version of os.listdir in Python was flatly denied for this reason.
    – Hurkyl, Feb 12 '15 at 7:39

  • From my own experience I've seen breakage when going over 32k files in a single directory on Linux 2.6. It's possible to tune beyond this point of course, but I wouldn't recommend it. Just split into a few layers of subdirectories and it will be much better. Personally I'd limit it to around 10,000 per directory, which would give you 2 layers.
    – Wolph, Feb 12 '15 at 11:17


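As a rough illustration of the key-value caching that the memcached/redis comment above describes (not code from the discussion), here is a minimal Python sketch assuming the third-party redis-py client and a Redis server on localhost; the key prefix, the TTL, and the run_db_search helper are illustrative assumptions.

```python
# Minimal look-aside cache in Redis: return a cached result if present,
# otherwise run the real query and store it with an expiry.
# Assumes redis-py (pip install redis) and a local Redis server.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def cached_search(term: str, ttl_seconds: int = 86400):
    key = "search:" + term.lower()           # illustrative key scheme
    cached = r.get(key)                      # bytes, or None on a miss
    if cached is not None:
        return json.loads(cached)
    results = run_db_search(term)            # hypothetical database query
    r.setex(key, ttl_seconds, json.dumps(results))
    return results

def run_db_search(term: str):
    return [term + " suggestion 1", term + " suggestion 2"]  # stand-in
```

Because the store is keyed rather than file-based, the question of how many entries can live "in one directory" never comes up, and the TTL gives a simple way to expire stale results.
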
3 Answers

Would there be any problem with me saving the potential circa 10 million results in separate files in one directory?




Yes. There are probably more reasons, but these are the ones I can post off the top of my head:




  • tune2fs has an option called dir_index that tends to be turned on by default (on Ubuntu it is); it lets you store roughly 100k files in a directory before you see a performance hit. That is not even close to the 10m files you are thinking about.


  • ext filesystems have a fixed maximum number of inodes. Every file and directory uses 1 inode. Use df -i to see how many inodes are free on your partitions (a short sketch of doing the same from code follows this list). When you run out of inodes you cannot make new files or folders.


  • commands like rm and ls, when using wildcards, expand the command line and will end up with an "Argument list too long" error. You will have to use find to delete or list files, and find tends to be slow.

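As a side note (not part of the original answer), the inode check and the wildcard-free cleanup mentioned above can also be done from code; a minimal Python sketch, where the cache paths are illustrative assumptions:

```python
# Programmatic equivalent of `df -i`: how many inodes are left on the
# filesystem that would hold the cache files.
import os

st = os.statvfs("/")                     # replace "/" with the cache partition
total_inodes, free_inodes = st.f_files, st.f_ffree
if total_inodes:
    print(f"inodes: {free_inodes} free of {total_inodes} "
          f"({100 * free_inodes / total_inodes:.1f}% free)")
else:
    print("filesystem reports no fixed inode limit")

# To delete huge numbers of files without "Argument list too long", iterate
# the directory entries instead of expanding a shell wildcard:
# for entry in os.scandir("/path/to/cache"):      # hypothetical path
#     if entry.is_file():
#         os.unlink(entry.path)
```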



Or is it advisable to split them down into folders?




Yes. Most definitely. Basically, you cannot even store 10m files in 1 directory.

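To make the splitting concrete, here is a minimal Python sketch (not from the original answer) of the kind of hash-based scheme suggested in the comments: the cache path is derived from a hash of the search term, so the files spread evenly over two levels of subdirectories. CACHE_ROOT and run_db_search are illustrative assumptions.

```python
# Look-aside file cache with a two-level, hash-derived directory layout.
import hashlib
import os

CACHE_ROOT = "/var/cache/mysite/search"          # hypothetical location

def cache_path(term: str) -> str:
    digest = hashlib.md5(term.lower().encode("utf-8")).hexdigest()
    # e.g. ".../d2/9b/d29b....json" -- at most 256 directories per level
    return os.path.join(CACHE_ROOT, digest[:2], digest[2:4], digest + ".json")

def get_results(term: str) -> str:
    path = cache_path(term)
    try:
        with open(path, "r", encoding="utf-8") as f:     # cache hit
            return f.read()
    except FileNotFoundError:                            # cache miss
        results = run_db_search(term)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            f.write(results)
        return results

def run_db_search(term: str) -> str:
    return '["placeholder result for %s"]' % term        # stand-in for the real query
```

Two hex characters per level gives at most 256 × 256 = 65,536 leaf directories, so roughly 10 million cache files works out to only about 150 files per directory.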


I would use the database. If you want to cache it for a website, have a look at "solr" ("providing distributed indexing, replication and load-balanced querying").






answered Feb 11 '15 at 19:32 by Rinzwind
edited Feb 11 '15 at 19:51 by muru

I ended up with the same issue. I ran my own benchmarks to find out whether you can place everything in the same folder vs. having multiple folders. It appears you can, and it's faster!

[Benchmark chart]

Ref: https://medium.com/@hartator/benchmark-deep-directory-structure-vs-flat-directory-structure-to-store-millions-of-files-on-ext4-cac1000ca28






answered Dec 22 '18 at 3:46 by Hartator

  • Thanks, this is very useful. I am using this on a site and it's been a real problem to re-program this part to have /abc/efg.html directory structures. So I will move back to a flat directory in future.
    – Kohjah Breese, Dec 22 '18 at 4:01


A binary search can easily handle millions of records, so searching the single directory would not be a problem. It will do so very fast.



Basically, even if you are using a 32-bit system, a binary search over up to 2G (roughly two billion) records is easy and good.



Berkeley DB, an open-source database library, would readily allow you to store the full result under one entry and would have the search built in.

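As a rough sketch of that idea (not code from the answer), Python's standard dbm module exposes a simple on-disk key-value store with a dict-like interface, similar in spirit to Berkeley DB; the file path and the run_db_search helper are illustrative assumptions.

```python
# Store every cached search result under one key in a single database file,
# instead of one file per result.
import dbm

DB_PATH = "/var/cache/mysite/search.db"              # hypothetical location

def cached_search(term: str) -> bytes:
    with dbm.open(DB_PATH, "c") as db:               # "c": create if missing
        key = term.lower().encode("utf-8")
        cached = db.get(key)
        if cached is not None:                       # cache hit
            return cached
        result = run_db_search(term).encode("utf-8")
        db[key] = result                             # one entry per search term
        return result

def run_db_search(term: str) -> str:
    return '["placeholder result for %s"]' % term    # stand-in for the real query
```

Everything lives in one file, so the many-files-in-one-directory problem does not arise at all.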





answered Feb 18 '15 at 1:57 by Ashok Chand Mathur

