Set up a file indexer and document retrival search engine for a apache-ubuntu public web server?
I am not sure if I am just not searching the right way for a solution but I can't seem to find a straight answer for my project needs, here goes. I apologize if this is a duplicate post.
My company has an Apache web server and we just set up a new section with all of our research documents, they are in HTML and PDF format, we need for our remote clients to be able to input search criteria to find the documents they are searching for on the web server. The document sources are in multiple folders all locally stored on the web server. It is very tedious for our clients to manually parse the directory index to find the documents they require. Not only the file name and meta data are to be indexed but the contents of the files themselves need to be indexed as well. I have done this very easily with Microsoft web servers with their built in indexing and search feature, but performing this operation on Ubuntu with apache2 is proving to be elusive.
How may I set the system up to be able to perform the required document search and retrieval functionality in a style resembling say for example google but just for local content remotely through a web browser?
Thank you for your input!
apache2 search
add a comment |
I am not sure if I am just not searching the right way for a solution but I can't seem to find a straight answer for my project needs, here goes. I apologize if this is a duplicate post.
My company has an Apache web server and we just set up a new section with all of our research documents, they are in HTML and PDF format, we need for our remote clients to be able to input search criteria to find the documents they are searching for on the web server. The document sources are in multiple folders all locally stored on the web server. It is very tedious for our clients to manually parse the directory index to find the documents they require. Not only the file name and meta data are to be indexed but the contents of the files themselves need to be indexed as well. I have done this very easily with Microsoft web servers with their built in indexing and search feature, but performing this operation on Ubuntu with apache2 is proving to be elusive.
How may I set the system up to be able to perform the required document search and retrieval functionality in a style resembling say for example google but just for local content remotely through a web browser?
Thank you for your input!
apache2 search
you could convert that dir into a webdav dir. then make your clients connect using their filemanagers. they can then use the file manager search tool to search for file names. of if they have grep, cat, ls, tree, etc, they could also use those to do the full text search
– ptetteh227
Feb 27 at 17:24
Thank you for your reply but that is really not how were looking to do this. It needs to be user friendly for them. Otherwise what we would have done is create a remotely accessible file share over a VPN and have them use something like 'Recoll' to index it.
– TheMegolith
Feb 27 at 17:30
add a comment |
I am not sure if I am just not searching the right way for a solution but I can't seem to find a straight answer for my project needs, here goes. I apologize if this is a duplicate post.
My company has an Apache web server and we just set up a new section with all of our research documents, they are in HTML and PDF format, we need for our remote clients to be able to input search criteria to find the documents they are searching for on the web server. The document sources are in multiple folders all locally stored on the web server. It is very tedious for our clients to manually parse the directory index to find the documents they require. Not only the file name and meta data are to be indexed but the contents of the files themselves need to be indexed as well. I have done this very easily with Microsoft web servers with their built in indexing and search feature, but performing this operation on Ubuntu with apache2 is proving to be elusive.
How may I set the system up to be able to perform the required document search and retrieval functionality in a style resembling say for example google but just for local content remotely through a web browser?
Thank you for your input!
apache2 search
I am not sure if I am just not searching the right way for a solution but I can't seem to find a straight answer for my project needs, here goes. I apologize if this is a duplicate post.
My company has an Apache web server and we just set up a new section with all of our research documents, they are in HTML and PDF format, we need for our remote clients to be able to input search criteria to find the documents they are searching for on the web server. The document sources are in multiple folders all locally stored on the web server. It is very tedious for our clients to manually parse the directory index to find the documents they require. Not only the file name and meta data are to be indexed but the contents of the files themselves need to be indexed as well. I have done this very easily with Microsoft web servers with their built in indexing and search feature, but performing this operation on Ubuntu with apache2 is proving to be elusive.
How may I set the system up to be able to perform the required document search and retrieval functionality in a style resembling say for example google but just for local content remotely through a web browser?
Thank you for your input!
apache2 search
apache2 search
asked Feb 27 at 16:57
TheMegolithTheMegolith
5461511
5461511
you could convert that dir into a webdav dir. then make your clients connect using their filemanagers. they can then use the file manager search tool to search for file names. of if they have grep, cat, ls, tree, etc, they could also use those to do the full text search
– ptetteh227
Feb 27 at 17:24
Thank you for your reply but that is really not how were looking to do this. It needs to be user friendly for them. Otherwise what we would have done is create a remotely accessible file share over a VPN and have them use something like 'Recoll' to index it.
– TheMegolith
Feb 27 at 17:30
add a comment |
you could convert that dir into a webdav dir. then make your clients connect using their filemanagers. they can then use the file manager search tool to search for file names. of if they have grep, cat, ls, tree, etc, they could also use those to do the full text search
– ptetteh227
Feb 27 at 17:24
Thank you for your reply but that is really not how were looking to do this. It needs to be user friendly for them. Otherwise what we would have done is create a remotely accessible file share over a VPN and have them use something like 'Recoll' to index it.
– TheMegolith
Feb 27 at 17:30
you could convert that dir into a webdav dir. then make your clients connect using their filemanagers. they can then use the file manager search tool to search for file names. of if they have grep, cat, ls, tree, etc, they could also use those to do the full text search
– ptetteh227
Feb 27 at 17:24
you could convert that dir into a webdav dir. then make your clients connect using their filemanagers. they can then use the file manager search tool to search for file names. of if they have grep, cat, ls, tree, etc, they could also use those to do the full text search
– ptetteh227
Feb 27 at 17:24
Thank you for your reply but that is really not how were looking to do this. It needs to be user friendly for them. Otherwise what we would have done is create a remotely accessible file share over a VPN and have them use something like 'Recoll' to index it.
– TheMegolith
Feb 27 at 17:30
Thank you for your reply but that is really not how were looking to do this. It needs to be user friendly for them. Otherwise what we would have done is create a remotely accessible file share over a VPN and have them use something like 'Recoll' to index it.
– TheMegolith
Feb 27 at 17:30
add a comment |
1 Answer
1
active
oldest
votes
Well, a buddy of mine saw my post and and texted that he saw that recoll actually has a web interface and I should look into it. They do and it works and setup isn't too cumbersome. I should note that it is in python and very customizable. This is the procedure for Ubuntu users though it works for just about anything, just follow the instructions on the authers page, links to the source material and instructions for other platforms are at the end of this document. I should note that his documentation is sub par and you may have to piece together a final solution like I did:
First install the repo and software;
sudo add-apt-repository ppa:recoll-backports/recoll-1.15-on
sudo apt-get update
sudo apt-get install -y recoll python-recoll
Install mod-wsgi
sudo apt-get install -y libapache2-mod-wsgi
I strongly recommend you already have apache2 already set up or you can get fully qualified domain name and ip address errors. setting the servername to the local ip address of the server should fix this.
Get the github repository for the recoll webui:
https://github.com/koniu/recoll-webui
Simply click the 'clone or download button' to download the archive.
Extract it to your /var/www directory
It should create the folder 'recoll-webui-master'
Double check the it didn't double down on the directories:
Go to /var/www/recoll-webui-master and make sure the files are there and not further in a sub-directory or you will get errors.
Next edit the file;
/etc/apache2/mods-enabled/wsgi.conf
add the following at the end of the "IfModule" section but not after .
WSGIDaemonProcess recoll user=dockes group=dockes
threads=1 processes=5 display-name=%{GROUP}
python-path=/var/www/recoll-webui-master
WSGIScriptAlias /recoll /var/www/recoll-webui-master/webui-wsgi.py
WSGIProcessGroup recoll
Order allow,deny
allow from all
I dont know if the formatting change posting here effects the functionality, if it does refer to the authors documentation for the original formatting.
Change the user and group (dockes in the example) taking care that he is the one who owns the index (.recoll is in his home directory).
make sure ~/.recoll has the owner name and read write permissions of the account being used on the server with read only permissions for everyone else or you will get error 500 internal server error. Do not use 'root'!
Note
the Recoll WebUI application is mostly single-threaded, so it is of little use (and may actually be counter-productive in some cases) to specify multiple threads on the WSGIDaemonProcess line. Specify multiple processes instead to put multiple CPUs to work on simultaneous requests.
Then run the following to restart Apache:
sudo apachectl restart
Note
Take care that you need a / at the end of the URL used to access the search (use: http://my.server.com/recoll/, not http://my.server.com/recoll), else files other than the script itself are not found (the page looks weird and the search does not work).
Once you have this all set up you need to run recoll and index the desired folders which apparently can be any folder in your system so be careful not to index folders you do not want exposed.
Also to view the files over a network your going to have to make a setting change.
Sources:
https://www.lesbonscomptes.com/recoll/download.html
https://www.lesbonscomptes.com/recoll/pages/recoll-webui-install-wsgi.html
https://github.com/koniu/recoll-webui
I hope this helps! Its not 100% what I need but its close and will work fine till I get the time to modify the code to suit the modest changes I want.
add a comment |
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "89"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2faskubuntu.com%2fquestions%2f1121756%2fset-up-a-file-indexer-and-document-retrival-search-engine-for-a-apache-ubuntu-pu%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Well, a buddy of mine saw my post and and texted that he saw that recoll actually has a web interface and I should look into it. They do and it works and setup isn't too cumbersome. I should note that it is in python and very customizable. This is the procedure for Ubuntu users though it works for just about anything, just follow the instructions on the authers page, links to the source material and instructions for other platforms are at the end of this document. I should note that his documentation is sub par and you may have to piece together a final solution like I did:
First install the repo and software;
sudo add-apt-repository ppa:recoll-backports/recoll-1.15-on
sudo apt-get update
sudo apt-get install -y recoll python-recoll
Install mod-wsgi
sudo apt-get install -y libapache2-mod-wsgi
I strongly recommend you already have apache2 already set up or you can get fully qualified domain name and ip address errors. setting the servername to the local ip address of the server should fix this.
Get the github repository for the recoll webui:
https://github.com/koniu/recoll-webui
Simply click the 'clone or download button' to download the archive.
Extract it to your /var/www directory
It should create the folder 'recoll-webui-master'
Double check the it didn't double down on the directories:
Go to /var/www/recoll-webui-master and make sure the files are there and not further in a sub-directory or you will get errors.
Next edit the file;
/etc/apache2/mods-enabled/wsgi.conf
add the following at the end of the "IfModule" section but not after .
WSGIDaemonProcess recoll user=dockes group=dockes
threads=1 processes=5 display-name=%{GROUP}
python-path=/var/www/recoll-webui-master
WSGIScriptAlias /recoll /var/www/recoll-webui-master/webui-wsgi.py
WSGIProcessGroup recoll
Order allow,deny
allow from all
I dont know if the formatting change posting here effects the functionality, if it does refer to the authors documentation for the original formatting.
Change the user and group (dockes in the example) taking care that he is the one who owns the index (.recoll is in his home directory).
make sure ~/.recoll has the owner name and read write permissions of the account being used on the server with read only permissions for everyone else or you will get error 500 internal server error. Do not use 'root'!
Note
the Recoll WebUI application is mostly single-threaded, so it is of little use (and may actually be counter-productive in some cases) to specify multiple threads on the WSGIDaemonProcess line. Specify multiple processes instead to put multiple CPUs to work on simultaneous requests.
Then run the following to restart Apache:
sudo apachectl restart
Note
Take care that you need a / at the end of the URL used to access the search (use: http://my.server.com/recoll/, not http://my.server.com/recoll), else files other than the script itself are not found (the page looks weird and the search does not work).
Once you have this all set up you need to run recoll and index the desired folders which apparently can be any folder in your system so be careful not to index folders you do not want exposed.
Also to view the files over a network your going to have to make a setting change.
Sources:
https://www.lesbonscomptes.com/recoll/download.html
https://www.lesbonscomptes.com/recoll/pages/recoll-webui-install-wsgi.html
https://github.com/koniu/recoll-webui
I hope this helps! Its not 100% what I need but its close and will work fine till I get the time to modify the code to suit the modest changes I want.
add a comment |
Well, a buddy of mine saw my post and and texted that he saw that recoll actually has a web interface and I should look into it. They do and it works and setup isn't too cumbersome. I should note that it is in python and very customizable. This is the procedure for Ubuntu users though it works for just about anything, just follow the instructions on the authers page, links to the source material and instructions for other platforms are at the end of this document. I should note that his documentation is sub par and you may have to piece together a final solution like I did:
First install the repo and software;
sudo add-apt-repository ppa:recoll-backports/recoll-1.15-on
sudo apt-get update
sudo apt-get install -y recoll python-recoll
Install mod-wsgi
sudo apt-get install -y libapache2-mod-wsgi
I strongly recommend you already have apache2 already set up or you can get fully qualified domain name and ip address errors. setting the servername to the local ip address of the server should fix this.
Get the github repository for the recoll webui:
https://github.com/koniu/recoll-webui
Simply click the 'clone or download button' to download the archive.
Extract it to your /var/www directory
It should create the folder 'recoll-webui-master'
Double check the it didn't double down on the directories:
Go to /var/www/recoll-webui-master and make sure the files are there and not further in a sub-directory or you will get errors.
Next edit the file;
/etc/apache2/mods-enabled/wsgi.conf
add the following at the end of the "IfModule" section but not after .
WSGIDaemonProcess recoll user=dockes group=dockes
threads=1 processes=5 display-name=%{GROUP}
python-path=/var/www/recoll-webui-master
WSGIScriptAlias /recoll /var/www/recoll-webui-master/webui-wsgi.py
WSGIProcessGroup recoll
Order allow,deny
allow from all
I dont know if the formatting change posting here effects the functionality, if it does refer to the authors documentation for the original formatting.
Change the user and group (dockes in the example) taking care that he is the one who owns the index (.recoll is in his home directory).
make sure ~/.recoll has the owner name and read write permissions of the account being used on the server with read only permissions for everyone else or you will get error 500 internal server error. Do not use 'root'!
Note
the Recoll WebUI application is mostly single-threaded, so it is of little use (and may actually be counter-productive in some cases) to specify multiple threads on the WSGIDaemonProcess line. Specify multiple processes instead to put multiple CPUs to work on simultaneous requests.
Then run the following to restart Apache:
sudo apachectl restart
Note
Take care that you need a / at the end of the URL used to access the search (use: http://my.server.com/recoll/, not http://my.server.com/recoll), else files other than the script itself are not found (the page looks weird and the search does not work).
Once you have this all set up you need to run recoll and index the desired folders which apparently can be any folder in your system so be careful not to index folders you do not want exposed.
Also to view the files over a network your going to have to make a setting change.
Sources:
https://www.lesbonscomptes.com/recoll/download.html
https://www.lesbonscomptes.com/recoll/pages/recoll-webui-install-wsgi.html
https://github.com/koniu/recoll-webui
I hope this helps! Its not 100% what I need but its close and will work fine till I get the time to modify the code to suit the modest changes I want.
add a comment |
Well, a buddy of mine saw my post and and texted that he saw that recoll actually has a web interface and I should look into it. They do and it works and setup isn't too cumbersome. I should note that it is in python and very customizable. This is the procedure for Ubuntu users though it works for just about anything, just follow the instructions on the authers page, links to the source material and instructions for other platforms are at the end of this document. I should note that his documentation is sub par and you may have to piece together a final solution like I did:
First install the repo and software;
sudo add-apt-repository ppa:recoll-backports/recoll-1.15-on
sudo apt-get update
sudo apt-get install -y recoll python-recoll
Install mod-wsgi
sudo apt-get install -y libapache2-mod-wsgi
I strongly recommend you already have apache2 already set up or you can get fully qualified domain name and ip address errors. setting the servername to the local ip address of the server should fix this.
Get the github repository for the recoll webui:
https://github.com/koniu/recoll-webui
Simply click the 'clone or download button' to download the archive.
Extract it to your /var/www directory
It should create the folder 'recoll-webui-master'
Double check the it didn't double down on the directories:
Go to /var/www/recoll-webui-master and make sure the files are there and not further in a sub-directory or you will get errors.
Next edit the file;
/etc/apache2/mods-enabled/wsgi.conf
add the following at the end of the "IfModule" section but not after .
WSGIDaemonProcess recoll user=dockes group=dockes
threads=1 processes=5 display-name=%{GROUP}
python-path=/var/www/recoll-webui-master
WSGIScriptAlias /recoll /var/www/recoll-webui-master/webui-wsgi.py
WSGIProcessGroup recoll
Order allow,deny
allow from all
I dont know if the formatting change posting here effects the functionality, if it does refer to the authors documentation for the original formatting.
Change the user and group (dockes in the example) taking care that he is the one who owns the index (.recoll is in his home directory).
make sure ~/.recoll has the owner name and read write permissions of the account being used on the server with read only permissions for everyone else or you will get error 500 internal server error. Do not use 'root'!
Note
the Recoll WebUI application is mostly single-threaded, so it is of little use (and may actually be counter-productive in some cases) to specify multiple threads on the WSGIDaemonProcess line. Specify multiple processes instead to put multiple CPUs to work on simultaneous requests.
Then run the following to restart Apache:
sudo apachectl restart
Note
Take care that you need a / at the end of the URL used to access the search (use: http://my.server.com/recoll/, not http://my.server.com/recoll), else files other than the script itself are not found (the page looks weird and the search does not work).
Once you have this all set up you need to run recoll and index the desired folders which apparently can be any folder in your system so be careful not to index folders you do not want exposed.
Also to view the files over a network your going to have to make a setting change.
Sources:
https://www.lesbonscomptes.com/recoll/download.html
https://www.lesbonscomptes.com/recoll/pages/recoll-webui-install-wsgi.html
https://github.com/koniu/recoll-webui
I hope this helps! Its not 100% what I need but its close and will work fine till I get the time to modify the code to suit the modest changes I want.
Well, a buddy of mine saw my post and and texted that he saw that recoll actually has a web interface and I should look into it. They do and it works and setup isn't too cumbersome. I should note that it is in python and very customizable. This is the procedure for Ubuntu users though it works for just about anything, just follow the instructions on the authers page, links to the source material and instructions for other platforms are at the end of this document. I should note that his documentation is sub par and you may have to piece together a final solution like I did:
First install the repo and software;
sudo add-apt-repository ppa:recoll-backports/recoll-1.15-on
sudo apt-get update
sudo apt-get install -y recoll python-recoll
Install mod-wsgi
sudo apt-get install -y libapache2-mod-wsgi
I strongly recommend you already have apache2 already set up or you can get fully qualified domain name and ip address errors. setting the servername to the local ip address of the server should fix this.
Get the github repository for the recoll webui:
https://github.com/koniu/recoll-webui
Simply click the 'clone or download button' to download the archive.
Extract it to your /var/www directory
It should create the folder 'recoll-webui-master'
Double check the it didn't double down on the directories:
Go to /var/www/recoll-webui-master and make sure the files are there and not further in a sub-directory or you will get errors.
Next edit the file;
/etc/apache2/mods-enabled/wsgi.conf
add the following at the end of the "IfModule" section but not after .
WSGIDaemonProcess recoll user=dockes group=dockes
threads=1 processes=5 display-name=%{GROUP}
python-path=/var/www/recoll-webui-master
WSGIScriptAlias /recoll /var/www/recoll-webui-master/webui-wsgi.py
WSGIProcessGroup recoll
Order allow,deny
allow from all
I dont know if the formatting change posting here effects the functionality, if it does refer to the authors documentation for the original formatting.
Change the user and group (dockes in the example) taking care that he is the one who owns the index (.recoll is in his home directory).
make sure ~/.recoll has the owner name and read write permissions of the account being used on the server with read only permissions for everyone else or you will get error 500 internal server error. Do not use 'root'!
Note
the Recoll WebUI application is mostly single-threaded, so it is of little use (and may actually be counter-productive in some cases) to specify multiple threads on the WSGIDaemonProcess line. Specify multiple processes instead to put multiple CPUs to work on simultaneous requests.
Then run the following to restart Apache:
sudo apachectl restart
Note
Take care that you need a / at the end of the URL used to access the search (use: http://my.server.com/recoll/, not http://my.server.com/recoll), else files other than the script itself are not found (the page looks weird and the search does not work).
Once you have this all set up you need to run recoll and index the desired folders which apparently can be any folder in your system so be careful not to index folders you do not want exposed.
Also to view the files over a network your going to have to make a setting change.
Sources:
https://www.lesbonscomptes.com/recoll/download.html
https://www.lesbonscomptes.com/recoll/pages/recoll-webui-install-wsgi.html
https://github.com/koniu/recoll-webui
I hope this helps! Its not 100% what I need but its close and will work fine till I get the time to modify the code to suit the modest changes I want.
edited Feb 28 at 1:13
answered Feb 27 at 21:42
TheMegolithTheMegolith
5461511
5461511
add a comment |
add a comment |
Thanks for contributing an answer to Ask Ubuntu!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2faskubuntu.com%2fquestions%2f1121756%2fset-up-a-file-indexer-and-document-retrival-search-engine-for-a-apache-ubuntu-pu%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
you could convert that dir into a webdav dir. then make your clients connect using their filemanagers. they can then use the file manager search tool to search for file names. of if they have grep, cat, ls, tree, etc, they could also use those to do the full text search
– ptetteh227
Feb 27 at 17:24
Thank you for your reply but that is really not how were looking to do this. It needs to be user friendly for them. Otherwise what we would have done is create a remotely accessible file share over a VPN and have them use something like 'Recoll' to index it.
– TheMegolith
Feb 27 at 17:30