Set up a file indexer and document retrival search engine for a apache-ubuntu public web server?

I am not sure if I am just not searching the right way for a solution but I can't seem to find a straight answer for my project needs, here goes. I apologize if this is a duplicate post.

My company has an Apache web server and we just set up a new section with all of our research documents, they are in HTML and PDF format, we need for our remote clients to be able to input search criteria to find the documents they are searching for on the web server. The document sources are in multiple folders all locally stored on the web server. It is very tedious for our clients to manually parse the directory index to find the documents they require. Not only the file name and meta data are to be indexed but the contents of the files themselves need to be indexed as well. I have done this very easily with Microsoft web servers with their built in indexing and search feature, but performing this operation on Ubuntu with apache2 is proving to be elusive.

How may I set the system up to be able to perform the required document search and retrieval functionality in a style resembling say for example google but just for local content remotely through a web browser?

Thank you for your input!

asked Feb 27 at 16:57

TheMegolith

5461511

you could convert that dir into a webdav dir. then make your clients connect using their filemanagers. they can then use the file manager search tool to search for file names. of if they have grep, cat, ls, tree, etc, they could also use those to do the full text search

– ptetteh227
Feb 27 at 17:24

Thank you for your reply but that is really not how were looking to do this. It needs to be user friendly for them. Otherwise what we would have done is create a remotely accessible file share over a VPN and have them use something like 'Recoll' to index it.

– TheMegolith
Feb 27 at 17:30

add a comment |

I am not sure if I am just not searching the right way for a solution but I can't seem to find a straight answer for my project needs, here goes. I apologize if this is a duplicate post.

Thank you for your input!

asked Feb 27 at 16:57

TheMegolith

5461511

you could convert that dir into a webdav dir. then make your clients connect using their filemanagers. they can then use the file manager search tool to search for file names. of if they have grep, cat, ls, tree, etc, they could also use those to do the full text search

– ptetteh227
Feb 27 at 17:24

Thank you for your reply but that is really not how were looking to do this. It needs to be user friendly for them. Otherwise what we would have done is create a remotely accessible file share over a VPN and have them use something like 'Recoll' to index it.

– TheMegolith
Feb 27 at 17:30

add a comment |

I am not sure if I am just not searching the right way for a solution but I can't seem to find a straight answer for my project needs, here goes. I apologize if this is a duplicate post.

Thank you for your input!

asked Feb 27 at 16:57

TheMegolith

5461511

I am not sure if I am just not searching the right way for a solution but I can't seem to find a straight answer for my project needs, here goes. I apologize if this is a duplicate post.

Thank you for your input!

apache2 search

asked Feb 27 at 16:57

TheMegolith

5461511

asked Feb 27 at 16:57

TheMegolith

5461511

asked Feb 27 at 16:57

TheMegolith

5461511

asked Feb 27 at 16:57

TheMegolith

5461511

asked Feb 27 at 16:57

TheMegolith

5461511

you could convert that dir into a webdav dir. then make your clients connect using their filemanagers. they can then use the file manager search tool to search for file names. of if they have grep, cat, ls, tree, etc, they could also use those to do the full text search

– ptetteh227
Feb 27 at 17:24

Thank you for your reply but that is really not how were looking to do this. It needs to be user friendly for them. Otherwise what we would have done is create a remotely accessible file share over a VPN and have them use something like 'Recoll' to index it.

– TheMegolith
Feb 27 at 17:30

add a comment |

you could convert that dir into a webdav dir. then make your clients connect using their filemanagers. they can then use the file manager search tool to search for file names. of if they have grep, cat, ls, tree, etc, they could also use those to do the full text search

– ptetteh227
Feb 27 at 17:24

Thank you for your reply but that is really not how were looking to do this. It needs to be user friendly for them. Otherwise what we would have done is create a remotely accessible file share over a VPN and have them use something like 'Recoll' to index it.

– TheMegolith
Feb 27 at 17:30

you could convert that dir into a webdav dir. then make your clients connect using their filemanagers. they can then use the file manager search tool to search for file names. of if they have grep, cat, ls, tree, etc, they could also use those to do the full text search

– ptetteh227
Feb 27 at 17:24

Thank you for your reply but that is really not how were looking to do this. It needs to be user friendly for them. Otherwise what we would have done is create a remotely accessible file share over a VPN and have them use something like 'Recoll' to index it.

– TheMegolith
Feb 27 at 17:30

add a comment |

1 Answer
1

active

oldest

votes

Well, a buddy of mine saw my post and and texted that he saw that recoll actually has a web interface and I should look into it. They do and it works and setup isn't too cumbersome. I should note that it is in python and very customizable. This is the procedure for Ubuntu users though it works for just about anything, just follow the instructions on the authers page, links to the source material and instructions for other platforms are at the end of this document. I should note that his documentation is sub par and you may have to piece together a final solution like I did:

First install the repo and software;

sudo add-apt-repository ppa:recoll-backports/recoll-1.15-on

sudo apt-get update 

sudo apt-get install -y recoll python-recoll

Install mod-wsgi

sudo apt-get install -y libapache2-mod-wsgi

I strongly recommend you already have apache2 already set up or you can get fully qualified domain name and ip address errors. setting the servername to the local ip address of the server should fix this.

Get the github repository for the recoll webui:

https://github.com/koniu/recoll-webui

Simply click the 'clone or download button' to download the archive.
Extract it to your /var/www directory
It should create the folder 'recoll-webui-master'

Double check the it didn't double down on the directories:

Go to /var/www/recoll-webui-master and make sure the files are there and not further in a sub-directory or you will get errors.

Next edit the file;

/etc/apache2/mods-enabled/wsgi.conf

add the following at the end of the "IfModule" section but not after .

WSGIDaemonProcess recoll user=dockes group=dockes
threads=1 processes=5 display-name=%{GROUP}
python-path=/var/www/recoll-webui-master
WSGIScriptAlias /recoll /var/www/recoll-webui-master/webui-wsgi.py

WSGIProcessGroup recoll
Order allow,deny
allow from all

I dont know if the formatting change posting here effects the functionality, if it does refer to the authors documentation for the original formatting.

Change the user and group (dockes in the example) taking care that he is the one who owns the index (.recoll is in his home directory).

make sure ~/.recoll has the owner name and read write permissions of the account being used on the server with read only permissions for everyone else or you will get error 500 internal server error. Do not use 'root'!

Note
the Recoll WebUI application is mostly single-threaded, so it is of little use (and may actually be counter-productive in some cases) to specify multiple threads on the WSGIDaemonProcess line. Specify multiple processes instead to put multiple CPUs to work on simultaneous requests.

Then run the following to restart Apache:

sudo apachectl restart

Note
Take care that you need a / at the end of the URL used to access the search (use: http://my.server.com/recoll/, not http://my.server.com/recoll), else files other than the script itself are not found (the page looks weird and the search does not work).

Once you have this all set up you need to run recoll and index the desired folders which apparently can be any folder in your system so be careful not to index folders you do not want exposed.

Also to view the files over a network your going to have to make a setting change.

enter image description here

Sources:

https://www.lesbonscomptes.com/recoll/download.html

https://www.lesbonscomptes.com/recoll/pages/recoll-webui-install-wsgi.html

https://github.com/koniu/recoll-webui

I hope this helps! Its not 100% what I need but its close and will work fine till I get the time to modify the code to suit the modest changes I want.

edited Feb 28 at 1:13

answered Feb 27 at 21:42

TheMegolith

5461511

add a comment |

Your Answer

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "89"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2faskubuntu.com%2fquestions%2f1121756%2fset-up-a-file-indexer-and-document-retrival-search-engine-for-a-apache-ubuntu-pu%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

First install the repo and software;

sudo add-apt-repository ppa:recoll-backports/recoll-1.15-on

sudo apt-get update 

sudo apt-get install -y recoll python-recoll

Install mod-wsgi

sudo apt-get install -y libapache2-mod-wsgi

Get the github repository for the recoll webui:

https://github.com/koniu/recoll-webui

Simply click the 'clone or download button' to download the archive.
Extract it to your /var/www directory
It should create the folder 'recoll-webui-master'

Double check the it didn't double down on the directories:

Go to /var/www/recoll-webui-master and make sure the files are there and not further in a sub-directory or you will get errors.

Next edit the file;

/etc/apache2/mods-enabled/wsgi.conf

add the following at the end of the "IfModule" section but not after .

WSGIDaemonProcess recoll user=dockes group=dockes
threads=1 processes=5 display-name=%{GROUP}
python-path=/var/www/recoll-webui-master
WSGIScriptAlias /recoll /var/www/recoll-webui-master/webui-wsgi.py

WSGIProcessGroup recoll
Order allow,deny
allow from all

I dont know if the formatting change posting here effects the functionality, if it does refer to the authors documentation for the original formatting.

Change the user and group (dockes in the example) taking care that he is the one who owns the index (.recoll is in his home directory).

Note
the Recoll WebUI application is mostly single-threaded, so it is of little use (and may actually be counter-productive in some cases) to specify multiple threads on the WSGIDaemonProcess line. Specify multiple processes instead to put multiple CPUs to work on simultaneous requests.

Then run the following to restart Apache:

sudo apachectl restart

Note
Take care that you need a / at the end of the URL used to access the search (use: http://my.server.com/recoll/, not http://my.server.com/recoll), else files other than the script itself are not found (the page looks weird and the search does not work).

Once you have this all set up you need to run recoll and index the desired folders which apparently can be any folder in your system so be careful not to index folders you do not want exposed.

Also to view the files over a network your going to have to make a setting change.

enter image description here

Sources:

https://www.lesbonscomptes.com/recoll/download.html

https://www.lesbonscomptes.com/recoll/pages/recoll-webui-install-wsgi.html

https://github.com/koniu/recoll-webui

I hope this helps! Its not 100% what I need but its close and will work fine till I get the time to modify the code to suit the modest changes I want.

edited Feb 28 at 1:13

answered Feb 27 at 21:42

TheMegolith

5461511

add a comment |

First install the repo and software;

sudo add-apt-repository ppa:recoll-backports/recoll-1.15-on

sudo apt-get update 

sudo apt-get install -y recoll python-recoll

Install mod-wsgi

sudo apt-get install -y libapache2-mod-wsgi

Get the github repository for the recoll webui:

https://github.com/koniu/recoll-webui

Simply click the 'clone or download button' to download the archive.
Extract it to your /var/www directory
It should create the folder 'recoll-webui-master'

Double check the it didn't double down on the directories:

Go to /var/www/recoll-webui-master and make sure the files are there and not further in a sub-directory or you will get errors.

Next edit the file;

/etc/apache2/mods-enabled/wsgi.conf

add the following at the end of the "IfModule" section but not after .

WSGIDaemonProcess recoll user=dockes group=dockes
threads=1 processes=5 display-name=%{GROUP}
python-path=/var/www/recoll-webui-master
WSGIScriptAlias /recoll /var/www/recoll-webui-master/webui-wsgi.py

WSGIProcessGroup recoll
Order allow,deny
allow from all

I dont know if the formatting change posting here effects the functionality, if it does refer to the authors documentation for the original formatting.

Change the user and group (dockes in the example) taking care that he is the one who owns the index (.recoll is in his home directory).

Note
the Recoll WebUI application is mostly single-threaded, so it is of little use (and may actually be counter-productive in some cases) to specify multiple threads on the WSGIDaemonProcess line. Specify multiple processes instead to put multiple CPUs to work on simultaneous requests.

Then run the following to restart Apache:

sudo apachectl restart

Note
Take care that you need a / at the end of the URL used to access the search (use: http://my.server.com/recoll/, not http://my.server.com/recoll), else files other than the script itself are not found (the page looks weird and the search does not work).

Once you have this all set up you need to run recoll and index the desired folders which apparently can be any folder in your system so be careful not to index folders you do not want exposed.

Also to view the files over a network your going to have to make a setting change.

enter image description here

Sources:

https://www.lesbonscomptes.com/recoll/download.html

https://www.lesbonscomptes.com/recoll/pages/recoll-webui-install-wsgi.html

https://github.com/koniu/recoll-webui

I hope this helps! Its not 100% what I need but its close and will work fine till I get the time to modify the code to suit the modest changes I want.

edited Feb 28 at 1:13

answered Feb 27 at 21:42

TheMegolith

5461511

add a comment |

First install the repo and software;

sudo add-apt-repository ppa:recoll-backports/recoll-1.15-on

sudo apt-get update 

sudo apt-get install -y recoll python-recoll

Install mod-wsgi

sudo apt-get install -y libapache2-mod-wsgi

Get the github repository for the recoll webui:

https://github.com/koniu/recoll-webui

Simply click the 'clone or download button' to download the archive.
Extract it to your /var/www directory
It should create the folder 'recoll-webui-master'

Double check the it didn't double down on the directories:

Go to /var/www/recoll-webui-master and make sure the files are there and not further in a sub-directory or you will get errors.

Next edit the file;

/etc/apache2/mods-enabled/wsgi.conf

add the following at the end of the "IfModule" section but not after .

WSGIDaemonProcess recoll user=dockes group=dockes
threads=1 processes=5 display-name=%{GROUP}
python-path=/var/www/recoll-webui-master
WSGIScriptAlias /recoll /var/www/recoll-webui-master/webui-wsgi.py

WSGIProcessGroup recoll
Order allow,deny
allow from all

I dont know if the formatting change posting here effects the functionality, if it does refer to the authors documentation for the original formatting.

Change the user and group (dockes in the example) taking care that he is the one who owns the index (.recoll is in his home directory).

Note
the Recoll WebUI application is mostly single-threaded, so it is of little use (and may actually be counter-productive in some cases) to specify multiple threads on the WSGIDaemonProcess line. Specify multiple processes instead to put multiple CPUs to work on simultaneous requests.

Then run the following to restart Apache:

sudo apachectl restart

Note
Take care that you need a / at the end of the URL used to access the search (use: http://my.server.com/recoll/, not http://my.server.com/recoll), else files other than the script itself are not found (the page looks weird and the search does not work).

Once you have this all set up you need to run recoll and index the desired folders which apparently can be any folder in your system so be careful not to index folders you do not want exposed.

Also to view the files over a network your going to have to make a setting change.

enter image description here

Sources:

https://www.lesbonscomptes.com/recoll/download.html

https://www.lesbonscomptes.com/recoll/pages/recoll-webui-install-wsgi.html

https://github.com/koniu/recoll-webui

I hope this helps! Its not 100% what I need but its close and will work fine till I get the time to modify the code to suit the modest changes I want.

edited Feb 28 at 1:13

answered Feb 27 at 21:42

TheMegolith

5461511

First install the repo and software;

sudo add-apt-repository ppa:recoll-backports/recoll-1.15-on

sudo apt-get update 

sudo apt-get install -y recoll python-recoll

Install mod-wsgi

sudo apt-get install -y libapache2-mod-wsgi

Get the github repository for the recoll webui:

https://github.com/koniu/recoll-webui

Simply click the 'clone or download button' to download the archive.
Extract it to your /var/www directory
It should create the folder 'recoll-webui-master'

Double check the it didn't double down on the directories:

Go to /var/www/recoll-webui-master and make sure the files are there and not further in a sub-directory or you will get errors.

Next edit the file;

/etc/apache2/mods-enabled/wsgi.conf

add the following at the end of the "IfModule" section but not after .

WSGIDaemonProcess recoll user=dockes group=dockes
threads=1 processes=5 display-name=%{GROUP}
python-path=/var/www/recoll-webui-master
WSGIScriptAlias /recoll /var/www/recoll-webui-master/webui-wsgi.py

WSGIProcessGroup recoll
Order allow,deny
allow from all

I dont know if the formatting change posting here effects the functionality, if it does refer to the authors documentation for the original formatting.

Change the user and group (dockes in the example) taking care that he is the one who owns the index (.recoll is in his home directory).

Note
the Recoll WebUI application is mostly single-threaded, so it is of little use (and may actually be counter-productive in some cases) to specify multiple threads on the WSGIDaemonProcess line. Specify multiple processes instead to put multiple CPUs to work on simultaneous requests.

Then run the following to restart Apache:

sudo apachectl restart

Note
Take care that you need a / at the end of the URL used to access the search (use: http://my.server.com/recoll/, not http://my.server.com/recoll), else files other than the script itself are not found (the page looks weird and the search does not work).

Once you have this all set up you need to run recoll and index the desired folders which apparently can be any folder in your system so be careful not to index folders you do not want exposed.

Also to view the files over a network your going to have to make a setting change.

enter image description here

Sources:

https://www.lesbonscomptes.com/recoll/download.html

https://www.lesbonscomptes.com/recoll/pages/recoll-webui-install-wsgi.html

https://github.com/koniu/recoll-webui

I hope this helps! Its not 100% what I need but its close and will work fine till I get the time to modify the code to suit the modest changes I want.

edited Feb 28 at 1:13

answered Feb 27 at 21:42

TheMegolith

5461511

edited Feb 28 at 1:13

answered Feb 27 at 21:42

TheMegolith

5461511

answered Feb 27 at 21:42

TheMegolith

5461511

answered Feb 27 at 21:42

TheMegolith

5461511

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Ask Ubuntu!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Bdtyktl