Set up a file indexer and document retrieval search engine for an Apache-Ubuntu public web server?



























I'm not sure whether I'm just not searching the right way, but I can't seem to find a straight answer for my project's needs, so here goes. I apologize if this is a duplicate post.



My company has an Apache web server, and we just set up a new section with all of our research documents in HTML and PDF format. Our remote clients need to be able to enter search criteria to find the documents they are looking for on the web server. The documents are in multiple folders, all stored locally on the web server. It is very tedious for our clients to manually browse the directory index to find the documents they require. Not only the file names and metadata but also the contents of the files need to be indexed. I have done this very easily on Microsoft web servers with their built-in indexing and search feature, but doing the same on Ubuntu with apache2 is proving elusive.



How can I set the system up to provide this document search and retrieval, in a style resembling Google but only for local content, accessed remotely through a web browser?



Thank you for your input!

































  • You could convert that directory into a WebDAV directory, then have your clients connect using their file managers. They can then use the file manager's search tool to search for file names. Or, if they have grep, cat, ls, tree, etc., they could use those to do the full-text search.

    – ptetteh227
    Feb 27 at 17:24













  • Thank you for your reply, but that is really not how we're looking to do this. It needs to be user-friendly for them. Otherwise, we would have created a remotely accessible file share over a VPN and had them use something like 'Recoll' to index it.

    – TheMegolith
    Feb 27 at 17:30
















apache2 search
















asked Feb 27 at 16:57
TheMegolith























1 Answer














Well, a buddy of mine saw my post and texted me that Recoll actually has a web interface and that I should look into it. It does, it works, and setup isn't too cumbersome. I should note that it is written in Python and very customizable. This is the procedure for Ubuntu users, though it works for just about anything: just follow the instructions on the author's page. Links to the source material and instructions for other platforms are at the end of this answer. I should also note that the documentation is subpar, so you may have to piece together a final solution like I did:



First, add the PPA and install the software:



sudo add-apt-repository ppa:recoll-backports/recoll-1.15-on
sudo apt-get update
sudo apt-get install -y recoll python-recoll


Install mod_wsgi:



sudo apt-get install -y libapache2-mod-wsgi


I strongly recommend having apache2 fully set up beforehand; otherwise you may get errors about the fully qualified domain name and IP address. Setting ServerName to the server's local IP address should fix this.
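As a rough sketch of the ServerName fix, the helper below writes a one-line Apache config snippet. The function name, the file name `servername.conf`, and the address `192.168.1.10` are placeholders I made up for illustration; the `conf-available` directory and `a2enconf` are the usual Ubuntu layout, but check your own setup before copying anything.

```shell
#!/bin/sh
# Hypothetical helper: write a one-line config snippet that sets ServerName.
# $1 = file to write, $2 = host name or IP address to use as ServerName.
write_servername_conf() {
    printf 'ServerName %s\n' "$2" > "$1"
}

# Example (not run here; the IP address is a placeholder):
#   write_servername_conf /tmp/servername.conf 192.168.1.10
#   sudo cp /tmp/servername.conf /etc/apache2/conf-available/servername.conf
#   sudo a2enconf servername
#   sudo apachectl configtest
```

Keeping the directive in its own snippet means you can disable it again with `a2disconf servername` without touching the main config.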



Get the github repository for the recoll webui:



https://github.com/koniu/recoll-webui



Simply click the 'Clone or download' button to download the archive.
Extract it to your /var/www directory.
It should create the folder 'recoll-webui-master'.



Double-check that the extraction didn't nest the folder inside another folder of the same name:



Go to /var/www/recoll-webui-master and make sure the files are there and not further in a sub-directory or you will get errors.
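If you would rather check this from a terminal, something like the following could do it. The function name is made up for this sketch; the only thing it relies on is that `webui-wsgi.py` (the script referenced in the Apache config further down) should sit directly in the extracted folder.

```shell
#!/bin/sh
# Hypothetical helper: report whether the WebUI files sit directly in the
# given directory or one level too deep.
check_webui_layout() {
    dir="$1"
    if [ -f "$dir/webui-wsgi.py" ]; then
        echo "ok: webui-wsgi.py found in $dir"
        return 0
    fi
    # Look one level down for a nested copy created by the archive tool.
    for nested in "$dir"/*/webui-wsgi.py; do
        if [ -f "$nested" ]; then
            echo "nested: move the contents of $(dirname "$nested") up into $dir"
            return 1
        fi
    done
    echo "missing: no webui-wsgi.py under $dir"
    return 2
}

# Example (not run here):
#   check_webui_layout /var/www/recoll-webui-master
```

A return code of 1 means the files are one folder too deep and need to be moved up; 2 means the script was not found at all.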



Next, edit the file:



/etc/apache2/mods-enabled/wsgi.conf



Add the following at the end of the <IfModule> section, before the closing </IfModule> tag (not after it):




WSGIDaemonProcess recoll user=dockes group=dockes \
    threads=1 processes=5 display-name=%{GROUP} \
    python-path=/var/www/recoll-webui-master
WSGIScriptAlias /recoll /var/www/recoll-webui-master/webui-wsgi.py
<Directory /var/www/recoll-webui-master>
    WSGIProcessGroup recoll
    Order allow,deny
    allow from all
</Directory>




I don't know whether the formatting changes from posting here affect the functionality; if they do, refer to the author's documentation for the original formatting.



Change the user and group (dockes in the example) to the account that owns the index (the one whose home directory contains .recoll).



Make sure ~/.recoll is owned by the account being used on the server, with read and write permissions for that owner and read-only permissions for everyone else, or you will get a 500 Internal Server Error. Do not use 'root'!
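Here is roughly how those permissions could be applied from a shell. The function name and the account name `searchuser` are placeholders for this sketch; substitute whatever user you put on the WSGIDaemonProcess line.

```shell
#!/bin/sh
# Hypothetical helper: give the owner read/write access and everyone else
# read-only access, recursively. The capital X adds the execute (search) bit
# only on directories and on files that are already executable, so the index
# folders stay traversable without making plain files executable.
set_index_permissions() {
    chmod -R u=rwX,go=rX "$1"
}

# Example (not run here; "searchuser" is a placeholder account name):
#   sudo chown -R searchuser:searchuser /home/searchuser/.recoll
#   sudo -u searchuser sh -c 'chmod -R u=rwX,go=rX /home/searchuser/.recoll'
```

With this mode, regular files end up as 644 and directories as 755.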




Note
The Recoll WebUI application is mostly single-threaded, so it is of little use (and may actually be counter-productive in some cases) to specify multiple threads on the WSGIDaemonProcess line. Specify multiple processes instead to put multiple CPUs to work on simultaneous requests.




Then run the following to restart Apache:



sudo apachectl restart



Note
You need a / at the end of the URL used to access the search (use http://my.server.com/recoll/, not http://my.server.com/recoll); otherwise files other than the script itself are not found (the page looks weird and the search does not work).




Once you have this all set up, you need to run Recoll and index the desired folders. Apparently these can be any folders on your system, so be careful not to index folders you do not want exposed.
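On a server without a desktop, one way to limit what gets indexed is to list the folders explicitly in recoll.conf and run the command-line indexer. This is only a sketch: the function name and the path `/var/www/html/research` are assumptions of mine, while `topdirs` and `recollindex -c` come from Recoll itself. Note that the helper overwrites any existing recoll.conf in the target directory.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal recoll.conf that restricts indexing
# to the listed folders. WARNING: replaces an existing recoll.conf.
# $1 = Recoll configuration directory, remaining arguments = folders to index.
write_recoll_conf() {
    confdir="$1"
    shift
    mkdir -p "$confdir"
    # topdirs is a space-separated list of the directory trees to index.
    # Paths containing spaces would need quoting, which this sketch skips.
    printf 'topdirs = %s\n' "$*" > "$confdir/recoll.conf"
}

# Example (not run here; run it as the account that owns the index,
# and replace the placeholder path with your document folder):
#   write_recoll_conf "$HOME/.recoll" /var/www/html/research
#   recollindex -c "$HOME/.recoll"
```

Running `recollindex` again later (for example from that user's crontab) picks up new and changed documents.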



Also, to view the files over a network, you're going to have to make a setting change.



[Image: screenshot of the setting to change]



Sources:



https://www.lesbonscomptes.com/recoll/download.html



https://www.lesbonscomptes.com/recoll/pages/recoll-webui-install-wsgi.html



https://github.com/koniu/recoll-webui



I hope this helps! It's not 100% what I need, but it's close and will work fine until I get the time to modify the code to suit the modest changes I want.






answered Feb 27 at 21:42, edited Feb 28 at 1:13
TheMegolith





























