Really weird “UTF-8” code












I downloaded a database from IMDb in the form of a .tsv.gz file (similar to CSV). IMDb says the file is in UTF-8 (https://www.imdb.com/interfaces/?ref_=login), but when I opened it in Notepad and in Excel, all I saw was a bunch of Chinese-looking letters and symbols, so I'm assuming I cannot use it in Python. Does anyone know what happened or what to do?







python csv database utf-8






edited Feb 5 at 4:55







JTalbott

















asked Feb 5 at 4:43









JTalbott








  • 1





    Hey there, welcome to Super User! Consider adding the specific steps you took to open the file to help you get the best answer. For example, you mentioned opening the file with Python. Could you expand your question to include which methods or libraries you might have used? The more specific you can be about your steps so far, the faster someone will be able to understand your exact issue. Also, consider providing links to where "Imdb said that the file was in UTF-8..." so people can have a look at the references you're using. :)

    – baelx
    Feb 5 at 4:52






  • 1





    I added the link, but I did not try to read it in Python because it seems useless, and there are so many lines of data that I don't want to waste time. Also thanks for the tips.

    – JTalbott
    Feb 5 at 4:56
















2 Answers
    Thanks for that added detail.



    That file you've downloaded is compressed using gzip and if you try to view it as-is, it'll get interpreted as those characters you're seeing. You'll need to unpack it before you can view the text in Notepad or Excel.
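A quick way to confirm this from Python is to look at the first two bytes of the file: every gzip stream starts with the bytes 0x1f 0x8b. The sketch below is only an illustration; the function name `looks_gzipped` is made up for this example.

```python
# First two bytes of every gzip stream (the "magic number" from RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as f:  # binary mode: no text decoding is attempted
        return f.read(2) == GZIP_MAGIC
```

If `looks_gzipped("title.ratings.tsv.gz")` returns True, the file is still compressed, which would explain why a text editor shows unreadable characters.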



    As Dave mentions below, you should be able to use any number of zip/archiving tools to decompress it. You might also want to google "unpack .gz file on windows" and follow the steps.



    Once unpacked, you should get a file with a .tsv extension, as IMDb indicates.
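If you would rather not install an archiving tool, Python's standard library can do the unpacking as well. This is a minimal sketch, not the only way to do it; the helper name `gunzip` and the file names in the usage note are placeholders.

```python
import gzip
import shutil


def gunzip(src, dst):
    """Decompress the gzip file `src` into `dst`."""
    # copyfileobj streams the data in chunks, so the whole file
    # never has to fit in memory at once.
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
```

For example, `gunzip("title.ratings.tsv.gz", "title.ratings.tsv")` should leave a plain .tsv file next to the download.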



    If you've already unpacked the .gz file and you're still seeing odd characters, you may need to import the file into Excel rather than simply open it. For that, see a guide on importing TSV files into Excel.



    Hope that helps! Feel free to comment with your progress.







    edited Feb 5 at 6:29









    grawity











    answered Feb 5 at 5:05









    baelx








    • 2





      No, it's not a tar. As the imdb page says, and the extension .tsv.gz indicates, it's a TSV file compressed with gzip. Any file can be processed with gzip (although some will not actually get shorter, so it's silly to do them) and while it's common to gzip files that are actually archives in tar format, it's also common to gzip other kinds of files. WinZip can indeed uncompress any .gz not just .tar.gz, and I believe 7zip also, and Excel can indeed open the (uncompressed) TSV -- if you have enough memory, because these files are large.

      – dave_thompson_085
      Feb 5 at 5:35








    • 1





      Oops! Right you are Dave. That’s just a mistake on my part but I imagine the underlying problem is the same and that an initial uncompressing is necessary before importing the file to Excel. I’ll update my answer

      – baelx
      Feb 5 at 5:48














        I have downloaded and used title.ratings.tsv.gz. There is no problem.



        These are the steps to open it:



        • uncompress it (if you are a Windows user, you can use the 7-Zip utility);

        • then simply open it.


        If you use Excel, you must use its import process (http://www.arj.no/2013/06/28/how-to-import-tsv-file-in-ms-excel/).
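Since the question mentions Python, here is a hedged sketch of reading the file there directly, without unpacking it first and without loading everything into memory. The function name `read_tsv_gz` and its `limit` parameter are invented for this example. It assumes the file has a header line and that fields are not quoted, which you should check against your own download.

```python
import csv
import gzip
from itertools import islice


def read_tsv_gz(path, limit=None):
    """Yield rows (dicts keyed by the header line) from a gzipped UTF-8 TSV file.

    `limit` caps the number of rows; None means read to the end.
    """
    # "rt" makes gzip decompress and decode to text in one step.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        # Assumption: tab-separated, no quoting, first line holds column names.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        yield from islice(reader, limit)
```

Something like `for row in read_tsv_gz("title.ratings.tsv.gz", limit=5): print(row)` is a cheap way to peek at the first few rows before committing to the whole file.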



        This is how it appears in Notepad: [screenshot not preserved]







        edited Feb 5 at 7:56

























        answered Feb 5 at 7:36









        aborruso





























