Really weird “UTF-8” code
I downloaded a database from IMDb in the form of a tsv.gz (csv) file. IMDb said that the file was in UTF-8 (https://www.imdb.com/interfaces/?ref_=login), but when I looked at the file in Notepad and in Excel, it was a bunch of Chinese letters/symbols, so I'm assuming I cannot use it in Python. Does anyone know what happened or what to do?
python csv database utf-8
Hey there, welcome to Super User! Consider adding the specific steps you took to open the file, to help you get the best answer. For example, you mentioned using the file with Python. Could you expand your question to include which methods or libraries you might have used? The more specific you can be about your steps up until now, the faster someone will be able to understand your exact issue. Also, consider providing a link to where "Imdb said that the file was in UTF-8..." so people can have a look at the references you're using. :)
– baelx
Feb 5 at 4:52
I added the link, but I did not try to read it in Python because it seems useless, and there are so many lines of data that I don't want to waste time. Also thanks for the tips.
– JTalbott
Feb 5 at 4:56
asked Feb 5 at 4:43 by JTalbott; edited Feb 5 at 4:55 by JTalbott
2 Answers
Thanks for that added detail.
The file you downloaded is compressed with gzip. If you open it as-is, the program tries to interpret the compressed binary data as text, which produces the characters you're seeing. You'll need to unpack it before you can view the text in Notepad or Excel.
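To see why the raw download looks like gibberish, here is a minimal Python sketch using only the standard library. The sample row and the helper name `looks_gzipped` are made up for illustration. A gzip stream starts with the two bytes 0x1f 0x8b, and the compressed bytes are not valid UTF-8 text until they are decompressed.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(data: bytes) -> bool:
    """Return True if the bytes start with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


# Stand-in for the downloaded file: a tiny, made-up tab-separated table.
plain = "tconst\taverageRating\tnumVotes\ntt0000001\t5.7\t1900\n".encode("utf-8")
compressed = gzip.compress(plain)

print(looks_gzipped(plain))       # False
print(looks_gzipped(compressed))  # True

# The compressed bytes cannot be decoded as UTF-8 text...
try:
    compressed.decode("utf-8")
except UnicodeDecodeError:
    print("not readable as UTF-8 until decompressed")

# ...but after decompression the original UTF-8 text comes back.
print(gzip.decompress(compressed).decode("utf-8"))
```

On a real download you could read the first two bytes of the file in binary mode and pass them to the same check.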
As Dave mentions below, you should be able to use any number of zip/archiving tools to decompress it. You might also want to google "unpack .gz file on windows" and follow the steps.
Once unpacked, you should get a file with a .tsv extension, as IMDb indicates.
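If you would rather not install an archiving tool, the unpacking step can also be sketched in Python with the standard `gzip` and `shutil` modules. The helper name `gunzip` and the small demo file below are hypothetical stand-ins; point the paths at your own download instead.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip(src: Path, dst: Path) -> None:
    """Stream-decompress src (.gz) into dst without loading it all into memory."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)


# Demo on a tiny stand-in file with made-up contents.
workdir = Path(tempfile.mkdtemp())
src = workdir / "title.ratings.tsv.gz"
dst = workdir / "title.ratings.tsv"
with gzip.open(src, "wt", encoding="utf-8", newline="") as f:
    f.write("tconst\taverageRating\tnumVotes\ntt0000001\t5.7\t1900\n")

gunzip(src, dst)
print(dst.read_text(encoding="utf-8"))
```

Copying in chunks with `shutil.copyfileobj` matters here because the real files are large.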
If you've already unpacked the .gz file and you're still seeing odd characters, you may need to import the file into Excel rather than simply opening it. For that, see the following guide.
Hope that helps! Feel free to comment with your progress.
No, it's not a tar. As the imdb page says, and the extension .tsv.gz indicates, it's a TSV file compressed with gzip. Any file can be processed with gzip (although some will not actually get shorter, so it's silly to do them) and while it's common to gzip files that are actually archives in tar format, it's also common to gzip other kinds of files. WinZip can indeed uncompress any .gz not just .tar.gz, and I believe 7zip also, and Excel can indeed open the (uncompressed) TSV -- if you have enough memory, because these files are large.
– dave_thompson_085
Feb 5 at 5:35
Oops! Right you are Dave. That’s just a mistake on my part but I imagine the underlying problem is the same and that an initial uncompressing is necessary before importing the file to Excel. I’ll update my answer
– baelx
Feb 5 at 5:48
I have downloaded and used title.ratings.tsv.gz. There is no problem.
These are the steps to open it:
- uncompress it (if you are on Windows, you can use the 7-Zip utility);
- then simply open it.
If you use Excel, you must use the import process (http://www.arj.no/2013/06/28/how-to-import-tsv-file-in-ms-excel/).
In Notepad it appears this way (screenshot not reproduced here).
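Since the question is tagged python: the file can also be read directly from the .gz, without unpacking it first. Below is a rough sketch, not a definitive recipe. The `head` helper and the sample data are hypothetical, the column names should be checked against the real header, and `quoting=csv.QUOTE_NONE` is my assumption that the file does not use CSV-style quoting. It only pulls the first few rows, so it is quick even on a file with many lines.

```python
import csv
import gzip
import itertools
import tempfile
from pathlib import Path


def head(path, n=5):
    """Return the first n rows of a gzip-compressed TSV as dicts."""
    # "rt" decompresses and decodes to text in one step.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(itertools.islice(reader, n))


# Demo on a tiny stand-in file with made-up rows.
sample = Path(tempfile.mkdtemp()) / "sample.tsv.gz"
with gzip.open(sample, "wt", encoding="utf-8", newline="") as f:
    f.write("tconst\taverageRating\tnumVotes\n")
    f.write("tt0000001\t5.7\t1900\n")
    f.write("tt0000002\t5.8\t260\n")

for row in head(sample, n=2):
    print(row["tconst"], row["averageRating"], row["numVotes"])
```

All values come back as strings, so convert ratings and vote counts with `float()` and `int()` as needed.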
First answer: answered Feb 5 at 5:05 by baelx; edited Feb 5 at 6:29 by grawity
Second answer: answered Feb 5 at 7:36 by aborruso; edited Feb 5 at 7:56