UTF-8, not UTF-8 with BOM, to read URLs in CSV
https://stackoverflow.com/a/70139072/10789707
python
4 years ago
Error #1 (line 10 used print(row[0]) where print(row) was needed, as at line 27)
Snippet #1: (returns KeyError: 'url')
import csv
with open('urlaa.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row[0])
Snippet #1 Error Output:
C:\Users\user\Desktop\urls>python urla.py
Traceback (most recent call last):
  File "C:\Users\user\Desktop\urls\urla.py", line 6, in <module>
    print(row["url"])
KeyError: 'url'
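Side note on indexing: csv.DictReader yields each row as a dict keyed by the header names, so positional access such as row[0] raises KeyError, while csv.reader yields plain lists that do support row[0]. A minimal sketch, assuming an in-memory CSV with a placeholder URL (example.com is not from the original file):

```python
import csv
import io

# Hypothetical in-memory stand-in for urlaa.csv (placeholder URL).
sample = "url\nhttps://example.com/a\n"

# DictReader: each row is a dict keyed by the header row.
dict_rows = list(csv.DictReader(io.StringIO(sample)))
print(dict_rows[0]["url"])

# Positional access fails because the dict has no key 0.
dict_error = None
try:
    dict_rows[0][0]
except KeyError as exc:
    dict_error = exc
    print("KeyError:", exc)

# csv.reader: each row is a list, so positional access works.
# Row 0 is the header, row 1 is the first data row.
list_rows = list(csv.reader(io.StringIO(sample)))
print(list_rows[1][0])
```

With DictReader, use the column name (row["url"]) or print the whole row to inspect which keys are actually present.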
Error #2 (UTF-8 with BOM, need UTF-8 only)
Snippet #2: lines 35 & 36 (the 'url' key is returned prefixed with the UTF-8 BOM characters)
with open('urlaa.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row)
        print(row["url"])
Snippet #2 Error Output:
C:\Users\user\Desktop\urls>python urla.py
{'url': 'https://stackoverflow.com/questions/70139037/reading-list-of-urls-from-csv-for-scraping-with-python-beautifulsoup-pandas'}
{'url': 'https://stackoverflow.com/questions/53486744/making-async-for-loops-in-python/53487199#53487199'}
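One way to confirm that a BOM is the cause is to look at the raw bytes and at the header names the reader sees. This is a rough sketch, not part of the original answer: it writes a throwaway file (the path and the example.com URL are placeholders) that starts with the UTF-8 BOM, then reads it back as plain UTF-8.

```python
import codecs
import csv
import os
import tempfile

# Hypothetical file for illustration: a "url" header preceded by a UTF-8 BOM.
path = os.path.join(tempfile.mkdtemp(), "urlaa_bom.csv")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + b"url\r\nhttps://example.com/a\r\n")

# Check the raw bytes for the BOM signature (EF BB BF).
with open(path, "rb") as f:
    has_bom = f.read(3) == codecs.BOM_UTF8
print("starts with BOM:", has_bom)

# Decoded as plain UTF-8, the BOM stays attached to the first header name,
# so the key is no longer exactly "url".
with open(path, newline="", encoding="utf-8") as f:
    fieldnames = csv.DictReader(f).fieldnames
print(fieldnames)
print("url" in fieldnames)
```

If the first header name carries an extra leading character, row["url"] fails with KeyError: 'url' even though the printed row appears to contain a 'url' column.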
Solution (create a UTF-8 encoded file and use print(row))
After creating a new "urlaa.txt" file, pasting in it
import csv
with open('urlaa.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row)
and saving it as UTF-8 with Notepad,
then opening it with Notepad++ and saving it as CSV (urlaa.csv),
I got the correct output:
C:\Users\user\Desktop\urls>python urla.py
{'url': 'https://stackoverflow.com/questions/70139037/reading-list-of-urls-from-csv-for-scraping-with-python-beautifulsoup-pandas'}
{'url': 'https://stackoverflow.com/questions/53486744/making-async-for-loops-in-python/53487199#53487199'}
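A possible alternative to re-saving the file by hand (not what was done above) is to open it with Python's "utf-8-sig" codec, which drops a leading BOM when one is present and otherwise behaves like UTF-8. A minimal sketch, assuming a hypothetical helper read_urls and placeholder files and URLs:

```python
import codecs
import csv
import os
import tempfile


def read_urls(path):
    """Return the 'url' column, tolerating an optional UTF-8 BOM.

    read_urls is a hypothetical helper name, not from the original answer.
    """
    with open(path, newline="", encoding="utf-8-sig") as f:
        return [row["url"] for row in csv.DictReader(f)]


# Two throwaway files with the same content, one with a BOM and one without.
payload = b"url\r\nhttps://example.com/a\r\nhttps://example.com/b\r\n"
tmp_dir = tempfile.mkdtemp()
with_bom = os.path.join(tmp_dir, "with_bom.csv")
without_bom = os.path.join(tmp_dir, "without_bom.csv")
with open(with_bom, "wb") as f:
    f.write(codecs.BOM_UTF8 + payload)
with open(without_bom, "wb") as f:
    f.write(payload)

urls_bom = read_urls(with_bom)
urls_plain = read_urls(without_bom)
print(urls_bom)
print(urls_plain)
```

Both calls should give the same list of URLs, so the script does not depend on which editor saved the CSV.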
Finally, the code executed correctly and produced the CSV.
Working Code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import csv
#GET TEXT
def getPageText(url):
    # given a url, get page content
    data = urlopen(url).read()
    # parse as html structured document
    soup = BeautifulSoup(data, 'html.parser')
    # kill javascript content
    for s in soup(["script", "style"]):
        s.replaceWith('')
    #
    for p in soup.find_all('p')[1]:
        lnk = p.get_text()
        print(lnk)
    #
    # find body and extract text
    p = soup.find("div", attrs={'class': 'article-content retro-folders'})
    p.append(p.get_text())
    x = p.text
    y = x.replace("\r", "").replace("\n", "")
    print(y)
    # Compiling the info
    lnktxt_data = [lnk, y]
    # Append the info to the complete dataset
    url_txt.append(lnktxt_data)
url_txt = []
#Get text from multiple urls
def main():
    with open('urlaa.csv', newline='') as csv_file:
        reader = csv.DictReader(csv_file)
        urls = [row["url"] for row in reader]
    txt = [getPageText(url) for url in urls]
    for t in txt:
        print(t)
if __name__=="__main__":
    main()
#FRAME DATA
# Making the dataframe
url_txt = pd.DataFrame(url_txt, columns = ['lnk', 'y'])
url_txt.head()
#CREATE A FILE
# Save as CSV File
url_txt.to_csv('url_txt.csv',index=False)