UTF-8, not UTF-8 with BOM, to read URLs from a CSV
https://stackoverflow.com/a/70139072/10789707
python
3 years ago
Error #1: `print(row["url"])` raises `KeyError: 'url'` even though the CSV header row is `url`.

Snippet #1:

```python
import csv

with open('urlaa.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row["url"])
```

Snippet #1 error output:

```
C:\Users\user\Desktop\urls>python urla.py
Traceback (most recent call last):
  File "C:\Users\user\Desktop\urls\urla.py", line 6, in <module>
    print(row["url"])
KeyError: 'url'
```

Error #2: the CSV was saved as UTF-8 with BOM, but plain UTF-8 is needed. Printing the whole `row` shows the BOM characters prefixed to the `url` key, which is why the lookup in Snippet #1 fails.

Snippet #2:

```python
with open('urlaa.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row)
        print(row["url"])
```

Solution: create a UTF-8 (no BOM) encoded file and use `print(row)`.

After creating a new `urlaa.txt` file, pasting into it

```python
import csv

with open('urlaa.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row)
```

saving it as UTF-8 with Notepad, then opening it with Notepad++ and saving it as CSV (`urlaa.csv`), I got the correct output:

```
C:\Users\user\Desktop\urls>python urla.py
{'url': 'https://stackoverflow.com/questions/70139037/reading-list-of-urls-from-csv-for-scraping-with-python-beautifulsoup-pandas'}
{'url': 'https://stackoverflow.com/questions/53486744/making-async-for-loops-in-python/53487199#53487199'}
```

Finally the code executed all right and read the CSV.
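Re-saving the file works, but Python can also strip the BOM at read time with the `utf-8-sig` codec, so the Notepad-saved file could be used unchanged. A minimal sketch of both behaviors, using an in-memory CSV that stands in for `urlaa.csv`:

```python
import csv
import io

# Simulated contents of a CSV saved as "UTF-8 with BOM":
# the file starts with the byte sequence EF BB BF.
raw = b'\xef\xbb\xbfurl\nhttps://example.com/a\n'

# Decoding with plain utf-8 leaves the BOM attached to the header,
# so the first key becomes '\ufeffurl' and row["url"] raises KeyError.
bad_row = next(csv.DictReader(io.StringIO(raw.decode('utf-8'))))
print(list(bad_row.keys()))   # ['\ufeffurl']

# The utf-8-sig codec strips the BOM while decoding, so the header is clean.
# With a real file: open('urlaa.csv', newline='', encoding='utf-8-sig')
good_row = next(csv.DictReader(io.StringIO(raw.decode('utf-8-sig'))))
print(good_row["url"])        # https://example.com/a
```

This way the fix lives in the script rather than in how the file happened to be saved.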
Working code:

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import csv


# GET TEXT
def getPageText(url):
    # given a url, get page content
    data = urlopen(url).read()
    # parse as html structured document
    soup = BeautifulSoup(data, 'html.parser')
    # kill javascript content
    for s in soup(["script", "style"]):
        s.replaceWith('')
    #
    for p in soup.find_all('p')[1]:
        lnk = p.get_text()
        print(lnk)
    #
    # find body and extract text
    p = soup.find("div", attrs={'class': 'article-content retro-folders'})
    p.append(p.get_text())
    x = p.text
    y = x.replace("\r", "").replace("\n", "")
    print(y)
    # Compiling the info
    lnktxt_data = [lnk, y]
    # Append the info to the complete dataset
    url_txt.append(lnktxt_data)


url_txt = []


# Get text from multiple urls
def main():
    with open('urlaa.csv', newline='') as csv_file:
        reader = csv.DictReader(csv_file)
        urls = [row["url"] for row in reader]
    txt = [getPageText(url) for url in urls]
    for t in txt:
        print(t)


if __name__ == "__main__":
    main()

# FRAME DATA
# Making the dataframe
url_txt = pd.DataFrame(url_txt, columns=['lnk', 'y'])
url_txt.head()

# CREATE A FILE
# Save as CSV File
url_txt.to_csv('url_txt.csv', index=False)
```
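When it is unclear how an editor saved a file, the BOM can be detected directly from the first three bytes instead of round-tripping through Notepad and Notepad++. The helper below is a hypothetical addition, not part of the answer's code:

```python
def has_utf8_bom(path):
    # The UTF-8 BOM is the three-byte sequence EF BB BF at the very
    # start of the file; read the raw bytes and compare.
    with open(path, 'rb') as f:
        return f.read(3) == b'\xef\xbb\xbf'
```

If this returns `True` for the input CSV, either re-save it as plain UTF-8 (as described above) or open it with `encoding='utf-8-sig'` so the BOM is stripped on read.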