html - How to scrape href with Python 3.5 and BeautifulSoup
I want to scrape the href of every project on the website https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1 with Python 3.5 and BeautifulSoup.
That's my code:
    #loading libraries
    import urllib
    import urllib.request
    from bs4 import BeautifulSoup

    #define url for scraping
    theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
    thepage = urllib.request.urlopen(theurl)

    #cooking the soup
    soup = BeautifulSoup(thepage, "html.parser")

    #scraping the "link" (href)
    project_ref = soup.findAll('h6', {'class': 'project-title'})
    project_href = [project.findChildren('a')[0].href for project in project_ref if project.findChildren('a')]

    print(project_href)
I get [None, None, ..., None, None] back. I need a list of the hrefs from that class.
Any ideas?
Try this:
    import urllib.request
    from bs4 import BeautifulSoup

    theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
    thepage = urllib.request.urlopen(theurl)
    soup = BeautifulSoup(thepage, "html.parser")

    project_href = [i['href'] for i in soup.find_all('a', href=True)]
    print(project_href)
This will return all the href instances. As you can see in your link, a lot of the href tags just have # inside them. You can avoid these with a simple filter for proper links, or just ignore the # symbols. (As a side note, your original list comprehension returned None values because tag.href looks for a child tag named href rather than the href attribute; tag['href'] is what you wanted.)
    project_href = [i['href'] for i in soup.find_all('a', href=True) if i['href'] != "#"]
This will still give you trash links like /discover?ref=nav, so if you want to narrow it down, use a proper regex for the links you need.
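For instance, here is a minimal sketch of that regex approach; it assumes the project URLs contain /projects/ in their path, which you should check against the actual page source:

    import re
    import urllib.request
    from bs4 import BeautifulSoup

    theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
    soup = BeautifulSoup(urllib.request.urlopen(theurl), "html.parser")

    # keep only anchors whose href matches the (assumed) project URL pattern
    project_pattern = re.compile(r"/projects/")
    project_href = [a['href'] for a in soup.find_all('a', href=project_pattern)]
    print(project_href)

find_all accepts a compiled regex as the value of an attribute filter, so this drops the # and /discover?ref=nav links in one pass.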
Edit:
To solve the problem mentioned in the comments:
    soup = BeautifulSoup(thepage, "html.parser")
    for i in soup.find_all('div', attrs={'class': 'project-card-content'}):
        print(i.a['href'])
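Putting it together as a rough, self-contained sketch (assuming the cards still carry the project-card-content class and that the hrefs may be relative), you can also normalise the links to absolute URLs:

    import urllib.request
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
    soup = BeautifulSoup(urllib.request.urlopen(theurl), "html.parser")

    project_links = []
    for card in soup.find_all('div', attrs={'class': 'project-card-content'}):
        link = card.a  # first <a> inside the card
        if link and link.get('href'):
            # urljoin turns a relative href like /projects/... into a full URL
            project_links.append(urljoin(theurl, link['href']))

    print(project_links)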