html - How to scrape href with Python 3.5 and BeautifulSoup


I want to scrape the href of every project from the website https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1 with Python 3.5 and BeautifulSoup.

This is my code:

# Loading libraries
import urllib
import urllib.request
from bs4 import BeautifulSoup

# Define the URL for scraping
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)

# Cooking the soup
soup = BeautifulSoup(thepage, "html.parser")

# Scraping "link" (href)
project_ref = soup.findAll('h6', {'class': 'project-title'})
project_href = [project.findChildren('a')[0].href for project in project_ref if project.findChildren('a')]
print(project_href)

I get [None, None, .... None, None] back. I need a list with all the hrefs for that class.

Any ideas?

Try this:

import urllib.request
from bs4 import BeautifulSoup

theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)

# Naming the parser explicitly avoids the "no parser was specified" warning
soup = BeautifulSoup(thepage, "html.parser")

# Collect the href of every <a> tag that has one
project_href = [i['href'] for i in soup.find_all('a', href=True)]
print(project_href)

This will return all the href instances. As I see in your link, a lot of the href attributes have # in them. You can filter these out with a simple regex that matches proper links, or just ignore the # entries.

project_href = [i['href'] for i in soup.find_all('a', href=True) if i['href'] != "#"]

This will still give you some trash links like /discover?ref=nav, so if you want to narrow it down, use a proper regex for the links you need.
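As a rough sketch of that regex filtering, the snippet below works on a plain list of href strings such as project_href. The sample hrefs, the function name project_links, and the pattern itself are assumptions made for illustration: it supposes that project links contain /projects/<creator>/<slug>, which should be checked against the real page before relying on it.

```python
import re

# Hypothetical sample of scraped hrefs; real values depend on the page.
hrefs = [
    "#",
    "/discover?ref=nav",
    "/projects/alice/smart-lamp?ref=category",
    "https://www.kickstarter.com/projects/bob/tiny-robot",
]

# Assumed pattern: a project link contains /projects/<creator>/<slug>.
PROJECT_LINK = re.compile(r"/projects/[^/]+/[^/?#]+")


def project_links(links):
    """Return only the links that match the assumed project pattern."""
    return [link for link in links if PROJECT_LINK.search(link)]


print(project_links(hrefs))
```

With a pattern like this the separate check for "#" is no longer needed, since a bare # never matches.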

EDIT:

To solve the problem mentioned in the comments:

soup = BeautifulSoup(thepage, "html.parser")

# Print the href of the first <a> inside each project card
for i in soup.find_all('div', attrs={'class': 'project-card-content'}):
    print(i.a['href'])
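For reference, the [None, None, ...] result in the question most likely comes from the .href access: in BeautifulSoup, dot access on a tag searches for a child tag with that name rather than reading an attribute. The small sketch below shows the difference on a hypothetical piece of markup, which is invented here and only loosely modelled on the selectors used in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the question's selectors.
html = '<h6 class="project-title"><a href="/projects/alice/smart-lamp">Smart Lamp</a></h6>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("h6", {"class": "project-title"}).find("a")

dot_access = link.href         # searches for a child <href> tag, so this is None
key_access = link["href"]      # reads the attribute; raises KeyError if it is missing
get_access = link.get("href")  # reads the attribute; returns None if it is missing

print(dot_access, key_access, get_access)
```

So replacing .href with ['href'] (or .get('href')) in the question's list comprehension should be enough to get the attribute values, provided the h6 elements with that class are actually present in the downloaded page.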
