Need to create a dataset on news using Python
I need to create a dataset of news articles. I need to extract every article that has ever been posted on a given news website. I have written this code:
import requests
from bs4 import BeautifulSoup
import pandas
import csv
from datetime import datetime

records = []

def cnbc(base_url):
    r = requests.get(base_url)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    # strip stray carriage returns / newlines from the headline text
    Title = soup.find("h1", {"class": "title"}).text.replace("\r", "").replace("\n", "")
    content = ' '
    for content_tag in soup.find_all("p"):
        content = content + content_tag.text.replace("\r", "").replace("\n", "")
    content = content[18:-458]
    Country = 'United States'
    website = 'https://www.cnbc.com/'
    comments = ''
    genre = 'Political'
    date = soup.find("time", {"class": "datestamp"}).text[35:-2].replace("\r", "").replace("\n", "")
    d = datetime.strptime(date, "%d %b %Y")
    date = d.strftime("%d-%m-%Y")
    records.append((Title, content, Country, website, comments, genre, date))

cnbc("https://www.cnbc.com/2018/11/02/here-are-the-three-things-pulling-down-the-stock-market-again.html")
but this only allows me to extract one article.
Can anyone tell me how I can extract all the news URLs from the root of the website?
python beautifulsoup dataset
asked Nov 10 '18 at 9:19 by Ahmed
Of course, you need to get the page source from https://www.cnbc.com/ for all the latest news. – ewwink, Nov 10 '18 at 12:45
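To illustrate that comment, here is a minimal sketch of fetching the front page and harvesting article links. It assumes CNBC article URLs contain a /YYYY/MM/DD/ date segment; that filter and the link handling are my own guesses, not something confirmed in this thread.

# Sketch of the comment's suggestion: download the front page and collect article links.
# Assumption: article URLs carry a date segment such as /2018/11/02/ (not guaranteed).
import re
import requests
from bs4 import BeautifulSoup

def front_page_links(home="https://www.cnbc.com/"):
    soup = BeautifulSoup(requests.get(home).content, "html.parser")
    links = set()
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if re.search(r"/\d{4}/\d{2}/\d{2}/", href):  # looks like a dated article URL
            links.add(href if href.startswith("http") else home.rstrip("/") + href)
    return sorted(links)

# Each collected URL could then be passed to the cnbc() function from the question.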
2 Answers
This is a Python 3 script. It is not flawless, but I hope it can serve as a starting point so you can achieve what you are trying to do. I am not sure whether the site you are trying to scrape allows such an operation, so I will not put its web address into the constants WEB_SITE_BASE_URL and WEB_SITE_REGION_URL; what you put there is your choice.
import requests
from bs4 import BeautifulSoup
from datetime import datetime

# e.g. https://www.xxxx.com
WEB_SITE_BASE_URL = ""
# e.g. https://www.xxxx.com/?region=us
WEB_SITE_REGION_URL = ""

def get_categories(web_site_base_url):
    r = requests.get(web_site_base_url)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    spans = soup.find_all(attrs="nav-menu-buttonText")
    categories = [category.text for category in spans]
    return categories

def get_links(category_url):
    r = requests.get(category_url)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    links = [a.get('href') for a in soup.find_all('a', href=True)]
    filtered_links = list(set([k for k in links if '/2018/11/' in k]))
    return filtered_links

def news(link):
    r = requests.get(link)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    Title = soup.find("h1", {"class": "title"}).text.replace("\r", "").replace("\n", "")
    content = ' '
    for content_tag in soup.find_all("p"):
        content = content + content_tag.text.replace("\r", "").replace("\n", "")
    content = content[18:-458]
    Country = 'United States'
    website = WEB_SITE_BASE_URL
    comments = ''
    date = soup.find("time", {"class": "datestamp"}).text[35:-2].replace("\r", "").replace("\n", "")
    d = datetime.strptime(date, "%d %b %Y")
    date = d.strftime("%d-%m-%Y")
    spans = soup.find_all(attrs="header_title last breadcrumb")
    categories = [category.text for category in spans]
    genre = categories
    return (Title, content, Country, website, comments, genre, date)

categories = get_categories(WEB_SITE_REGION_URL)
list_of_link_lists = []
for category in categories:
    list_of_link_lists.append(get_links(WEB_SITE_BASE_URL + "/" + category.replace(" ", "%20")))
flat_link_list = list(set([item for sublist in list_of_link_lists for item in sublist]))
articles_list = []
for link in flat_link_list:
    try:
        articles_list.append(news(WEB_SITE_BASE_URL + link))
    except:
        print("Something went wrong")
        continue
print(articles_list)
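Since the goal in the question is a dataset (pandas and csv are imported there), here is a minimal sketch of saving articles_list to CSV. It assumes the seven-field tuple layout returned by news() above; the column names are labels I chose for illustration.

# Sketch only: persist the collected tuples with pandas.
# Column names are illustrative labels matching the tuple order used above.
import pandas as pd

columns = ["title", "content", "country", "website", "comments", "genre", "date"]
df = pd.DataFrame(articles_list, columns=columns)
df.to_csv("news_dataset.csv", index=False, encoding="utf-8")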
answered Nov 10 '18 at 20:57 by jaskowitchious
There is a rough method to extract a portion of all the news; the method is shown in my code. First, extract all the news items whose div class is headline: news_headline = soup.find_all("div", class_="headline"). Then check whether each element is what we want.
new = []
for div in news_headline:
    each = ["", ""]
    if div.a:
        each[0] = url + div.a.get("href")
        if div.a.text:
            # use split() to remove \t, \n and extra blank space
            each[1] = " ".join(div.a.text.split())
        else:
            each[1] = " ".join(div.a.get("title").split())
        new.append(each)
    else:
        continue
Here is the full code, written as short as I could make it.
import requests
from bs4 import BeautifulSoup

def index(url="https://www.cnbc.com/world/"):
    with requests.Session() as se:
        se.encoding = "UTF-8"
        res = se.get(url)
        text = res.text
        soup = BeautifulSoup(text, "lxml")
        news_headline = soup.find_all("div", class_="headline")
        news_ = [(url + div.a.get("href"),
                  " ".join(div.a.text.split()) if div.a.text else " ".join(div.a.get("title").split()))
                 for div in news_headline if div.a]
        print(news_)

index()
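One note on the link building in the comprehension above: url + div.a.get("href") can produce doubled slashes when the href is site-relative. Here is a small sketch of a safer join with urllib.parse.urljoin, offered as my suggestion rather than as part of the answer.

# Sketch: normalise hrefs with urljoin so relative, absolute and protocol-relative links all work.
from urllib.parse import urljoin

def make_absolute(base, href):
    # e.g. urljoin("https://www.cnbc.com/world/", "/2018/11/02/story.html")
    #      -> "https://www.cnbc.com/2018/11/02/story.html"
    return urljoin(base, href)

# Inside the comprehension above, urljoin(url, div.a.get("href")) could replace url + div.a.get("href").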
answered Nov 11 '18 at 4:36 by kcorlidy

kcorlidy, it is a good technique, but it only gives results from 6 Nov 2018 to 9 Nov 2018. I want to extract all the news ever posted on that news website. – Ahmed, Nov 11 '18 at 8:53
Yes, as I said, this extracts only part of all the news; we need a URL or URLs to do that, like https://www.cnbc.com/us-news/ or https://www.cnbc.com/pre-markets/. But I am not sure how many sections the site has. @Ahmed – kcorlidy, Nov 11 '18 at 9:05
Yes, kcorlidy, you are right; that was the basic problem. I tried iterating through the root link, but it only gave news from at most the last two days before the current time. By the way, thanks. Kindly tell me more if you find anything. – Ahmed, Nov 12 '18 at 10:40
@Ahmed, you can extract them from the menu, then put each URL into my index() and extract each page of news. – kcorlidy, Nov 12 '18 at 11:28
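A minimal sketch of what this last comment describes: collect section URLs from the site menu and run index() over each one. The path heuristic below is an assumption on my part, not a selector taken from the answer.

# Sketch of the comment: harvest section URLs from the navigation links,
# then call index() (defined in the answer above) on each section page.
# The "short top-level path" filter is an assumed heuristic, not a confirmed CNBC selector.
import requests
from bs4 import BeautifulSoup

def menu_sections(home="https://www.cnbc.com/"):
    soup = BeautifulSoup(requests.get(home).content, "html.parser")
    sections = set()
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # keep short top-level paths such as /us-news/ or /pre-markets/
        if href.startswith("/") and href.endswith("/") and href.count("/") == 2:
            sections.add("https://www.cnbc.com" + href)
    return sorted(sections)

for section_url in menu_sections():
    index(section_url)  # index() comes from the answer above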