Need to create a dataset on news using Python
I need to create a dataset of news articles. I need to extract every article that has ever been posted on a given news website. I have written this code:
import requests
from bs4 import BeautifulSoup
import pandas
import csv
from datetime import datetime

records = []

def cnbc(base_url):
    r = requests.get(base_url)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    # strip stray carriage returns / newlines from the headline text
    Title = soup.find("h1", {"class": "title"}).text.replace("\r", "").replace("\n", "")
    content = ' '
    for content_tag in soup.find_all("p"):
        content = content + content_tag.text.replace("\r", "").replace("\n", "")
    content = content[18:-458]
    Country = 'United States'
    website = 'https://www.cnbc.com/'
    comments = ''
    genre = 'Political'
    date = soup.find("time", {"class": "datestamp"}).text[35:-2].replace("\r", "").replace("\n", "")
    d = datetime.strptime(date, "%d %b %Y")
    date = d.strftime("%d-%m-%Y")
    records.append((Title, content, Country, website, comments, genre, date))

cnbc("https://www.cnbc.com/2018/11/02/here-are-the-three-things-pulling-down-the-stock-market-again.html")
but this only allows me to extract one article.
Can anyone tell me how I can extract all the news URLs from the root of the website?
python beautifulsoup dataset
asked Nov 10 '18 at 9:19 by Ahmed
Of course, you need to get the page source from https://www.cnbc.com/ for all the latest news. – ewwink, Nov 10 '18 at 12:45
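To illustrate that comment, here is a minimal sketch of fetching the front page and harvesting article links. It assumes CNBC article URLs contain a /YYYY/MM/DD/ date segment; that filter and the link handling are my own guesses, not something confirmed in this thread.

# Sketch of the comment's suggestion: download the front page and collect article links.
# Assumption: article URLs carry a date segment such as /2018/11/02/ (not guaranteed).
import re
import requests
from bs4 import BeautifulSoup

def front_page_links(home="https://www.cnbc.com/"):
    soup = BeautifulSoup(requests.get(home).content, "html.parser")
    links = set()
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if re.search(r"/\d{4}/\d{2}/\d{2}/", href):  # looks like a dated article URL
            links.add(href if href.startswith("http") else home.rstrip("/") + href)
    return sorted(links)

# Each collected URL could then be passed to the cnbc() function from the question.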
2 Answers
This is a Python 3 script. It is not flawless, but I hope it can serve as a starting point so you can achieve what you are trying to do. I am not sure whether the site you are trying to scrape allows such an operation, so I will not put its web address into the constants WEB_SITE_BASE_URL and WEB_SITE_REGION_URL; what you put there is your choice.
import requests
from bs4 import BeautifulSoup
from datetime import datetime

# e.g. https://www.xxxx.com
WEB_SITE_BASE_URL = ""
# e.g. https://www.xxxx.com/?region=us
WEB_SITE_REGION_URL = ""

def get_categories(web_site_base_url):
    r = requests.get(web_site_base_url)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    spans = soup.find_all(attrs="nav-menu-buttonText")
    categories = [category.text for category in spans]
    return categories

def get_links(category_url):
    r = requests.get(category_url)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    links = [a.get('href') for a in soup.find_all('a', href=True)]
    filtered_links = list(set([k for k in links if '/2018/11/' in k]))
    return filtered_links

def news(link):
    r = requests.get(link)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    Title = soup.find("h1", {"class": "title"}).text.replace("\r", "").replace("\n", "")
    content = ' '
    for content_tag in soup.find_all("p"):
        content = content + content_tag.text.replace("\r", "").replace("\n", "")
    content = content[18:-458]
    Country = 'United States'
    website = WEB_SITE_BASE_URL
    comments = ''
    date = soup.find("time", {"class": "datestamp"}).text[35:-2].replace("\r", "").replace("\n", "")
    d = datetime.strptime(date, "%d %b %Y")
    date = d.strftime("%d-%m-%Y")
    spans = soup.find_all(attrs="header_title last breadcrumb")
    categories = [category.text for category in spans]
    genre = categories
    return (Title, content, Country, website, comments, genre, date)

categories = get_categories(WEB_SITE_REGION_URL)
list_of_link_lists = []
for category in categories:
    list_of_link_lists.append(get_links(WEB_SITE_BASE_URL + "/" + category.replace(" ", "%20")))
flat_link_list = list(set([item for sublist in list_of_link_lists for item in sublist]))
articles_list = []
for link in flat_link_list:
    try:
        articles_list.append(news(WEB_SITE_BASE_URL + link))
    except:
        print("Something went wrong")
        continue
print(articles_list)
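Since the goal in the question is a dataset (pandas and csv are imported there), here is a minimal sketch of saving articles_list to CSV. It assumes the seven-field tuple layout returned by news() above; the column names are labels I chose for illustration.

# Sketch only: persist the collected tuples with pandas.
# Column names are illustrative labels matching the tuple order used above.
import pandas as pd

columns = ["title", "content", "country", "website", "comments", "genre", "date"]
df = pd.DataFrame(articles_list, columns=columns)
df.to_csv("news_dataset.csv", index=False, encoding="utf-8")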
answered Nov 10 '18 at 20:57 by jaskowitchious
There is a rough method to extract a portion of all the news; the method is shown in my code. First, extract all the news items whose div class is headline: news_headline = soup.find_all("div", class_="headline"). Then check whether each element is what we want.
new = []
for div in news_headline:
    each = ["", ""]
    if div.a:
        each[0] = url + div.a.get("href")
        if div.a.text:
            # use split() to remove \t, \n and extra blank space
            each[1] = " ".join(div.a.text.split())
        else:
            each[1] = " ".join(div.a.get("title").split())
        new.append(each)
    else:
        continue
Here is the full code, written as short as I could make it.
import requests
from bs4 import BeautifulSoup

def index(url="https://www.cnbc.com/world/"):
    with requests.Session() as se:
        se.encoding = "UTF-8"
        res = se.get(url)
        text = res.text
        soup = BeautifulSoup(text, "lxml")
        news_headline = soup.find_all("div", class_="headline")
        news_ = [(url + div.a.get("href"),
                  " ".join(div.a.text.split()) if div.a.text else " ".join(div.a.get("title").split()))
                 for div in news_headline if div.a]
        print(news_)

index()
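One note on the link building in the comprehension above: url + div.a.get("href") can produce doubled slashes when the href is site-relative. Here is a small sketch of a safer join with urllib.parse.urljoin, offered as my suggestion rather than as part of the answer.

# Sketch: normalise hrefs with urljoin so relative, absolute and protocol-relative links all work.
from urllib.parse import urljoin

def make_absolute(base, href):
    # e.g. urljoin("https://www.cnbc.com/world/", "/2018/11/02/story.html")
    #      -> "https://www.cnbc.com/2018/11/02/story.html"
    return urljoin(base, href)

# Inside the comprehension above, urljoin(url, div.a.get("href")) could replace url + div.a.get("href").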
answered Nov 11 '18 at 4:36 by kcorlidy

kcorlidy, it is a good technique, but it only gives results from 6 Nov 2018 to 9 Nov 2018. I want to extract all the news ever posted on that news website. – Ahmed, Nov 11 '18 at 8:53
Yes, as I said, this extracts only part of all the news; we need a URL or URLs to do that, like https://www.cnbc.com/us-news/ or https://www.cnbc.com/pre-markets/. But I am not sure how many sections the site has. @Ahmed – kcorlidy, Nov 11 '18 at 9:05
Yes, kcorlidy, you are right; that was the basic problem. I tried iterating through the root link, but it only gave news from at most the last two days before the current time. By the way, thanks. Kindly tell me more if you find anything. – Ahmed, Nov 12 '18 at 10:40
@Ahmed, you can extract them from the menu, then put each URL into my index() and extract each page of news. – kcorlidy, Nov 12 '18 at 11:28
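A minimal sketch of what this last comment describes: collect section URLs from the site menu and run index() over each one. The path heuristic below is an assumption on my part, not a selector taken from the answer.

# Sketch of the comment: harvest section URLs from the navigation links,
# then call index() (defined in the answer above) on each section page.
# The "short top-level path" filter is an assumed heuristic, not a confirmed CNBC selector.
import requests
from bs4 import BeautifulSoup

def menu_sections(home="https://www.cnbc.com/"):
    soup = BeautifulSoup(requests.get(home).content, "html.parser")
    sections = set()
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # keep short top-level paths such as /us-news/ or /pre-markets/
        if href.startswith("/") and href.endswith("/") and href.count("/") == 2:
            sections.add("https://www.cnbc.com" + href)
    return sorted(sections)

for section_url in menu_sections():
    index(section_url)  # index() comes from the answer above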