Need to create a dataset on news using python










I need to create a dataset of news articles. I want to extract every news article that has ever been posted on a given news website. I have written this code:



import requests 
from bs4 import BeautifulSoup
import pandas
import csv
from datetime import datetime

records = []

def cnbc(base_url):
    # Fetch the article page and parse it
    r = requests.get(base_url)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")

    # Article title, stripped of carriage returns and newlines
    Title = soup.find("h1", {"class": "title"}).text.replace("\r", "").replace("\n", "")

    # Concatenate all paragraph text, then slice off boilerplate at both ends
    content = ' '
    for content_tag in soup.find_all("p"):
        content = content + content_tag.text.replace("\r", "").replace("\n", "")
    content = content[18:-458]

    Country = 'United States'
    website = 'https://www.cnbc.com/'
    comments = ''
    genre = 'Political'

    # Publication date, reformatted as dd-mm-YYYY
    date = soup.find("time", {"class": "datestamp"}).text[35:-2].replace("\r", "").replace("\n", "")
    d = datetime.strptime(date, "%d %b %Y")
    date = d.strftime("%d-%m-%Y")

    records.append((Title, content, Country, website, comments, genre, date))

cnbc("https://www.cnbc.com/2018/11/02/here-are-the-three-things-pulling-down-the-stock-market-again.html")


But this only lets me extract one news article.

Can anyone tell me how I can extract all the news article URLs from the root of the website?
python beautifulsoup dataset






asked Nov 10 '18 at 9:19
Ahmed
  • Of course, you need to get the page content from https://www.cnbc.com/ for all the latest news.
    – ewwink
    Nov 10 '18 at 12:45
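For illustration, here is a minimal sketch of what this comment suggests: fetch the homepage, collect links that look like dated article URLs, and feed each one into the cnbc() function from the question. The '/2018/11/' link pattern and the homepage structure are assumptions and may need adjusting.

import requests
from bs4 import BeautifulSoup

def collect_article_links(home_url="https://www.cnbc.com/"):
    # Gather anchors whose href looks like a dated article URL.
    # The '/2018/11/' pattern is an assumption; adjust it to the site's URL scheme.
    r = requests.get(home_url)
    soup = BeautifulSoup(r.content, "html.parser")
    links = set()
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if "/2018/11/" in href:
            if href.startswith("/"):
                href = "https://www.cnbc.com" + href
            links.add(href)
    return sorted(links)

# Each collected link could then be passed to the question's cnbc() function:
# for link in collect_article_links():
#     cnbc(link)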
2 Answers
This is a Python 3 script and it is not flawless, but I hope it can serve as a starting point so you can achieve what you are trying to do. I am not sure whether the site you are trying to scrape allows such an operation, so I will not fill in its web address for the constants WEB_SITE_BASE_URL and WEB_SITE_REGION_URL; it is your choice what you put there.



import requests
from bs4 import BeautifulSoup
from datetime import datetime

# e.g. "https://www.xxxx.com"
WEB_SITE_BASE_URL = ""
# e.g. "https://www.xxxx.com/?region=us"
WEB_SITE_REGION_URL = ""

def get_categories(web_site_base_url):
    # Read the navigation menu and return the category names
    r = requests.get(web_site_base_url)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    spans = soup.find_all(attrs="nav-menu-buttonText")
    categories = [category.text for category in spans]
    return categories

def get_links(category_url):
    # Collect all links on a category page that look like dated articles
    r = requests.get(category_url)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    links = [a.get('href') for a in soup.find_all('a', href=True)]
    filtered_links = list(set([k for k in links if '/2018/11/' in k]))
    return filtered_links

def news(link):
    # Scrape a single article page into one record
    r = requests.get(link)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    Title = soup.find("h1", {"class": "title"}).text.replace("\r", "").replace("\n", "")
    content = ' '
    for content_tag in soup.find_all("p"):
        content = content + content_tag.text.replace("\r", "").replace("\n", "")
    content = content[18:-458]
    Country = 'United States'
    website = WEB_SITE_BASE_URL
    comments = ''
    date = soup.find("time", {"class": "datestamp"}).text[35:-2].replace("\r", "").replace("\n", "")
    d = datetime.strptime(date, "%d %b %Y")
    date = d.strftime("%d-%m-%Y")
    spans = soup.find_all(attrs="header_title last breadcrumb")
    categories = [category.text for category in spans]
    genre = categories
    return (Title, content, Country, website, comments, genre, date)

categories = get_categories(WEB_SITE_REGION_URL)
list_of_link_lists = []
for category in categories:
    # URL-encode spaces in the category name
    list_of_link_lists.append(get_links(WEB_SITE_BASE_URL + "/" + category.replace(" ", "%20")))
flat_link_list = list(set([item for sublist in list_of_link_lists for item in sublist]))
articles_list = []
for link in flat_link_list:
    try:
        articles_list.append(news(WEB_SITE_BASE_URL + link))
    except Exception:
        print("Something was wrong")
        continue

print(articles_list)






answered Nov 10 '18 at 20:57
jaskowitchious
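Since the end goal is a dataset, one way to turn the collected articles_list into a CSV file would be a short pandas step; the column names below are assumptions that simply mirror the tuple order used above.

import pandas as pd

def save_records(records, path="news_dataset.csv"):
    # Columns mirror the tuple order:
    # (Title, content, Country, website, comments, genre, date)
    columns = ["title", "content", "country", "website", "comments", "genre", "date"]
    df = pd.DataFrame(records, columns=columns)
    df.to_csv(path, index=False, encoding="utf-8")

# e.g. save_records(articles_list) after the scraping loop above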
There is a rough way to extract part of (not all) the news; the method is shown in my code. First, extract every div whose class is headline with news_headline = soup.find_all("div", class_="headline"), then check whether each element is what we want.



new = []
for div in news_headline:
    each = [None, None]
    if div.a:
        each[0] = url + div.a.get("href")
        if div.a.text:
            # use split() to remove \t, \n and blank space
            each[1] = " ".join(div.a.text.split())
        else:
            each[1] = " ".join(div.a.get("title").split())
        new.append(each)
    else:
        continue


Here is the full code, written as concisely as I could.



import requests
from bs4 import BeautifulSoup

def index(url="https://www.cnbc.com/world/"):
    with requests.Session() as se:
        se.encoding = "UTF-8"
        res = se.get(url)
        text = res.text
        soup = BeautifulSoup(text, "lxml")
        news_headline = soup.find_all("div", class_="headline")
        news_ = [(url + div.a.get("href"),
                  " ".join(div.a.text.split()) if div.a.text else " ".join(div.a.get("title").split()))
                 for div in news_headline if div.a]
        print(news_)

index()






answered Nov 11 '18 at 4:36
kcorlidy
  • kcorlidy, it is a good technique, but it only gives results from 6 Nov 2018 to 9 Nov 2018; I want to extract all the news that has ever been posted on that news website.
    – Ahmed
    Nov 11 '18 at 8:53

  • Yes, as I said, this extracts only part of the news; we need a URL (or URLs) to do that, such as https://www.cnbc.com/us-news/ or https://www.cnbc.com/pre-markets/. But I am not sure how many sections the site has. @Ahmed
    – kcorlidy
    Nov 11 '18 at 9:05

  • Yes kcorlidy, you are right; that was the basic problem. I tried iterating through the root link, but it only gave news from at most the last two days. Thanks anyway. Kindly tell me more if you find anything.
    – Ahmed
    Nov 12 '18 at 10:40

  • @Ahmed you can extract them from the menu, then pass each URL into my index() and extract each page of news.
    – kcorlidy
    Nov 12 '18 at 11:28
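A minimal sketch of what this last comment suggests, under assumptions: the section URLs below are only the ones mentioned in the comments (the site's menu may list more), and section_headlines() is a hypothetical variant of index() above that returns its (href, title) pairs instead of printing them.

import requests
from bs4 import BeautifulSoup

# Section pages mentioned in the comments; the real menu may list more.
SECTION_URLS = [
    "https://www.cnbc.com/world/",
    "https://www.cnbc.com/us-news/",
    "https://www.cnbc.com/pre-markets/",
]

def section_headlines(url):
    # Same idea as index() above, but returns the (href, title) pairs.
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    news_headline = soup.find_all("div", class_="headline")
    return [
        (div.a.get("href"),
         " ".join(div.a.text.split()) or " ".join(div.a.get("title", "").split()))
        for div in news_headline
        if div.a
    ]

all_news = []
for section in SECTION_URLS:
    all_news.extend(section_headlines(section))

print(len(all_news), "headlines collected")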















