News Scraping

# Add the directory two levels up to sys.path so that the newscrape package can be found
import sys
sys.path.append('../../')

# Load the configuration, then import the package components
from newscrape.config import CONFIG
CONFIG.load('../../config.toml')
from newscrape.db import NewsDBClient
from newscrape.webdriver import WebDriver
from newscrape.scraper.headline import NewsHeadlinePicker
from newscrape.scraper import NewsScraper

Create a News Scraper

scraper = NewsScraper(
    db_client=NewsDBClient.from_host_and_port(
        database_name=CONFIG.MONGODB_DATABASE_NAME,
        host=CONFIG.MONGODB_HOST,
        port=CONFIG.MONGODB_PORT,
    ),
    web_driver=WebDriver.on_port(0),
    headline_picker=NewsHeadlinePicker(),
    n_workers=10
)

A news scraper is initialized with the following parameters:

db_client: A MongoDB client that manages the news documents in the database.
web_driver: A Chrome web driver running in the background. It is used to access websites for which a simple GET request fails.
headline_picker: A news headline is usually wrapped in an h1 HTML tag. In practice, however, some websites have multiple h1 tags, some of which contain text other than the headline, so the correct one has to be picked. To do this automatically, the headline picker is powered by GPT.
n_workers: The maximum number of workers in the thread pool. The pool executor sends requests concurrently to save time.
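
The interplay between the plain GET request, the browser fallback, and the thread pool can be sketched as follows. This only illustrates the pattern described above and is not newscrape's actual implementation: fetch_with_fallback, fetch_all, simple_get, and browser_get are hypothetical names, and the two fetch callables are passed in so that the sketch runs without a network connection or a browser.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_with_fallback(url, simple_get, browser_get):
    """Try the cheap fetcher first; fall back to the browser-based one on failure."""
    try:
        return simple_get(url)
    except Exception:
        # Assumption: any exception from the simple request triggers the fallback.
        return browser_get(url)


def fetch_all(urls, simple_get, browser_get, n_workers=10):
    """Fetch all URLs concurrently and return a {url: page} mapping."""
    urls = list(urls)  # allow generators; the URLs are iterated twice below
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        pages = pool.map(
            lambda url: fetch_with_fallback(url, simple_get, browser_get), urls
        )
        return dict(zip(urls, pages))
```

Passing the fetchers in as arguments keeps the concurrency logic independent of any particular HTTP library or web driver.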

Scrape News

The goal is to scrape news from recent days and store it in MongoDB.

Get the current date:

from datetime import date, timedelta

today = date.today()
today

datetime.date(2023, 7, 14)

For example, suppose you want to

search for news related to the query 'Pwc aspen digital report',
limit the search to the past 3 days (two days ago through today), and
restrict the results to English.

In that case, call:

scraper.scrape_news(
    query='Pwc aspen digital report',
    date_start=today - timedelta(days=2),
    date_end=today,
    language='en'
)
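
To see why timedelta(days=2) corresponds to a 3-day window, the dates between date_start and date_end can be listed explicitly. The sketch below assumes that both endpoints are inclusive, which is not stated above; dates_in_window is a hypothetical helper, and a fixed date stands in for date.today() so that the result is reproducible.

```python
from datetime import date, timedelta


def dates_in_window(date_start, date_end):
    """List every calendar date from date_start to date_end, both inclusive."""
    n_days = (date_end - date_start).days + 1
    return [date_start + timedelta(days=i) for i in range(n_days)]


reference_day = date(2023, 7, 14)  # fixed stand-in for date.today()
window = dates_in_window(reference_day - timedelta(days=2), reference_day)
print([d.isoformat() for d in window])
# ['2023-07-12', '2023-07-13', '2023-07-14']
```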

The following are more examples:

scraper.scrape_news(
    query='Pwc digital asset custody report',
    date_start=today - timedelta(days=5),
    date_end=today,
    language='en'
)

scraper.scrape_news(
    query='Pwc aspen digital',
    date_start=today - timedelta(days=5),
    date_end=today,
    language='en'
)

scraper.scrape_news(
    query='羅兵咸永道 aspen digital',
    date_start=today - timedelta(days=5),
    date_end=today,
    language='zh'
)

scraper.scrape_news(
    query='羅兵咸永道 數字資產託管狀況報告',
    date_start=today - timedelta(days=5),
    date_end=today,
    language='zh'
)
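
Since these calls differ only in the query and the language, they could also be driven from a list. This is a minimal sketch, assuming scrape_news accepts exactly the keyword arguments used above; scrape_queries is a hypothetical helper and not part of newscrape.

```python
from datetime import date, timedelta


def scrape_queries(scraper, queries, days_back, date_end=None):
    """Call scraper.scrape_news once per (query, language) pair."""
    date_end = date_end or date.today()
    date_start = date_end - timedelta(days=days_back)
    for query, language in queries:
        scraper.scrape_news(
            query=query,
            date_start=date_start,
            date_end=date_end,
            language=language,
        )


QUERIES = [
    ('Pwc digital asset custody report', 'en'),
    ('Pwc aspen digital', 'en'),
    ('羅兵咸永道 aspen digital', 'zh'),
    ('羅兵咸永道 數字資產託管狀況報告', 'zh'),
]

# Usage with the scraper created earlier:
# scrape_queries(scraper, QUERIES, days_back=5)
```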

Fetch Documents From MongoDB

Get the database client from the scraper instance:

db_client = scraper.db_client

You may want to fetch all news inserted in the past several hours:

from newscrape.schema.news import FIELDS_OF_INTEREST

news_list = db_client.find_news_inserted_in_past_n_hours(
    # newly inserted news in the past 12 hours
    hours=12,
    # only the following fields are of interest
    fields=FIELDS_OF_INTEREST,
)

news_list

[{'date': '2023-07-11',
  'publication': 'Yahoo Finance',
  'headline': 'Asian Institutional Investors Look for Third Party Custody Solutions to Navigate Digital Asset Ecosystem, Find The PwC Aspen State of Digital Asset Custody Report',
  'link': 'https://finance.yahoo.com/news/asian-institutional-investors-look-third-123900424.html'},
 {'date': '2023-07-11',
  'publication': 'finews.asia',
  'headline': 'PwC: Better Custody Key to Institutional Crypto Adoption',
  'link': 'https://www.finews.asia/finance/39663-pwc-custody-key-to-institutional-crypto-adoption'},
 {'date': '2023-07-11',
  'publication': 'The Asset',
  'headline': 'Custody key to institutions picking digital assets',
  'link': 'https://www.theasset.com/article/49566/custody-key-to-institutions-picking-digital-assets'},
 {'date': '2023-07-11',
  'publication': 'Cointelegraph',
  'headline': 'Report: China to tighten rules around releasing generative AI tools',
  'link': 'https://cointelegraph.com/news/china-to-require-generative-ai-tools-to-obtain-license'},
 {'date': '2023-07-12',
  'publication': 'South China Morning Post',
  'headline': 'Hedge funds retreat from crypto after 2022 FTX scandal, PwC survey finds',
  'link': 'https://www.scmp.com/tech/tech-trends/article/3227451/hedge-funds-retreat-cryptocurrencies-after-2022-ftx-scandal-us-regulatory-crackdown-pwc-survey'},
 {'date': '2023-07-12',
  'publication': 'Cointelegraph',
  'headline': 'Google Play Store officially allows NFT games, but not gambling ones',
  'link': 'https://cointelegraph.com/news/google-play-store-officially-allows-nft-games-but-not-gambling'},
 {'date': '2023-07-11',
  'publication': 'Cointelegraph',
  'headline': 'Crypto custody market reached $448 billion in 2022: Report',
  'link': 'https://cointelegraph.com/news/crypto-report-the-crypto-custody-market-reached-448-billion-in-2022'},
 {'date': '2023-07-12',
  'publication': '新浪财经',
  'headline': '普华永道：82%高管寻求第三方托管加密资产愿意整合元宇宙',
  'link': 'https://finance.sina.com.cn/stock/hkstock/marketalerts/2023-07-12/doc-imzamnvv5551880.shtml'},
 {'date': '2023-07-11',
  'publication': '英为财情',
  'headline': '报告：数字资产行业已发展为价值1.2万亿美元市场',
  'link': 'https://cn.investing.com/news/stock-market-news/article-2219323'},
 {'date': '2023-07-12',
  'publication': '股票',
  'headline': '豪鹏科技: 世纪证券有限责任公司关于深圳市豪鹏科技股份有限公司向不特定对象发行可转换公司债券之上市保荐书（上会稿）',
  'link': 'https://stock.stockstar.com/notice/SN2023071300016626.shtml'}]
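
For intuition, a lookup by insertion time with a restricted set of fields maps naturally onto a MongoDB filter and projection. The sketch below only builds those two dictionaries; it is not how find_news_inserted_in_past_n_hours is actually implemented. The field name inserted_at and the helper build_recent_news_query are hypothetical, and the field list is inferred from the output above rather than taken from FIELDS_OF_INTEREST.

```python
from datetime import datetime, timedelta, timezone


def build_recent_news_query(hours, fields, now=None):
    """Return a (filter, projection) pair for documents inserted in the past `hours` hours."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=hours)
    # Hypothetical field name: assumes each document stores its insertion time.
    query_filter = {'inserted_at': {'$gte': cutoff}}
    # Include only the requested fields and suppress MongoDB's default _id.
    projection = {field: 1 for field in fields}
    projection['_id'] = 0
    return query_filter, projection


query_filter, projection = build_recent_news_query(
    hours=12,
    fields=['date', 'publication', 'headline', 'link'],
)
# With pymongo, these would be passed as collection.find(query_filter, projection).
```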

Convert the fetched news documents to a pandas DataFrame:

import pandas as pd

df = pd.DataFrame(news_list)
df

	date	publication	headline	link
0	2023-07-11	Yahoo Finance	Asian Institutional Investors Look for Third P...	https://finance.yahoo.com/news/asian-instituti...
1	2023-07-11	finews.asia	PwC: Better Custody Key to Institutional Crypt...	https://www.finews.asia/finance/39663-pwc-cust...
2	2023-07-11	The Asset	Custody key to institutions picking digital as...	https://www.theasset.com/article/49566/custody...
3	2023-07-11	Cointelegraph	Report: China to tighten rules around releasin...	https://cointelegraph.com/news/china-to-requir...
4	2023-07-12	South China Morning Post	Hedge funds retreat from crypto after 2022 FTX...	https://www.scmp.com/tech/tech-trends/article/...
5	2023-07-12	Cointelegraph	Google Play Store officially allows NFT games,...	https://cointelegraph.com/news/google-play-sto...
6	2023-07-11	Cointelegraph	Crypto custody market reached $448 billion in ...	https://cointelegraph.com/news/crypto-report-t...
7	2023-07-12	新浪财经	普华永道：82%高管寻求第三方托管加密资产愿意整合元宇宙	https://finance.sina.com.cn/stock/hkstock/mark...
8	2023-07-11	英为财情	报告：数字资产行业已发展为价值1.2万亿美元市场	https://cn.investing.com/news/stock-market-new...
9	2023-07-12	股票	豪鹏科技: 世纪证券有限责任公司关于深圳市豪鹏科技股份有限公司向不特定对象发行可转换公司债券...	https://stock.stockstar.com/notice/SN202307130...

Finally, write the results to an Excel file:

# path to the output Excel file
excel_filepath = '../../out/news.xlsx'

df.to_excel(excel_filepath, index=False)
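
DataFrame.to_excel does not create missing directories, so the call above fails if the ../../out/ directory does not exist yet. A small sketch of a guard follows; ensure_parent_dir is a hypothetical helper. Writing .xlsx files with pandas also requires an Excel engine such as openpyxl to be installed.

```python
from pathlib import Path


def ensure_parent_dir(filepath):
    """Create the parent directory of filepath if needed and return it as a Path."""
    path = Path(filepath)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Usage with the data frame from above:
# df.to_excel(ensure_parent_dir(excel_filepath), index=False)
```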
