Measuring media coverage dynamics with MongoDB and Python

In a recent post, I outlined a process that aims to capture the dynamics of the media landscape by taking a snapshot of news headlines every 15 minutes. The database has been quietly growing for about a month now, receiving data scraped from 19 media organizations’ RSS feeds while the world goes about its business. Even though we only have a sliver of headlines history to work with (about 30 days), we’re ready for some analysis.

Let’s start by taking a look at an example document in the MongoDB database representing a snapshot of USA Today’s RSS feed on June 12, 2017:

{
  "_id": ObjectId("593e13134a5ac4327d88619c"),
  "datetime": datetime.datetime(2017, 6, 12, 4, 5, 32, 832000),
  "source": u"usa_today",
  "stories": [
    "'Dear Evan Hansen' wins six Tony Awards, including best musical",
    "Penguins have become NHL's newest dynasty",
    "First lady Melania Trump, son Barron officially move into the White House",
    "E3 2017: The 5 biggest reveals during the Xbox event",
    "Penguins repeat as Stanley Cup champions",
    "Homophobic slur aimed at U.S. goalkeeper Brad Guzan by Mexican fans at World Cup qualifier",
    "Ben Platt's reaction is the best GIF of the Tonys Awards",
    "Puerto Ricans parade in New York, back statehood",
    "Delta ends theater company sponsorship over Trump look-alike killing scene",
    "U.S. earns rare tie vs. Mexico in World Cup qualifier at Estadio Azteca"
  ]
}

Within the document, we see a list of stories (i.e. news headlines) alongside the source (USA Today) and the datetime when the headlines were observed. As noted above, these snapshots are created every 15 minutes with data from the RSS feeds of 19 media organizations (see the list of feeds here). At the time of writing, there are 29,925 of these snapshots in the database with headlines starting on June 12, 2017 (note: I lost all data between June 13 and July 5).
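That document count can be pulled straight from the collection if you want to check it yourself (a minimal sketch, assuming a local MongoDB instance with the news database and headlines collection described in the scraping post below):

from pymongo import MongoClient

client = MongoClient()
headlines = client.news.headlines

#each document is one snapshot of one source's RSS feed
n_snapshots = headlines.count()  #count_documents({}) in newer PyMongo
n_sources = len(headlines.distinct('source'))
print('{} snapshots from {} sources'.format(n_snapshots, n_sources))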

Querying Headlines with PyMongo and Pandas

In order to interact with the data, we construct a function, query_rss_stories(), that wraps up the process of connecting to MongoDB, querying with the aggregation framework and a regular expression, and organizing the results in a Pandas DataFrame:

from pymongo import MongoClient
import pandas as pd

def query_rss_stories(regex):

    #connect to the local db
    client = MongoClient()
    news = client.news
    headlines = news.headlines

    #build the aggregation pipeline: unwind each snapshot's stories,
    #match headlines against the regex, and keep only the fields we need
    pipeline = [
        {"$unwind":"$stories"},
        {"$match":{"stories":{'$regex':regex}}},
        {"$project":{"_id":0, "datetime":1, "stories":1, "source":1}}
    ]

    #unpack the cursor into a DataFrame
    cursor = headlines.aggregate(pipeline)
    df = pd.DataFrame([i for i in cursor])

    #round the snapshot datetimes to 15mins and index on them
    df.datetime = df.datetime.dt.round('15min')
    raw_stories = df.set_index(['datetime'])

    return raw_stories

As an example, we’ll search for stories related to China:

stories = query_rss_stories('China')
stories.sample(n=5)
                        source       stories
datetime
2017-07-20 14:00:00     breitbart    Report: U.S. Bans Prompt American Allies to Bu...
2017-07-14 05:45:00     upi          India rejects China's offer to mediate Kashmir...
2017-07-13 00:15:00     wsj_opinion  How to Squeeze China
2017-07-07 19:00:00     abc          US bombers fly over East and South China Seas
2017-07-12 19:15:00     reuters      China dissident Liu's condition critical, brea...

The stories dataframe contains all headlines related to China from all of the news sources, starting around July 6. For every hour that a headline is live on a particular RSS feed, four entries appear in the stories dataframe (since the snapshots are taken at 15-minute intervals).
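Keep in mind that each row is a snapshot observation, not a unique article. If you only want the distinct headlines, a quick sketch using the same dataframe might look like this:

#collapse repeated snapshot observations into distinct (source, headline) pairs
unique_stories = (
    stories.reset_index()
           .drop_duplicates(subset=['source', 'stories'])
           .sort_values('datetime')
)
print(len(stories), 'observations of', len(unique_stories), 'distinct headlines')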

Next, we unstack the data and count the number of China-related headlines found at each time step:

def create_rss_timeseries(df, freq='1h'):
    #unstack sources and create headline_count time series
    ts = df.assign(headline_count = 1)
    ts = ts.groupby(['datetime', 'source']).sum()
    ts = ts.unstack(level=-1)
    ts.columns = ts.columns.get_level_values(1)

    #resample to the requested time step (mean of the 15-minute snapshot counts)
    ts = ts.assign(datetime = ts.index)
    ts = ts.groupby(pd.Grouper(key='datetime', freq=freq)).mean()

    return ts

#daily time series of China headline counts
ts = create_rss_timeseries(stories, freq='1d')

To visualize, we create a stacked bar chart of China-related headline counts across all media organizations:

[Figure: China-Headlines-Count]

Here, we can see that each day between July 7 and July 23, 2017, between 7 and 17 China-related headlines were found from 18 media sources. Sort of interesting, but what about a more volatile topic?
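The chart itself isn't reproduced here, but a plot along these lines can be generated straight from the ts dataframe with pandas and matplotlib (a minimal sketch; the styling of the original figure is my assumption):

import matplotlib.pyplot as plt

#one bar per day, one color per news source, stacked
ax = ts.plot(kind='bar', stacked=True, figsize=(12, 6))
ax.set_xlabel('date')
ax.set_ylabel('China-related headline count')
ax.legend(title='source', ncol=3, fontsize='small')
plt.tight_layout()
plt.show()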

Let’s take a look at the volume of headlines related to “Trump Jr”. We’ll also increase the resolution that’s visualized so we can see how the media coverage evolves hour by hour:

jr_stories = query_rss_stories('Trump Jr')
jr_ts = create_rss_timeseries(jr_stories, freq='1h')

[Figure: Trump-Junior-Headlines-Count]

Now we’re seeing a major event play out.

Interpretations and Future Work

Recording periodic snapshots allows us to observe how the media’s interest in a particular topic changes over time. For example, a topic that’s discussed simultaneously in multiple articles by many media organizations is presumably especially important at that point in time. Similarly, an issue that is featured throughout the RSS feeds for many days is likely more significant than stories that come and go within a few hours. One can think of these dynamics like the intensity and duration of rainfall events: is there a drizzle or a deluge of articles, and for how long do the stories patter down?
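Both quantities fall out of the snapshot counts. Each row in a query result represents one 15-minute observation of a headline on a feed, so a rough per-source measure of intensity and duration for the Trump Jr example might look like this (a sketch; the arithmetic simply converts snapshot hits into headline-hours):

import pandas as pd

#each row in jr_stories is one 15-minute observation of a headline on a feed
grouped = jr_stories.groupby('source')['stories']
per_source = pd.DataFrame({
    'observations': grouped.size(),           #intensity proxy: total snapshot hits
    'distinct_headlines': grouped.nunique(),  #how many different articles ran
})
per_source['headline_hours'] = per_source['observations'] * 15 / 60.0

print(per_source.sort_values('headline_hours', ascending=False))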

I’ve only scratched the surface, but there is a lot more I’d like to investigate now that I have a handle on the data. Future Work:

  • measure the proportion of attention given to particular topics
  • quantify the differences in reporting between media organizations
  • compare media attention to objectively important and quantifiable topics:
    • casualties of war
    • bombings, shootings
    • environmental disasters
  • natural language analysis
    • differences in sentiment on particular topics
    • differences in sentiment between sources
  • identify any patterns between left- and right-leaning sources
  • host data on public server and/or data.world

Scraping headlines with cron jobs, Python, and MongoDB

I’ve been interested in measuring bias in media coverage for quite a while. The discourse before and after the 2016 election has forced a lot of us (definitely me) into an anxiety spiral, trying to keep up and maintain sanity under the weight of our 24-hour news culture. To make myself feel better, I recently set up a cron job on a server that pulls the RSS feeds of as many news sources as I could think of, and stores the headlines in a MongoDB database.

To set up the cron job, I edited my crontab on my Ubuntu server by typing crontab -e. In the crontab I added the following line:

*/15 * * * * /home/news_agg/scrape_news.py >/dev/null 2>&1

This tells the server to run the script located at /home/news_agg/scrape_news.py every 15 minutes (the script needs to be executable and have a shebang line, since cron invokes it directly). The >/dev/null 2>&1 part discards the script’s output so cron doesn’t mail or log a message each time it runs.

The news scraping Python script that gets executed every 15 minutes is fairly straightforward. First, feedparser (for parsing RSS feeds), PyMongo (a Python wrapper for working with MongoDB), and datetime are imported, and a dictionary of RSS feeds is constructed:

#!/usr/bin/env python
import feedparser
import datetime
import pymongo

#create a dictionary of rss feeds
feeds = dict(
    nyt = r'http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml',
    fox = r'http://feeds.foxnews.com/foxnews/most-popular',
    wsj_opinion = r'http://www.wsj.com/xml/rss/3_7041.xml',
    wsj_business = r'http://www.wsj.com/xml/rss/3_7014.xml',
    wsj_world = r'http://www.wsj.com/xml/rss/3_7085.xml',
    wapo_national = r'http://feeds.washingtonpost.com/rss/national',
    cnn = r'http://rss.cnn.com/rss/cnn_topstories.rss',
    cnn_us = r'http://rss.cnn.com/rss/cnn_us.rss',
    breitbart = r'http://feeds.feedburner.com/breitbart',
    cnbc = r'http://www.cnbc.com/id/100003114/device/rss/rss.html',
    abc = r'http://feeds.abcnews.com/abcnews/topstories',
    bbc = r'http://feeds.bbci.co.uk/news/rss.xml',
    wired = r'https://www.wired.com/feed/',
    upi = r'http://rss.upi.com/news/top_news.rss',
    reuters = r'http://feeds.reuters.com/reuters/topNews',
    usa_today = r'http://rssfeeds.usatoday.com/usatoday-NewsTopStories',
    ap = r'http://hosted2.ap.org/atom/APDEFAULT/3d281c11a96b4ad082fe88aa0db04305',
    npr = r'http://www.npr.org/rss/rss.php?id=1001',
    democracy_now = r'https://www.democracynow.org/democracynow.rss',
)

From there, we loop through each of the RSS urls, pull out the title and source of each article, and store everything in a temporary data list of dicts. That list then gets written to a MongoDB collection called headlines in a database called news:

#grab the current time
dt = datetime.datetime.utcnow()

data = []
for feed, url in feeds.items():

    rss_parsed = feedparser.parse(url)
    titles = [art['title'] for art in rss_parsed['items']]

    #create dict for each news source
    d = {
        'source':feed,
        'stories':titles,
        'datetime':dt
    }
    data.append(d)

# Access the 'headlines' collection in the 'news' database
client = pymongo.MongoClient()
collection = client.news.headlines

#insert the data
collection.insert_many(data)

With the crontab in action, this script runs every 15 minutes, which means we have a periodic snapshot of each news source’s RSS feed. This effectively creates a time series of news headlines at a 15-minute time step.

This script was kicked off on June 12, 2017 (about 30 days before the day of this post). Since then, I’ve only scratched the surface with analysis. I’ve also realized that my Mongo document structure is probably pretty awkward, but, hey, it works. As an example, I set up a query to count the number of times a particular topic was found in each news source’s RSS feed:

db.headlines.aggregate([
  {$unwind:"$stories"},
  {$match:{stories:/Yemen/}},
  {$group:{
    _id:"$source",
    stories:{$addToSet:"$stories"},
    count:{$sum:1}
  }},
]).pretty()

Here, I’ve queried the data for headlines including “Yemen”, which returned the following:

{
	"_id" : "nyt",
	"stories" : [
		"Cholera Spreads as War and Poverty Batter Yemen"
	],
	"count" : 87
}
{
	"_id" : "democracy_now",
	"stories" : [
		"Cholera Death Toll Tops 859 in War-Torn Yemen as U.S.-Backed Saudi Assault Continues"
	],
	"count" : 48
}
{
	"_id" : "fox",
	"stories" : [
		"Yemenis rally in support for secession of country's south",
		"Naval coalition steps up patrols around Yemen after attacks"
	],
	"count" : 2
}

This shows that, between June 12 and July 9, 2017, our script found stories about the war in Yemen in the RSS feeds of the New York Times, Democracy Now! and Fox News 87, 48, and 2 times, respectively. Since each RSS feed snapshot is taken at a 15-minute interval, this suggests Yemen coverage could be found on NYT’s feed for almost 22 hours, while lasting just 30 minutes on Fox’s feed.
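The same aggregation translates directly to PyMongo if you’d rather do the snapshots-to-hours arithmetic in Python (a sketch against the local news.headlines collection; the regex and the 15-minutes-per-snapshot conversion mirror the shell query above):

from pymongo import MongoClient

client = MongoClient()
headlines = client.news.headlines

#count Yemen-related snapshot hits per source, then convert hits to hours on the feed
pipeline = [
    {"$unwind": "$stories"},
    {"$match": {"stories": {"$regex": "Yemen"}}},
    {"$group": {"_id": "$source", "count": {"$sum": 1}}},
]

for doc in headlines.aggregate(pipeline):
    hours = doc["count"] * 15 / 60.0  #each hit represents one 15-minute snapshot
    print("{}: {} hits (~{:.1f} hours on the feed)".format(doc["_id"], doc["count"], hours))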

The same kind of query also suggests that, aggregated across the news sources, RSS coverage of Barron and Melania Trump moving into the White House lasted more than 50 hours.

Reading Shapefiles into Pandas Dataframes

I’ve just about had it up to here with ArcMap and arcpy. Today I begin my quest to free myself from ever needing to rely on ESRI for spatial analysis and mapping.

Geopandas seems great, but I have had a lot of trouble getting it installed and have therefore been hesitant to rely on it in any package I create. Instead, I’ve used the following snippet to read a shapefile into a Pandas dataframe for quick analysis. You will need the pyshp package and Pandas. If you don’t have these, install them via pip and you’re ready to go:

import shapefile #the pyshp module
import pandas as pd

#read file, parse out the records and shapes
shapefile_path = r'path/to/shapefile/'
sf = shapefile.Reader(shapefile_path)

#grab the shapefile's field names (omit the first pseudo field)
fields = [x[0] for x in sf.fields][1:]
records = sf.records()
shps = [s.points for s in sf.shapes()]

#write the records into a dataframe
shapefile_dataframe = pd.DataFrame(columns=fields, data=records)

#add the coordinate data to a column called "coords"
shapefile_dataframe = shapefile_dataframe.assign(coords=shps)

Now shapefile_dataframe has all of the input shapefile’s records and geometry data.
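From here, normal Pandas workflows apply. As a quick example, here’s a sketch that derives a bounding box for each feature from its coords list (pure Python, no extra dependencies; your attribute fields will obviously differ):

#compute a simple (min_x, min_y, max_x, max_y) bounding box for each feature
def bounding_box(points):
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

shapefile_dataframe['bbox'] = shapefile_dataframe['coords'].apply(bounding_box)
print(shapefile_dataframe[['bbox']].head())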
