obspy.clients.fdsn.mass_downloader

Mass Downloader for FDSN Compliant Web Services

This package contains functionality to query and integrate data from any number of FDSN web service providers simultaneously. The package aims to formulate download requests in a way that is convenient for seismologists without having to worry about political and technical data center issues. It can be used by itself or as a library component integrated into a bigger project.

copyright:: Lion Krischer (krischer@geophysik.uni-muenchen.de), 2014-2015
license:: GNU Lesser General Public License, Version 3 (https://www.gnu.org/copyleft/lesser.html)

Why Would You Want to Use This?

Directly using the FDSN web services for example via the obspy.clients.fdsn client is fine for small amounts of data but quickly becomes cumbersome for larger data sets. Many data centers do provide tools to easily download larger amounts of data but that is usually only from one data center. Now most seismologists don’t really care a lot where the data they download originates - they just want the data for their use case and oftentimes they want as much data as they can get. As the number of FDSN compliant web services increases this becomes more and more cumbersome. That is where this module comes in. You

specify the geographical region from which to download data,
define a number of other restrictions (temporal, data quality, …),
and launch the download.

The mass downloader module will acquire all waveforms and associated station information across all known FDSN web service implementations producing a clean data set ready for further use. It works by

figuring out what stations each provider offers,
downloading MiniSEED and associated StationXML meta information in an efficient and data center friendly manner, and
dealing with all the nasty real-world data issues like missing or incomplete data, duplicate data across data centers, e.g.
- Basic optional automatic quality control by assuring that the data has no-gaps/overlaps or is available for a certain percentage of the requested time span.
- It can relaunch download to acquire missing pieces which might happen for example if a data center has been offline.
- It can assure that there always is a corresponding StationXML file for the waveforms.

Usage Examples 

Before delving into the nitty-gritty details of how it works and why it does things in a certain way we’ll demonstrate the usage of this module on two annotated examples. They can serve as templates for your own needs.

Earthquake Data 

The classic seismological data set consists of waveform recordings for a certain earthquake. This example downloads all data it can find for the Tohoku-Oki Earthquake from 5 minutes before the earthquake centroid time to 1 hour after. It will furthermore only download data with an epicentral distance between 70.0 and 90.0 degrees and some additional restrictions. Be aware that this example will attempt to download data from all FDSN data centers that ObsPy knows of and combine it into one data set.

import obspy
from obspy.clients.fdsn.mass_downloader import CircularDomain, \
    Restrictions, MassDownloader

origin_time = obspy.UTCDateTime(2011, 3, 11, 5, 47, 32)

# Circular domain around the epicenter. This will download all data between
# 70 and 90 degrees distance from the epicenter. This module also offers
# rectangular and global domains. More complex domains can be defined by
# inheriting from the Domain class.
domain = CircularDomain(latitude=37.52, longitude=143.04,
                        minradius=70.0, maxradius=90.0)

restrictions = Restrictions(
    # Get data from 5 minutes before the event to one hour after the
    # event. This defines the temporal bounds of the waveform data.
    starttime=origin_time - 5 * 60,
    endtime=origin_time + 3600,
    # You might not want to deal with gaps in the data. If this setting is
    # True, any trace with a gap/overlap will be discarded.
    reject_channels_with_gaps=True,
    # And you might only want waveforms that have data for at least 95 % of
    # the requested time span. Any trace that is shorter than 95 % of the
    # desired total duration will be discarded.
    minimum_length=0.95,
    # No two stations should be closer than 10 km to each other. This is
    # useful to for example filter out stations that are part of different
    # networks but at the same physical station. Settings this option to
    # zero or None will disable that filtering.
    minimum_interstation_distance_in_m=10E3,
    # Only HH or BH channels. If a station has HH channels, those will be
    # downloaded, otherwise the BH. Nothing will be downloaded if it has
    # neither. You can add more/less patterns if you like.
    channel_priorities=["HH[ZNE]", "BH[ZNE]"],
    # Location codes are arbitrary and there is no rule as to which
    # location is best. Same logic as for the previous setting.
    location_priorities=["", "00", "10"])

# No specified providers will result in all known ones being queried.
mdl = MassDownloader()
# The data will be downloaded to the ``./waveforms/`` and ``./stations/``
# folders with automatically chosen file names.
mdl.download(domain, restrictions, mseed_storage="waveforms",
             stationxml_storage="stations")

Continuous Request 

Another use case requiring massive amounts of data are noise studies. Ambient seismic noise correlations require continuous recordings from stations over a large time span. This example downloads data, from within a certain geographical domain, for a whole year. Individual MiniSEED files will be split per day. The download helpers will attempt to optimize the queries to the data centers and split up the files again if required.

import obspy
from obspy.clients.fdsn.mass_downloader import RectangularDomain, \
    Restrictions, MassDownloader

# Rectangular domain containing parts of southern Germany.
domain = RectangularDomain(minlatitude=30, maxlatitude=50,
                           minlongitude=5, maxlongitude=35)

restrictions = Restrictions(
    # Get data for a whole year.
    starttime=obspy.UTCDateTime(2012, 1, 1),
    endtime=obspy.UTCDateTime(2013, 1, 1),
    # Chunk it to have one file per day.
    chunklength_in_sec=86400,
    # Considering the enormous amount of data associated with continuous
    # requests, you might want to limit the data based on SEED identifiers.
    # If the location code is specified, the location priority list is not
    # used; the same is true for the channel argument and priority list.
    network="BW", station="A*", location="", channel="EH*",
    # The typical use case for such a data set are noise correlations where
    # gaps are dealt with at a later stage.
    reject_channels_with_gaps=False,
    # Same is true with the minimum length. All data might be useful.
    minimum_length=0.0,
    # Guard against the same station having different names.
    minimum_interstation_distance_in_m=100.0)

# Restrict the number of providers if you know which serve the desired
# data. If in doubt just don't specify - then all providers will be
# queried.
mdl = MassDownloader(providers=["LMU", "GFZ"])
mdl.download(domain, restrictions, mseed_storage="waveforms",
             stationxml_storage="stations")

Usage 

Using the download helpers requires the definition of three separate things, all of which are detailed in the following paragraphs.

Data Selection: The data to be downloaded can be defined by enforcing geographical or temporal constraints and a couple of other options.
Storage Options: Choosing where the final MiniSEED and StationXML files should be stored.
Start the Download: Choose from which provider(s) to download and then launch the downloading process.

Step 1: Data Selection 

Data set selection serves the purpose to limit the data to be downloaded to data useful for the purpose at hand. It is handled by two objects: subclasses of the Domain object and the Restrictions class.

The domain module currently defines three different domain types used to limit the geographical extent of the queried data: RectangularDomain, CircularDomain, and GlobalDomain. Subclassing Domain enables the construction of arbitrarily complex domains. Please see the domain module for more details. Instances of these classes will later be passed to the function sparking the downloading process. A rectangular domain for example is defined like this:

>>> from obspy.clients.fdsn.mass_downloader.domain import RectangularDomain
>>> domain = RectangularDomain(minlatitude=-10, maxlatitude=10,
...                            minlongitude=-10, maxlongitude=10)

Additional restrictions like temporal bounds, SEED identifier wildcards, and other things are set with the help of the Restrictions class. Please refer to its documentation for a more detailed explanation of the parameters.

>>> from obspy import UTCDateTime
>>> from obspy.clients.fdsn.mass_downloader import Restrictions
>>> restrict = Restrictions(
...     starttime=UTCDateTime(2012, 1, 1),
...     endtime=UTCDateTime(2012, 1, 1, 1),
...     network=None, station=None, location=None, channel=None,
...     reject_channels_with_gaps=True,
...     minimum_length=0.9,
...     minimum_interstation_distance_in_m=1000,
...     channel_priorities=["HH[ZNE]", "BH[ZNE]"],
...     location_priorities=["", "00", "01"])

Step 2: Storage Options 

After determining what to download, the helpers must know where to store the requested data. That requires some flexibility in case the mass downloader is to be integrated as a component into a bigger system. An example is a toolbox that has a database to manage its data.

A major concern is to not download pre-existing data. In order to enable such a use case the download helpers can be given functions that are evaluated when determining the file names of the requested data. Depending on the return value, the helper class will download the whole, part, or even none, of that particular piece of data.

Storing MiniSEED waveforms

The MiniSEED storage rules are set by the mseed_storage argument of the download() method of the MassDownloader class

Option 1: Folder Name

In the simplest case it is just a folder name:

>>> mseed_storage = "waveforms"

This will cause all MiniSEED files to be stored as

waveforms/NETWORK.STATION.LOCATION.CHANNEL__STARTTIME__ENDTIME.mseed.

An example of this is

waveforms/BW.FURT..BHZ__20141027T163723Z__20141027T163733Z.mseed

which is rather general but also quite long.

Option 2: String Template

For more control use the second possibility and provide a string containing {network}, {station}, {location}, {channel}, {starttime}, and {endtime} format specifiers. These values will be interpolated to acquire the final filename. The start and end times will be formatted with strftime() with the specifier "%Y%m%dT%H%M%SZ" in an effort to avoid colons which are troublesome in file names on many systems.

>>> mseed_storage = ("some_folder/{network}/{station}/"
...                  "{channel}.{location}.{starttime}.{endtime}.mseed")

results in

some_folder/BW/FURT/BHZ..20141027T163723Z.20141027T163733Z.mseed.

The download helpers will create any non-existing folders along the path.

Option 3: Custom Function

The most complex but also most powerful possibility is to use a function which will be evaluated to determine the filename. If the function returns True , the MiniSEED file is assumed to already be available and will not be downloaded again; keep in mind that in that case no station data will be downloaded for that channel. If it returns a string, the MiniSEED file will be saved to that path. Utilize closures to use any other parameters in the function. This hypothetical function checks if the file is already in a database and otherwise returns a string which will be interpreted as a filename.

>>> def get_mseed_storage(network, station, location, channel, starttime,
...                       endtime):
...     # Returning True means that neither the data nor the StationXML file
...     # will be downloaded.
...     if is_in_db(network, station, location, channel, starttime, endtime):
...         return True
...     # If a string is returned the file will be saved in that location.
...     return os.path.join(ROOT, "%s.%s.%s.%s.mseed" % (network, station,
...                                                      location, channel))
>>> mseed_storage = get_mseed_storage

Note

No matter which approach is chosen, if a file already exists, it will not be overwritten; it will be parsed and the download helper class will attempt to download matching StationXML files.

Storing StationXML files

The same logic applies to the StationXML files. This time the rules are set by the stationxml_storage argument of the download() method of the MassDownloader class. StationXML files will be downloaded on a per-station basis thus all channels and locations from one station will end up in the same StationXML file.

Option 1: Folder Name

A simple string will be interpreted as a folder name. This example will save the files to "stations/NETWORK.STATION.xml", e.g. to "stations/BW.FURT.xml".

>>> stationxml_storage = "stations"

Option 2: String Template

Another option is to provide a string formatting template, e.g.

>>> stationxml_storage = "some_folder/{network}/{station}.xml"

will write to "some_folder/NETWORK/STATION.xml", in this case for example to "some_folder/BW/FURT.xml".

Note

If the StationXML file already exists, it will be opened to see what is in the file. In case it does not contain all necessary channels, it will be deleted and only those channels needed in the current run will be downloaded again. Pass a custom function to the stationxml_path argument if you require different behavior as documented in the following section.

Option 3: Custom Function

As with the waveform data, the StationXML paths can also be set with the help of a function. The function in this case is a bit more complex than for the waveform case. It has to return a dictionary with three keys: "available_channels", "missing_channels", and "filename". "available_channels" is a list of channels that are already available as station information and that require no new download. Make sure to include all already available channels; this information is later used to discard MiniSEED files that have no corresponding station information. "missing_channels" is a list of channels for that particular station that must be downloaded and "filename" determines where to save these. Please note that in this particular case the StationXML file will be overwritten if it already exists and only the "missing_channels" will be downloaded to it, independent of what already exists in the file.

Alternatively the function can also return a string and the behaviour is the same as two first options for the stationxml_storage argument.

The next example illustrates a complex use case where the availability of each channel’s station information is queried in some database and only those channels that do not exist yet will be downloaded. Use closures to pass more arguments to the function.

>>> def get_stationxml_storage(network, station, channels, starttime, endtime):
...     available_channels = []
...     missing_channels = []
...     for location, channel in channels:
...         if is_in_db(network, station, location, channel, starttime,
...                     endtime):
...             available_channels.append((location, channel))
...         else:
...             missing_channels.append((location, channel))
...     filename = os.path.join(ROOT, "%s.%s.xml" % (network, station))
...     return {
...         "available_channels": available_channels,
...         "missing_channels": missing_channels,
...         "filename": filename}
>>> stationxml_storage = get_stationxml_storage

Step 3: Start the Download 

The final step is to actually start the download. Pass the previously created domain, restrictions, and path settings and off you go. Two more parameters of interest are the chunk_size_in_mb setting which controls how much data is requested per thread, client and request. threads_per_clients control how many threads are used to download data in parallel per data center - 3 is a value in agreement with some data centers.

>>> mdl = MassDownloader()  
>>> mdl.download(domain, restrictions, chunk_size_in_mb=50,
...              threads_per_client=3, mseed_storage=mseed_storage,
...              stationxml_storage=stationxml_storage)  

How it Works 

At a high level the mass downloader works by looping over each FDSN web service and downloading whatever it offers. A bit more detail:

Loop over all passed or known FDSN web service implementations and auto-discover if they are available and what they can do. If an implementation has a dataselect and a station service it will be part of the following steps. Otherwise it will be discarded.
For each web service client:
1. Request the availability for the given time and domain settings. It will request a text file from the station service at the channel level. If the service supports the matchtimeseries parameter it will be used and the availability is considered to be “reliable” for the further stages.
2. Channel and location priorities are applied resulting in a single instrument per station.
3. Any already existing network + station combinations are discarded.
4. If the availability for the particular client is considered reliable it will perform the minimum distance filtering now. If no stations have already been downloaded it will select the largest subset of stations satisfying the minimum interstation distance constraint. Otherwise it will successively add new stations with the largest distance to the closest already existing station until no more stations satisfying the minimum distance remain. This results in the maximum possible amount of chosen stations satisfying the constraints.
5. Download the MiniSEED data - this is threaded and it will use a bulk request honoring the desired chunk_size_in_mb setting. Afterwards it splits the MiniSEED files again to match the desired restrictions. The split happens at the record level thus no information available in the original MiniSEED records is lost.
6. Any MiniSEED files not fulfilling the minimum length or no/gap overlap restrictions will be deleted. Faulty MiniSEED files as well.
7. For each downloaded MiniSEED file: Download the corresponding StationXML file at the response level.
8. If the sanitize argument of the Restrictions object is True, delete all MiniSEED files for which no station information could be downloaded. This is a useful setting if you want a clean data set.
1. If the availability information is not reliable, perform the minimum interstation distance filtering now. This is a bit unfortunate but many client do return pretty terrible availability information (or interpret the station service differently) so there is no way around that for now.
2. Rinse and repeat for all remaining FDSN web service implementations.

Logging 

The download helpers utilizes Python’s logging facilities. By default it will log to stdout at the logging.INFO level which provides a fair amount of detail. If you want to change the log level or setup a different stream handler, just get the corresponding logger after you import the download helpers module:

>>> import logging
>>> logger = logging.getLogger("obspy.clients.fdsn.mass_downloader")
>>> logger.setLevel(logging.DEBUG)  

Authentication 

To make the mass downloader work for restricted data, just initialize it with existing Client instances that have credentials. Note that you can mix already initialized clients with varying credientials and just passing the name of the FDSN services to query.

>>> from obspy.clients.fdsn import Client
>>> client_orfeus = Client("ORFEUS", user="random", password="some_pw")
>>> client_eth = Client("ETH", user="from_me", password="to_you")
>>> mdl = MassDownloader(providers=[client_orfeus, "EARTHSCOPE", client_eth])     

Further Documentation 

Further functionality of this module is documented at a couple of other places: