DatCat℠ Internet Measurement Data Catalog


Help Topics:

Contributing Metadata

News

2007-09-20: New imdc-submission-tools version 20070920.0 is significantly changed since version 0.25:
  • Added the "data-to-yaml" tool to automatically analyze data files and generate metadata (in YAML form).
  • Added the "subcat" tool to merge YAML files generated by data-to-yaml with manually written YAML files into an XML submission file.

Who can contribute?

At this early stage of DatCat deployment, contributions are by invitation only. Eventually, we plan to make contribution open to anyone with a login account.

What should I contribute?

Contribute metadata about any network measurement data you have that you are potentially willing to make available to network researchers. For specific recommendations on what type of metadata to include, refer to CAIDA's web page on How to Document a Data Collection.

But I don't want to give unlimited access to my data.

IMDC does not store the data itself, only metadata to help researchers identify and find the data. Storage of the data is up to you, leaving you in complete control of access to the data.

How do I contribute?

Before you can begin, you must download imdc-submission-tools-20080324.0.tar.gz for Unix-like systems, including Mac OS X (there are currently no tools for Windows platforms, but we're working on it). There are two steps in contributing metadata: generating a submission, and submitting it.

The easiest way to generate a submission is to use data-to-yaml to automatically analyze your data, and then use subcat to merge those results with other information you provide. You can also generate a submission with the Submission.pm perl API, which is useful for automating recurring submissions or for extracting metadata from a source other than the raw data files.

Using the submission page is the easiest way to submit the submission file. Or, if you need more automation, you can submit with the imdcpost command-line tool.

Submissions are made in "staging areas", which allow you to insert, edit, view, and delete objects in the catalog before making them visible to other users. You may have multiple staging areas, but objects in one staging area cannot reference objects in other staging areas. When you are satisfied with your staged objects, you can Validate them and then Request Activation. A human reviewer will review your staging area and either Activate it, making it visible to other users, or work with you to fix any problems. Viewing, validating, deleting, and requesting activation can all be done from the submission page. Once activated, objects can still be edited via the web interface or by making an XML submission, but they can never be deleted.

If you are contributing data on behalf of some organization, consider using a role account instead of a personal account to make your submission.

Guidelines on writing submissions

To help people find your data (and that is why you're contributing, after all) try to fill in as much information as possible. But remember that it is often better to contribute incomplete information than nothing at all, so don't let incomplete information stop you from contributing; you can always come back later and edit your contributions to add more information.

Fields may contain text in any language, provided that your browser uses the correct character encoding. Valid encodings include "UTF-8" and "ISO-8859-1". But remember that the primary language of DatCat is English, and some users may be unable to view non-western text, so you should try to avoid using non-western text in key fields (like "Name"), and provide western alternatives for any non-western text you use in other fields.

Before continuing, be sure you are familiar with the Object Type documentation.

Common fields
Name (required) (maximum length 128 characters)
Objects should be given names that are descriptive, but brief. Although IMDC does not require names of most objects to be unique, you should strive to pick names that are unique to avoid confusion. If an object's name is not unique, IMDC will append a unique number whenever it displays that name. Names on most objects are limited to 128 characters.
creators
A list of contacts who created the real-world item described by this object, which does not necessarily include the contact who is contributing it to IMDC. If appropriate contact objects do not exist, you can create contacts without logins to use as creators.
Primary contact
The preferred contact for answering inquiries about the real-world item described by this object, particularly if that contact differs from the creator(s). For example, the primary contact could be a role contact (with a role email address) corresponding to a team, and the creators could be person contacts corresponding to members of the team.
Short description (required) (maximum length 128 characters)
The short description is the only free-form text that appears in search results, so try to write text that describes the most important features of the object, gives a broad categorization of the object, and distinguishes it from similar objects. Because the short description is displayed in a column in search results, you may want to keep it brief by omitting information that is included in the object's name or visible elsewhere on the search result page (e.g., a Data object's format and date).
Description (allows markup) (maximum length 4000 characters)
The description field can be up to 4000 characters, so be as descriptive as you can. The description appears only on an object's detail page. Please write something here, even if you use Description URL to point at a description on another web page.
Description URL (maximum length 1018 characters)
This should be the URL of a web page that describes the object, if there is one. You might want to use this if you already have an existing web page, or if many objects share a large identical description.
Keywords (maximum length 500 characters)
A comma-separated list of words or phrases that users are likely to use when trying to find your object. For data and package, it is not necessary to include the file format, since that can be searched separately. When appropriate, try to use keywords that are already being used, for ease of searching. For a list of keywords currently in use on Collection objects, see the Keyword section of the Browse page. But also feel free to make up your own new keywords for ideas not covered by existing keywords.
Citation text (maximum length 2000 characters)
Citations displayed on detail pages are normally generated automatically from other fields. But you can use this Citation Text field to alter the automatically generated citation, by writing a partial BibTeX entry, without a citation key, containing only the BibTeX fields you want to override. The entry type (e.g. @MISC) and brackets may be omitted if you do not want to override the automatic type. To suppress an automatically generated BibTeX field, include a corresponding field with an empty value. There is one special case: for all object types except Publication, the automatic note field contains the object's DatCat URL, and cannot be overridden; if you specify a note, it will be appended to the end of the automatic note. For example, setting citation text to
title = {My Alternate Title}, abstract = "", editor = "John Doe"
will replace the automatically generated title field, suppress the automatically generated abstract field, and add a new editor field.
Private ID (maximum length 40 characters)
If you are contributing information to IMDC that already exists in a separate database, you can use the private_id field of IMDC objects to store an identifier from that other database in order to record that relationship. IMDC will warn you if you attempt to insert objects with duplicate Private IDs. IMDC lets you search for your objects by private ID and get mappings of your private IDs to IMDC Handles. An object's Private ID is normally visible only to the object's contributor and IMDC administrators, although it may be transmitted unencrypted.
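
The override rules described under Citation text can be illustrated with a small Python sketch. It is not IMDC's implementation: the function name, the dictionary representation of BibTeX fields, the placeholder URL, and the single space used to join an appended note are all assumptions made for illustration.

```python
def merge_citation(auto_fields, overrides, is_publication=False):
    """Apply Citation text overrides to automatically generated BibTeX fields.

    Both arguments are plain dicts mapping BibTeX field names to values
    (a hypothetical representation, not an IMDC data structure).
    """
    merged = dict(auto_fields)
    for field, value in overrides.items():
        if field == "note" and not is_publication and "note" in auto_fields:
            # The automatic note (the DatCat URL) cannot be replaced;
            # a user-supplied note is appended to it instead.
            # Joining with a single space is an assumption.
            if value:
                merged["note"] = auto_fields["note"] + " " + value
        elif value == "":
            # An empty value suppresses the automatic field.
            merged.pop(field, None)
        else:
            # Any other value replaces an automatic field or adds a new one.
            merged[field] = value
    return merged


# Placeholder values; the URL is not a real DatCat URL.
auto = {
    "title": "Automatic Title",
    "abstract": "Automatic abstract",
    "note": "http://example.org/datcat-object",
}
overrides = {"title": "My Alternate Title", "abstract": "", "editor": "John Doe"}
result = merge_citation(auto, overrides)
```

With the override from the example above, `result` keeps the automatic note, replaces the title, drops the abstract, and gains an editor field.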


Data
Data objects, along with their corresponding Packages and Locations, are by far the most common types of objects to contribute. Each Data object must belong to at least one Package, and should belong to at least one Collection.
Name (required) (maximum length 128 characters)
A good name for a data object may include some form of the type of data, date, location, and file format. If the filename of the actual data file is descriptive enough, that is usually a good choice for the IMDC object's name. For example, "dump.out" is a poor name; "campus-2005-10-30.dag" is better. Remember that for data, the file to use is the most natural working form of the data, which is usually not compressed or archived.
Description (allows markup) (maximum length 4000 characters)
There is no need to repeat anything here that is already described in the corresponding Data Format, although in that case it would be wise to explicitly tell readers that they can find more information on the format's detail page.
Format (required)
The format of the data file. Many common formats are already defined in DatCat, but if there is no definition for the one you need, you can contribute your own Data Format object.
Grouptags (maximum length 500 characters)
A comma-separated list of words or phrases shared by a related group of Data objects. Operationally, group tags work just like keywords, but conceptually they are more like Collections in that they identify a set of related objects. Unlike Collections, however, there is no "Group" object with a description or other fields; membership in a group is determined solely by the existence of a group tag on the object. Use a group tag instead of a Collection when the set of objects is one that users would generally not want to search for directly. Group tags are useful when a user has found one Data object, and wants to find other closely related objects, but where the group doesn't need a description and doesn't need to appear in Collection searches. For large data sets, it is often desirable to define a Collection for the entire set, and then define group tags for closely-related subsets. For example, a set of traceroutes taken daily for several months from 10 hosts might all belong to a single Collection, with a set of groups for each set of traces taken on the same day, and a set of groups for each set of traces taken from the same host; the group tags would be the name of the collection plus the date or host name.
Geographic location (maximum length 4000 characters)
Network location (maximum length 4000 characters)
Logistic location (maximum length 4000 characters)
Fill in whatever is relevant to the data. For example, the network or logistic location is important to a packet trace or active probe, but a Route-Views snapshot of the global routing table doesn't really have a location. To maximize the usefulness of geographic location for searching, be complete; e.g. enter "San Diego, California, US" instead of just "San Diego". Line breaks in these fields are preserved in display.
Platform (maximum length 4000 characters)
The hardware, software version, and OS used to collect the data, if it's relevant to the data collection. Line breaks in this field are preserved in display.
Time zone (maximum length 40 characters)
The time zone of data may be useful for interpreting diurnal patterns. In a future version of IMDC, users will have the option of displaying a Data object's timestamps in that object's time zone. See also: List of time zones.
Creation process (allows markup) (maximum length 4000 characters)
Describe the procedure used to collect this particular data file. Include all configuration parameters, and quote command lines verbatim if possible. General information common to all data objects with the same format should usually be relegated to the Data Format's description.
Citation collection
If you would prefer users to refer to a collection rather than directly to this data object when making a citation, specify that collection here.
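
The group tag naming scheme from the traceroute example under Grouptags (collection name plus the date or host name) can be sketched in Python as follows. The hyphen separator, the function names, and the sample collection and host names are assumptions; only the 500-character limit and the comma-separated form come from the field definition.

```python
def traceroute_group_tags(collection_name, date, host):
    """Return the two group tags for one trace: one per day, one per host."""
    # Joining with a hyphen is an assumption; the guideline only says
    # "the name of the collection plus the date or host name".
    return [f"{collection_name}-{date}", f"{collection_name}-{host}"]


def grouptags_field(tags, max_length=500):
    """Build the comma-separated Grouptags value, enforcing the length limit."""
    field = ",".join(tags)
    if len(field) > max_length:
        raise ValueError(f"Grouptags value exceeds {max_length} characters")
    return field


# Hypothetical collection and host names.
tags = traceroute_group_tags("example-traceroutes", "2005-12-01", "host03")
field = grouptags_field(tags)
```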


Collection
A Collection is a set of Data objects with a common purpose or use, often collected as part of a single effort. If you want to make a Collection X, and there exists a Collection Y that contains a subset of the Data you want to include in Collection X, then instead of repeating Y's contents in X, you should consider putting Y itself inside X. On the other hand, if your set of Data objects is already contained by a larger Collection and the set does not have a compelling purpose of its own, you might consider putting a common group tag on the set of Data objects instead of making them a Collection. Normally a Collection should contain multiple items, although during the contribution process a Collection is allowed to be empty temporarily. Empty collections are never displayed on the Browse page.
Name (required) (maximum length 128 characters)
A good name for a collection object usually includes some indication of the purpose of the collection.
Motivation (maximum length 400 characters)
Describe the reason you thought it was useful to gather data objects together into this single Collection. This field is displayed only on the Collection's detail page.
Summary (required) (maximum length 800 characters)
The summary should describe the most important features of the Collection, to help users decide if the collection is potentially interesting to them. This may include descriptions of any or all of its contents, purpose, creators, location, timing, or how it was created. The summary will appear alone (without the description) in the "more information" view of Collection search results and may appear on the Browse page. Technical details should be put in the description, not the summary.
Description (allows markup) (maximum length 4000 characters)
The description is displayed after the summary on a collection's detail page, so should not repeat the summary verbatim, but should expand upon it. If you feel the need to include statistics or other dated information for a collection that may grow in the future, be sure to mention the date on which that information was correct.
Data start time
The earliest time represented by data in the collection.
Data duration
Use 0 seconds to indicate that the collected Data objects represent a single instant; use the value ongoing to indicate that addition of new Data objects to the Collection is ongoing.


Publication
A Publication in IMDC describes a scholarly paper, article, or other publication, but more importantly for IMDC, it organizes the Data used by the publication. Thus, it can be considered a kind of specialized Collection. Like a Collection, a Publication may contain other Collection or Publication objects in order to incorporate their contents.
Title (required) (maximum length 128 characters)
The full title of the paper, including correct capitalization.
Publication date (required) (maximum length 10 characters)
The date of publication, in "YYYY", "YYYY-MM", or "YYYY-MM-DD" format.
Venue (required) (maximum length 128 characters)
This should be the full name of the venue, but should usually not contain the date.
Summary (required) (maximum length 800 characters)
The summary should describe the Publication in a couple of sentences. The summary will appear in several places where the abstract will not: the "more information" view of Publication search results, and possibly on the Browse page.
Abstract (allows markup) (maximum length 4000 characters)
The abstract is displayed after the summary on a publication's detail page, so it should not repeat the summary verbatim, but should expand upon it. If you include statistics or other dated information, be sure to mention the date on which that information was correct.
Data start time
The earliest time represented by data used by the publication.
Data duration
Use 0 seconds to indicate that the used Data objects represent a single instant. Unlike a Collection, a Publication's data duration should not be marked as ongoing, since a publication could not have used data that did not exist at the time of publication.
Citation text (maximum length 2000 characters)
Citations for publications follow the same rules as citations for other object types, with the exception that the note field is not set automatically, and the DatCat URL will appear in a field named datcat_url. The following BibTeX fields will be automatically generated from information in the catalog: title, author, year, month, datcat_url, and abstract. You must use citation_text to manually set any other information you wish to appear, including the BibTeX entry type ("ARTICLE", "TECHREPORT", "INPROCEEDINGS", etc.). For example,
    @INPROCEEDINGS{
	booktitle = "Annals of Internet Research",
	editor = "Edward I. Tor",
	volume = "42",
	pages = "107-116",
	publisher = "Megadodo Publications"
    }
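
The three accepted Publication date forms ("YYYY", "YYYY-MM", "YYYY-MM-DD") can be checked before submission with a short Python sketch such as the one below. The function name is hypothetical, and the sketch assumes zero-padded months and days; it checks only the shape of the string, not whether the day exists in the given month.

```python
import re

# Year, optionally followed by a zero-padded month, optionally followed
# by a zero-padded day. Calendar validity (e.g. February 30) is not checked.
_PUBLICATION_DATE = re.compile(
    r"\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?"
)


def is_valid_publication_date(text):
    """Return True if text looks like YYYY, YYYY-MM, or YYYY-MM-DD."""
    return _PUBLICATION_DATE.fullmatch(text) is not None
```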
    
Package
Each Package must contain at least one Data object. Also, each Package must have at least one corresponding Location or be nested inside another Package that has a Location.
Name (required) (maximum length 128 characters)
A good name for a package object may include some form of the type of data, date, location, and file format. If the filename of the actual package file is descriptive enough, that is usually a good choice for the IMDC object's name. For example, "dump.out" is a poor name; "campus-2005-10-30.dag" is better. If you are making the raw uncompressed data file available for download, then the Data and Package refer to the same file, and thus may reasonably have the same name.
Description (allows markup) (maximum length 4000 characters)
There is no need to repeat anything here that is already described in the corresponding Package Format, although in that case it would be wise to explicitly tell readers that they can find more information on the format's detail page.
Format (required)
The format of the package file. This is usually an archive or compression format. Many common formats are already defined in DatCat, but if there is no definition for the one you need, you can contribute your own Package Format object.
contents
Package contents are files corresponding to data or other packages. If a package contains only a single data file, and is not compressed, its format should be "not packaged". A package may contain nested packages if the nested package requires a second tool or step to unpack it; for example, an ISO image package file containing a zip package file. On the other hand, a .tar.gz package file can be unpacked with a single tar command, so it is best represented as a single tar-gzip-format package, instead of a gzip-format package containing a tar-format package.
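
The requirement that each Package either has a Location or is nested inside a Package that has one can be checked with a sketch like the following. The data representation (a dict of enclosing packages and a set of packages with Locations) is hypothetical, and the sketch assumes that nesting applies transitively and that each package has at most one enclosing package.

```python
def has_reachable_location(package, enclosing, with_location):
    """Return True if package, or any package it is nested in, has a Location.

    enclosing: dict mapping a package name to its enclosing package name,
               or None for a top-level package (hypothetical representation).
    with_location: set of package names that have their own Location.
    """
    seen = set()
    current = package
    # Walk outward through enclosing packages; 'seen' guards against cycles.
    while current is not None and current not in seen:
        if current in with_location:
            return True
        seen.add(current)
        current = enclosing.get(current)
    return False


# The ISO-containing-zip example: only the outer ISO image has a Location.
enclosing = {"inner.zip": "disc.iso", "disc.iso": None}
ok = has_reachable_location("inner.zip", enclosing, {"disc.iso"})
```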


Data Format and Package Format
file suffixes
Give a list of commonly used file suffixes for files with this format. Remember to include the leading period (if any).

If you would like to add a new field to all objects that use your format, you can achieve this effect by defining a new Annotation Key that is specific to your format. For example, if files that use your IceCream format can have different flavors, you might want to define a format-specific flavor Annotation Key so that anyone who contributes an IceCream Data object can specify its flavor.

Contact
Full name (required) (length 3 to 256 characters)
Contact names should be the full name of the person or organization.


Location
Download URL (maximum length 1018 characters)
This should be a URL that links directly to a copy of the package file. If there is no such direct link, give instructions in download procedure instead. If there is a direct URL, but it is password-protected or otherwise restricted, you can specify it here, but be sure to set availability to restricted and give further instructions in download procedure.
Availability (required)
If the package's availability is restricted, you must give instructions in download procedure on how to obtain permission.
Download procedure (allows markup) (maximum length 1018 characters)
Give instructions on obtaining the package at this location. This is required unless there is a download URL and availability is free. If your instructions include a URL, remember to use markup.
Geographic location (maximum length 4000 characters)
Logistic location (maximum length 4000 characters)
These can help users predict what kind of performance to expect from the server. To maximize the usefulness of geographic location, be complete; e.g. enter "San Diego, California, US" instead of just "San Diego". Line breaks in these fields are preserved in display.
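
The interaction between download URL, availability, and download procedure can be summarized in a small Python sketch. The function names and the use of empty strings for missing fields are assumptions; the availability values "free" and "restricted" and the 1018-character limits are taken from the field descriptions above.

```python
def download_procedure_required(download_url, availability):
    """A procedure is optional only when there is a URL and availability is free."""
    return not (bool(download_url) and availability == "free")


def check_location(download_url, availability, download_procedure):
    """Return a list of problems found in a Location's download fields.

    Missing fields are passed as empty strings (an assumption of this sketch).
    """
    problems = []
    if (download_procedure_required(download_url, availability)
            and not download_procedure.strip()):
        problems.append("download procedure is required")
    if len(download_url) > 1018:
        problems.append("download URL exceeds 1018 characters")
    if len(download_procedure) > 1018:
        problems.append("download procedure exceeds 1018 characters")
    return problems
```

For instance, a restricted Location with a direct URL but no instructions is reported as missing its download procedure, while a free Location with a direct URL passes with the procedure left empty.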


Annotation Key
Name (required) (maximum length 128 characters)
Annotation Key names must be unique within the namespace defined by the definer, object type, and format. Additionally, the following conventions are not enforced, but are recommended to help maintain consistency and readability:
  • Keys whose names begin with a period-delimited prefix belong to a group of related keys.
  • In particular, keys whose names begin with cfg. describe configuration parameters.
  • Use _ to separate words in key names.
  • Use a min_ or max_ prefix for the minimum or maximum value of a parameter or statistic.
  • Keys whose names end with _count describe counters. Avoid num_, no_, and n_.
  • Avoid abbreviations, except for those that are very well known.
  • Browse the standard annotation keys to see other conventions.
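
Some of the naming conventions above lend themselves to an automated check. The Python sketch below is a heuristic linter, not part of IMDC: the function name and warning texts are invented, and treating spaces, hyphens, and camelCase as signs of a missing underscore is an assumption about how to detect badly separated words.

```python
import re


def lint_key_name(name):
    """Return warnings for an Annotation Key name, based on the conventions."""
    warnings = []
    if len(name) > 128:
        warnings.append("name exceeds 128 characters")
    # Conventions about word separators apply to the whole name.
    if re.search(r"[\s-]", name) or re.search(r"[a-z][A-Z]", name):
        warnings.append("use _ to separate words")
    # Counter prefixes are checked on the part after any period-delimited prefix.
    base = name.rsplit(".", 1)[-1]
    if re.match(r"(num|no|n)_", base):
        warnings.append("use a _count suffix instead of num_, no_, or n_")
    return warnings
```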
Standard
If the key is specific to a format that you contributed, you can set Standard to yes to declare that the key is an extension of the format. When the key is displayed, it will appear with the format as its Definer. Otherwise, the contributor (you) will be its Definer.
Short description (required) (maximum length 128 characters)
Keep in mind that a key's short description will appear in annotation key search, and (in many browsers) as a tooltip when a user hovers over a key's name on an object's detail page.
Description (allows markup) (maximum length 4000 characters)
Give a full definition of the key. If the key's type is numeric, be sure to define the unit of measurement here. If the key allows a limited set of values, enumerate or define that set here.


Markup

Some large text fields allow the contributor to include markup in the text. There are two choices for type of markup:
none
Line breaks are preserved, and every other sequence of whitespace is collapsed to a single space.
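
As an illustration of the whitespace rule for the "none" markup type, the following Python sketch keeps line breaks and collapses every other run of whitespace to one space. It is only an approximation of the described behavior: the function name is hypothetical, and whether IMDC also trims leading or trailing whitespace is not stated, so none is trimmed here.

```python
import re


def collapse_whitespace(text):
    """Keep line breaks; collapse all other whitespace runs to a single space."""
    # Normalize Windows and old Mac line endings so only "\n" marks a line break.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # [^\S\n] matches any whitespace character except a newline.
    return re.sub(r"[^\S\n]+", " ", text)
```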

HTML
Not true HTML, but a variation that allows a subset of standard HTML tags and attributes, plus the following additional tags:
imdcnum
Formats a number with thin spaces between sets of three digits for improved readability (in browsers with standards-compliant CSS support; in others, imdcnum has no effect). This is preferred over the use of commas (which look like decimal marks in many European languages) or periods (which look like decimal marks in English). For example, <imdcnum>87654321.012345</imdcnum> will be rendered with a thin space between each group of three digits instead of as the unbroken 87654321.012345.
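
To show the kind of visual result imdcnum aims for, here is a Python sketch that inserts a separator between groups of three digits. IMDC itself does this through CSS in the browser, so this is only an illustration; the function name is hypothetical, and grouping the fractional digits from the decimal point outward is an assumption.

```python
THIN_SPACE = "\u2009"  # Unicode THIN SPACE


def group_digits(number_text, sep=THIN_SPACE):
    """Insert sep between groups of three digits in an unsigned decimal string."""
    integer, dot, fraction = number_text.partition(".")
    # Integer part: groups of three counted from the right.
    head = len(integer) % 3
    groups = [integer[:head]] if head else []
    groups += [integer[i:i + 3] for i in range(head, len(integer), 3)]
    # Fractional part: groups of three counted from the decimal point (assumed).
    frac_groups = [fraction[i:i + 3] for i in range(0, len(fraction), 3)]
    return sep.join(groups) + dot + sep.join(frac_groups)


grouped = group_digits("87654321.012345")
```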

imdcref
Creates a link to the detail page of an IMDC object. Exactly one of the self, handle, or xid attributes must be specified. Whenever IMDC generates a page containing an <imdcref>, it looks up the referenced object's current name and uses that name as the link text.

For example, if an object has a marked-up field containing

<imdcref handle="/contact/1-0002-H">

then the generated detail page will contain HTML that renders as Ken Keys and links to the detail page for /contact/1-0002-H.

Another example: to contribute a new object that mentions its own handle in a marked-up field, despite the fact that you don't know what its handle will be yet, you could write something like

My handle is <imdcref self asHandle>

End tag: none

Attributes:

self
specifies that the link should refer to the object to which this field belongs.
handle=handle
specifies the IMDC Handle of the IMDC object to which the link will refer.
xid=xid
specifies the xid of the IMDC object to which the link will refer. The xid attribute is valid only in XML submission files, and must refer to an xid defined before the end of the current transaction.
asHandle
display the object's full Handle instead of the object's name.


imdcsearch
Creates a link to a search result page. The objtype attribute must be specified, along with one or more other attributes to specify the search criteria. String search criteria allow * and ? wildcards; numeric search criteria require equality in order to match.

For example, if an object has a marked-up field containing

<imdcsearch objtype="data" grouptag="foo-2005-12">Data in the foo-2005-12 group</imdcsearch>

then the generated detail page will contain HTML that renders as Data in the foo-2005-12 group and links to a search result page as described.
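
The * and ? wildcards allowed in string search criteria can be approximated with the Python sketch below. The text does not define the wildcards precisely, so this assumes the conventional meaning (* matches any run of characters, including none; ? matches exactly one character), case-sensitive comparison, and matching against the whole value; the function names are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a pattern with * and ? wildcards into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any run of characters, possibly empty
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


def wildcard_match(pattern, value):
    """Return True if the whole value matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(value) is not None
```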

Attributes marked as taking objref values in the list below refer to other objects. The value of these attributes may begin with handle: or name: to indicate that the rest of the value specifies the other object by handle or name. If neither of these prefixes is used, the entire value is treated as a name. For example,

<imdcsearch objtype="data" contributor="handle:/contact/1-001W-1">Data contributed by CAIDA</imdcsearch>

will generate HTML that renders as Data contributed by CAIDA and links to a search result page as described.
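
The prefix rule for objref values can be expressed as a short Python sketch. The function name and the returned tuple shape are hypothetical; the rule itself (a leading handle: or name: selects the lookup kind, otherwise the whole value is a name) follows the description above.

```python
def parse_objref(value):
    """Split an objref attribute value into (kind, reference)."""
    for kind in ("handle", "name"):
        prefix = kind + ":"
        if value.startswith(prefix):
            # The rest of the value specifies the object by handle or name.
            return kind, value[len(prefix):]
    # With neither prefix, the entire value is treated as a name.
    return "name", value
```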

End tag: required

Attributes:

objtype=type
The type of object to search for: contact, dformat (data format), pformat (package format), data, collection, package, location, annkey, or annotation.
name=string
search by object's name
contributor=objref
search for objects with the specified contributor
creator=objref
search for objects with the specified creator
short_desc=string
search by object's short description
desc=string
search by object's description
email=string
for contact only: search by contact's email
cc=string
for contact only: search by contact's 2-letter ISO country code
org=string
for contact only: search by contact's organization
type=type
for contact only: person or role
for formats only: text, binary, or mixed
suffix=string
for formats only: search by format's file suffix
format=objref
for data and package only: search for objects with the specified format
size=number
for data and package only: search by object's size in bytes
geoloc=string
for data and location only: search by object's geographic location
logloc=string
for data and location only: search by object's logistic location
netloc=string
for data only: search by object's network location
collection=objref
for data only: search for data that belong to the specified collection (but an imdcref pointing to the collection is usually better than an imdcsearch pointing to the contents of the collection).
grouptags=string[,string]...
for data only: search for objects that have a group tag matching all of the strings in the comma-separated list
keywords=string[,string]...
for data, package, format, and collection only: search for objects that have a keyword matching all of the strings in the comma-separated list
valtype=type
for annkey only: string, integer, real, or boolean
showform=1
link to a search form instead of a search result page. The form will be pre-filled-in with values according to the other imdcsearch attributes.

The accepted standard HTML tags are: a, b, big, blockquote, br, cite, code, dd, dfn, dl, dt, em, h4, h5, h6, hr, i, kbd, li, ol, p, pre, q, samp, small, strong, sub, sup, table, td, th, tr, tt, ul, var. The accepted standard HTML attributes are: align, border, colspan, href, rowspan, valign.

Particularly bad cases of invalid markup will be rejected at contribution time. Less severe cases will be silently and automatically cleaned up.
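
Before submitting marked-up text, it can be useful to check it against the accepted tag and attribute lists. The Python sketch below does this with the standard library's html.parser; it is not IMDC's validator and does not reproduce its reject-or-clean-up behavior. The class and function names are hypothetical, and the sketch assumes that any accepted attribute may appear on any accepted standard tag, which the text does not confirm.

```python
from html.parser import HTMLParser

STANDARD_TAGS = set(
    "a b big blockquote br cite code dd dfn dl dt em h4 h5 h6 hr i kbd li ol "
    "p pre q samp small strong sub sup table td th tr tt ul var".split()
)
STANDARD_ATTRS = {"align", "border", "colspan", "href", "rowspan", "valign"}
IMDC_TAGS = {"imdcnum", "imdcref", "imdcsearch"}


class MarkupChecker(HTMLParser):
    """Collect tags and attributes that are not on the accepted lists."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag in IMDC_TAGS:
            return  # IMDC tags have their own attributes; not checked here
        if tag not in STANDARD_TAGS:
            self.problems.append(f"tag <{tag}> is not accepted")
            return
        for name, _value in attrs:
            if name not in STANDARD_ATTRS:
                self.problems.append(f"attribute {name} on <{tag}> is not accepted")


def find_markup_problems(text):
    """Return a list of problems found in a marked-up field value."""
    checker = MarkupChecker()
    checker.feed(text)
    checker.close()
    return checker.problems
```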


Software version 1.7.24
Page generated at 2008‑07‑04 13:59:46 UTC
CAIDA Cooperative Association for Internet Data Analysis