News
2007-09-20:
New imdc-submission-tools version 20070920.0 is significantly changed since
version 0.25:
- Added the "data-to-yaml" tool to automatically analyze data files and
generate metadata (in YAML form).
- Added the "subcat" tool to merge YAML files generated by data-to-yaml with
manually written YAML files into an XML submission file.
Who can contribute?
At this early stage of
DatCat deployment, contributions are by invitation only.
Eventually, we plan to make contribution open to anyone with a
login account.
What should I contribute?
Contribute metadata about any network measurement data you have that you are
potentially willing to make available to network researchers.
For specific recommendations on what type of metadata to include,
refer to CAIDA's web page on
How to Document a Data Collection.
But I don't want to give unlimited access to my data.
IMDC does not store the data itself, only metadata to help researchers
identify and find the data. Storage of the data is up to you, leaving you
in complete control of access to the data.
How do I contribute?
Before you can begin, you must download
imdc-submission-tools-20080324.0.tar.gz for Unix-like systems, including Mac OS X
(there are currently no tools for Windows platforms, but we're working on it).
There are two steps in contributing metadata: generating a submission,
and submitting it.
The easiest way to generate a submission is to use data-to-yaml
to automatically analyze your data, and then use subcat
to merge those results with other information you provide.
You can also generate a submission with the Submission.pm perl API,
which is useful for automating recurring submissions or for extracting metadata
from a source other than the raw data files.
Using the
submission page
is the easiest way to submit the submission file.
Or, if you need more automation, you can submit with the imdcpost
command-line tool.
Submissions are made in "staging areas", which allow you to insert, edit,
view, and delete objects in the catalog before making them visible to
other users. You may have multiple staging areas, but objects in one staging
area cannot reference objects in other staging areas.
When you are satisfied with your submitted staged objects, you can
Validate them and then Request Activation. A human reviewer will
review your staging area, and either Activate it to make it visible to
other users, or work with you to fix any problems.
Viewing, validating, deleting and requesting activation can all be done from the
submission page.
Once activated, objects can still be edited
via the web interface or by making an XML submission,
but can never be deleted.
If you are contributing data on behalf of some organization, consider
using a role account
instead of a personal account to make your submission.
Guidelines on writing submissions
To help people find your data (and that is why you're contributing,
after all), try to fill in as much information as possible. But remember that
it is often better to contribute incomplete information than nothing
at all, so don't let incomplete information stop you from contributing;
you can always come back later and edit your contributions to add more
information.
Fields may contain text in any language, provided that your browser
uses the correct character encoding.
Valid encodings include "UTF-8" and "ISO-8859-1".
But remember that the primary language of DatCat is English,
and some users may be unable to view non-western text,
so you should try to avoid using non-western text in key fields
(like "Name"), and provide western alternatives for any non-western
text you use in other fields.
Before continuing, be sure you are familiar with the
Object Type documentation.
- Common fields
-
- Name (required) (maximum length 128 characters)
- Objects should be given names that are descriptive, but brief.
Although IMDC does not require names of most objects to be unique,
you should strive to pick names that are unique to avoid confusion.
If an object's name is not unique,
IMDC will append a unique number whenever it displays that name.
Names on most objects are limited to
128 characters.
- creators
- A list of contacts who
created the real-world item described by this object, which does not
necessarily include the contact who is contributing it to IMDC.
If appropriate contact objects do not exist,
you can create contacts without logins to use as creators.
- Primary contact
- The preferred contact for
answering inquiries about the real-world item described by this object,
particularly if that contact differs from the creator(s).
For example, the primary contact could be a role contact (with a role email
address) corresponding to a team, and the creators could be person
contacts corresponding to members of the team.
- Short description (required) (maximum length 128 characters)
- The short description is the only free-form text that appears in search
results, so try to write text that will describe the most important
features of the object in a way that will give a broad categorization
of the object as well as distinguish the object from similar objects.
Because the short description is displayed in a column in search results,
you may want to keep it brief by omitting information that is included
in the object's name or visible elsewhere on the search result page
(e.g., a Data object's format and
date).
- Description (allows markup) (maximum length 4000 characters)
- The description field can be up to
4000 characters,
so be as descriptive as you can.
The description appears only on an object's detail page.
Please write something here, even if you use Description URL to point at
a description on another web page.
- Description URL (maximum length 1018 characters)
- This should be the URL of a web page that describes the object, if there
is one. You might
want to use this if you already have an existing web page, or if many
objects share a large identical description.
- Keywords (maximum length 500 characters)
- A comma-separated list of words or phrases that users are likely to use
when trying to find your object.
For Data and Package objects, it is not necessary to include the file format,
since that can be searched separately.
When appropriate, try to use keywords that are already being used,
for ease of searching.
For a list of keywords currently in use on Collection objects,
see the Keyword section of the
Browse page.
But also feel free to make up your own new keywords for ideas
not covered by existing keywords.
- Citation text (maximum length 2000 characters)
- Citations displayed on detail pages are normally generated automatically
from other fields. But you can use this Citation Text field to alter the
automatically generated citation, by writing a partial BibTeX entry,
without a citation key, containing only the BibTeX fields you want to
override. The entry type (e.g. @MISC) and brackets
may be omitted if you do not want to override the automatic type.
To suppress an automatically
generated BibTeX field, include a corresponding field with an empty value.
There is one special case: for all object types except Publication,
the automatic note field contains the object's DatCat URL and cannot be
overridden;
if you specify a note, it will be appended to the end of the automatic note.
For example, setting citation text to
title = {My Alternate Title}, abstract = "", editor = "John Doe"
will replace the automatically generated title field,
suppress the automatically generated abstract field,
and add a new editor field.
- Private ID (maximum length 40 characters)
- If you are contributing information to IMDC that already exists in
a separate database, you can use the private_id field of IMDC objects
to store an identifier from that other database in order to record
that relationship. IMDC will warn you if you attempt to insert objects
with duplicate Private IDs.
IMDC lets you search for your objects by private ID and get mappings
of your private IDs to IMDC Handles.
An object's Private ID is normally visible only to the object's contributor
and IMDC administrators, although it may be transmitted unencrypted.
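The Citation text override rules described above can be illustrated with a minimal Python sketch. It is not DatCat's implementation: the function name merge_citation is hypothetical, the overrides are assumed to be already parsed into a dictionary, and joining the automatic and manual note with a single space is an assumption.

```python
def merge_citation(auto_fields, overrides, is_publication=False):
    """Merge manual BibTeX overrides into automatically generated fields.

    auto_fields and overrides map BibTeX field names to string values.
    (Hypothetical helper; the real catalog parses the Citation text itself.)
    """
    merged = dict(auto_fields)
    for name, value in overrides.items():
        if name == "note" and not is_publication and "note" in auto_fields:
            # The automatic note (the DatCat URL) cannot be replaced;
            # manual text is appended. The separator is an assumption.
            if value:
                merged["note"] = auto_fields["note"] + " " + value
        elif value == "":
            # An empty value suppresses the automatic field.
            merged.pop(name, None)
        else:
            # Any other value replaces or adds the field.
            merged[name] = value
    return merged


auto = {
    "title": "Original Title",
    "abstract": "An abstract.",
    "note": "http://example.org/obj",  # placeholder URL
}
manual = {"title": "My Alternate Title", "abstract": "", "editor": "John Doe"}
print(merge_citation(auto, manual))
```

With the example from the Citation text field, the title is replaced, the abstract is dropped, the editor is added, and the automatic note is left untouched.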
- Data
-
Data objects, along with their corresponding Packages and Locations,
are by far the most common types of objects to contribute.
Each Data object must belong to at least one Package,
and should belong to at least one Collection.
- Name (required) (maximum length 128 characters)
- A good name for a data object
may include some form of the type of data, date, location, and file format.
If the filename of the actual data file is descriptive enough,
that is usually a good choice for the IMDC object's name.
For example, "dump.out" is a poor name; "campus-2005-10-30.dag" is better.
Remember that for data,
the file to use is the most natural working form of the data,
which is usually not compressed or archived.
- Description (allows markup) (maximum length 4000 characters)
- There is no need to repeat anything here that
is already described in the corresponding
Data Format,
although in that case it would be wise to explicitly tell readers
that they can find more information on the format's detail page.
- Format (required)
- The format of the data file. Many common formats are already defined
in DatCat, but if there is no definition for the one you need, you can
contribute your own
Data Format object.
- Grouptags (maximum length 500 characters)
- A comma-separated list of words or phrases shared by
a related group of Data objects. Operationally, group tags work just
like keywords, but conceptually they are more like Collections in that
they identify a set of related objects. Unlike Collections, however,
there is no "Group" object with a description or other fields; membership
in a group is determined solely by the existence of a group tag on the
object.
Use a group tag instead of a Collection when the set of objects
is one that users would generally not want to search for directly.
Group tags are useful when a user has found one Data object, and wants
to find other closely related objects, but where the group doesn't need
a description and doesn't need to appear in Collection searches.
For large data sets, it is often desirable to define a Collection for
the entire set, and then define group tags for closely-related subsets.
For example, a set of traceroutes taken daily for several months from
10 hosts might all belong to a single Collection,
with a set of groups for each set of traces taken on the same day,
and a set of groups for each set of traces taken from the same host;
the group tags would be the name of the collection
plus the date or host name.
- Geographic location (maximum length 4000 characters)
- Network location (maximum length 4000 characters)
- Logistic location (maximum length 4000 characters)
- Fill in whatever is relevant to the data. For example, the network
or logistic location is important to a packet trace or active probe,
but a Route-Views snapshot of the global routing table doesn't really
have a location. To maximize the usefulness of geographic location
for searching, be complete; e.g. enter "San Diego, California, US"
instead of just "San Diego".
Line breaks in these fields are preserved in display.
- Platform (maximum length 4000 characters)
- The hardware, software version, and OS used to collect the data,
if it's relevant to the data collection.
Line breaks in this field are preserved in display.
- Time zone (maximum length 40 characters)
- The time zone of data may be useful for interpreting diurnal
patterns. In a future version of IMDC, users will have the option of
displaying a Data object's timestamps in that object's time zone.
See also: List of time zones.
- Creation process (allows markup) (maximum length 4000 characters)
- Describe the procedure used to collect this particular data file.
Include all configuration parameters,
and quote command lines verbatim if possible.
General information common to all data
objects with the same format should usually be relegated to the
Data Format's description.
- Citation collection
- If you would prefer users to refer to a collection rather than directly
to this data object when making a citation, specify that collection here.
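As an illustration of the Grouptags example above (one group per day and one group per host, each named after the collection), here is a minimal Python sketch. The function name, the hyphen used to join the collection name to the date or host, and the sample values are all assumptions made for illustration.

```python
from datetime import date

MAX_GROUPTAGS_LENGTH = 500  # field limit stated for Grouptags


def make_group_tags(collection_name, trace_date, host_name):
    """Build the comma-separated Grouptags value for one Data object."""
    tags = [
        # Group of all traces taken on the same day.
        f"{collection_name}-{trace_date.isoformat()}",
        # Group of all traces taken from the same host.
        f"{collection_name}-{host_name}",
    ]
    field = ", ".join(tags)
    if len(field) > MAX_GROUPTAGS_LENGTH:
        raise ValueError("Grouptags field exceeds 500 characters")
    return field


print(make_group_tags("example-traceroutes", date(2005, 12, 1), "host03"))
```

Every Data object from the same day shares the first tag, and every Data object from the same host shares the second, so a user who finds one trace can reach its siblings along either axis.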
- Collection
-
A Collection is a set of Data objects with a common purpose or use,
often collected as part of a single effort.
If you want to make a Collection X, and there exists a Collection Y that
contains a subset of the Data you want to include in Collection X, then
instead of repeating Y's contents in X, you should consider putting Y itself
inside X.
On the other hand, if your set of Data objects is already contained by a
larger Collection and the set does not have a compelling purpose of its own,
you might consider putting a common
group tag
on the set of Data objects instead of making them a Collection.
Normally a Collection should contain multiple items, although during the
contribution process a Collection is allowed to be empty temporarily.
Empty collections are never displayed on
the Browse page.
- Name (required) (maximum length 128 characters)
- A good name for a collection object
usually includes some indication of the purpose of the collection.
- Motivation (maximum length 400 characters)
- Describe the reason you thought it was useful to gather data objects
together into this single Collection.
This field is displayed only on the Collection's detail page.
- Summary (required) (maximum length 800 characters)
- The summary should describe the most important features of the Collection,
to help users decide if the collection is potentially interesting to them.
This may include descriptions of any or all of its
contents, purpose, creators, location, timing, or how it was created.
The summary will appear alone (without the description) in the
"more information" view of Collection search results
and may appear on the Browse page.
Technical details should be put in the description, not the summary.
- Description (allows markup) (maximum length 4000 characters)
-
The description is displayed after the summary on a collection's detail
page, so should not repeat the summary verbatim, but should expand upon
it. If you feel the need to include statistics or other dated information
for a collection that may grow in the future, be sure to mention the
date on which that information was correct.
- Data start time
- The earliest time represented by data in the collection.
- Data duration
- Use 0 seconds to indicate that the collected Data objects represent a
single instant;
use the value "ongoing" to indicate that addition of new Data objects
to the Collection is ongoing.
- Publication
-
A Publication in IMDC describes a scholarly paper, article, or other
publication, but more importantly for IMDC, it organizes the Data
used by the publication.
Thus, it can be considered a kind of specialized Collection.
Like a Collection, a Publication may contain other Collection or Publication
objects in order to incorporate their contents.
- Title (required) (maximum length 128 characters)
- The full title of the paper, including correct capitalization.
- Publication date (required) (maximum length 10 characters)
- The date of publication, in "YYYY", "YYYY-MM", or "YYYY-MM-DD" format.
- Venue (required) (maximum length 128 characters)
- This should be the full name of the venue, but should usually not
contain the date.
- Summary (required) (maximum length 800 characters)
- The summary should describe the Publication in a couple of sentences.
The summary will appear in several places where the abstract will
not: the "more information" view of Publication search results,
and possibly on the Browse page.
- Abstract (allows markup) (maximum length 4000 characters)
-
The abstract is displayed after the summary on a publication's detail
page, so it should not repeat the summary verbatim, but should expand upon
it. If you feel the need to include statistics or other dated information
about data that may grow in the future, be sure to mention the
date on which that information was correct.
- Data start time
- The earliest time represented by data used by the publication.
- Data duration
- Use 0 seconds to indicate that the used Data objects represent a
single instant.
Unlike a Collection, a Publication's data duration should not be marked
as ongoing, since a publication could not have used data that did not exist
at the time of publication.
- Citation text (maximum length 2000 characters)
- Citations for publications follow the same rules as
citations for other object types,
with the exception that the note field is not set automatically,
and the DatCat URL will appear in a field named datcat_url.
The following BibTeX fields will be automatically generated from
information in the catalog:
title, author, year, month, datcat_url, and abstract.
You must use citation_text to manually set any other information you
wish to appear, including the BibTeX entry type
("ARTICLE", "TECHREPORT", "INPROCEEDINGS", etc.).
For example,
@INPROCEEDINGS{
booktitle = "Annals of Internet Research",
editor = "Edward I. Tor",
volume = "42",
pages = "107-116",
publisher = "Megadodo Publications"
}
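The three accepted Publication date formats ("YYYY", "YYYY-MM", "YYYY-MM-DD") can be checked before submission with a short Python sketch like the one below. The function name is hypothetical, and rejecting impossible calendar dates such as month 13 is an assumption about what a sensible check would do, not documented DatCat behavior.

```python
import re
from datetime import date

# Year, optionally followed by -MM, optionally followed by -DD.
_PUB_DATE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def is_valid_publication_date(text):
    """Return True if text is YYYY, YYYY-MM, or YYYY-MM-DD and a real date."""
    match = _PUB_DATE.fullmatch(text)
    if not match:
        return False
    year, month, day = match.groups()
    try:
        # Missing month or day defaults to 1 for the calendar check.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True


for candidate in ("2007", "2007-09", "2007-09-20", "2007-13", "2007-9-20"):
    print(candidate, is_valid_publication_date(candidate))
```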
- Package
-
Each Package must contain at least one Data object.
Also, each Package must have at least one corresponding Location
or be nested inside another Package that has a Location.
- Name (required) (maximum length 128 characters)
- A good name for a package object
may include some form of the type of data, date, location, and file format.
If the filename of the actual package file is descriptive enough,
that is usually a good choice for the IMDC object's name.
For example, "dump.out" is a poor name; "campus-2005-10-30.dag" is better.
If you are making the raw uncompressed data file available
for download, then the Data and Package
refer to the same file, and thus may reasonably have the same name.
- Description (allows markup) (maximum length 4000 characters)
- There is no need to repeat anything here that
is already described in the corresponding
Package Format,
although in that case it would be wise to explicitly tell readers
that they can find more information on the format's detail page.
- Format (required)
- The format of the package file. This is usually an archive or compression
format. Many common formats are already defined
in DatCat, but if there is no definition for the one you need, you can
contribute your own
Package Format object.
- contents
- Package contents are files corresponding to data or other packages.
If a package contains only a single data file, and is not compressed,
its format should be "not packaged".
A package may contain nested packages if the nested package requires
a second tool or step to unpack it;
for example, an ISO image package file containing a zip package file.
On the other hand, a .tar.gz package file
can be unpacked with a single tar command,
so is best represented as a single tar-gzip-format
package, instead of a gzip-format package
containing a tar-format package.
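The guidance above on choosing a single package format can be sketched in Python. The format names "tar-gzip", "gzip", "tar", and "not packaged" come from the text; the suffix table and the function name are assumptions for illustration, and the catalog's actual Package Format definitions may differ.

```python
# Longer suffixes come first so ".tar.gz" is not mistaken for plain ".gz".
# (Assumed mapping; check the Package Format objects defined in the catalog.)
SUFFIX_TO_FORMAT = [
    (".tar.gz", "tar-gzip"),
    (".tar", "tar"),
    (".gz", "gzip"),
]


def suggest_package_format(filename):
    """Suggest one package format for a file, preferring a combined format."""
    lower = filename.lower()
    for suffix, format_name in SUFFIX_TO_FORMAT:
        if lower.endswith(suffix):
            return format_name
    # A single uncompressed data file is its own package.
    return "not packaged"


for name in ("traces.tar.gz", "campus-2005-10-30.dag.gz", "campus-2005-10-30.dag"):
    print(name, "->", suggest_package_format(name))
```

A file that needs a second tool to unpack, such as a zip file inside an ISO image, would still be described as nested packages; this sketch covers only the single-step case.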
- Data Format and Package Format
-
- file suffixes
- Give a list of commonly used file suffixes for files with this format.
Remember to include the leading period (if any).
If you would like to add a new field to all objects that use your format, you
can achieve this effect by defining a new
Annotation Key that is
specific to your format. For example, if files that use your IceCream
format can have different flavors, you might want to define a format-specific
"flavor" Annotation Key
so that anyone who contributes an IceCream Data
object can specify its flavor.
- Contact
-
- Full name (required) (length 3 to 256 characters)
- Contact names should be
the full name of the person or organization.
- Location
-
- Download URL (maximum length 1018 characters)
- This should be a URL that links directly to a copy of the package
file.
If there is no such direct link, give instructions in
download procedure instead.
If there is a direct URL, but it is password-protected or otherwise
restricted, you can specify it here, but be sure to set
availability to "restricted" and give further instructions
in download procedure.
- Availability (required)
- If the package's availability is "restricted",
you must give instructions in download procedure
on how to obtain permission.
- Download procedure (allows markup) (maximum length 1018 characters)
- Give instructions on obtaining the package at this location. This
is required unless there is a download URL and
availability is "free".
If your instructions include a URL, remember to use
markup.
- Geographic location (maximum length 4000 characters)
- Logistic location (maximum length 4000 characters)
- These can help users predict what kind of performance to expect from the
server.
To maximize the usefulness of geographic location,
be complete; e.g. enter "San Diego, California, US"
instead of just "San Diego".
Line breaks in these fields are preserved in display.
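The interaction between Download URL, Availability, and Download procedure described above can be expressed as a small pre-submission check. This Python sketch is illustrative only: the function name is hypothetical, and only the availability values "free" and "restricted" mentioned in the text are handled specially.

```python
MAX_URL_LENGTH = 1018
MAX_PROCEDURE_LENGTH = 1018


def check_location(download_url, availability, download_procedure):
    """Return a list of problems with a Location's download fields."""
    problems = []
    url = (download_url or "").strip()
    procedure = (download_procedure or "").strip()

    if not availability:
        problems.append("availability is required")
    if availability == "restricted" and not procedure:
        problems.append(
            "restricted availability needs a download procedure "
            "explaining how to obtain permission"
        )
    elif not procedure and not (url and availability == "free"):
        problems.append(
            "download procedure is required unless there is a "
            "download URL and availability is free"
        )
    if len(url) > MAX_URL_LENGTH:
        problems.append("download URL exceeds 1018 characters")
    if len(procedure) > MAX_PROCEDURE_LENGTH:
        problems.append("download procedure exceeds 1018 characters")
    return problems


print(check_location("http://example.org/traces.tar.gz", "free", ""))
print(check_location("http://example.org/traces.tar.gz", "restricted", ""))
```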
- Annotation Key
-
- Name (required) (maximum length 128 characters)
- Annotation Key names must be
unique within the namespace defined by the definer, object type,
and format. Additionally, the following conventions are not enforced,
but are recommended to help maintain consistency and readability:
- Keys whose names begin with a period-delimited prefix
belong to a group of related keys.
- In particular, keys whose names begin with "cfg."
describe configuration parameters.
- Use "_" to separate words in key names.
- Use a "min_" or "max_" prefix for the
minimum or maximum value of a parameter or statistic.
- Keys whose names end with "_count" describe counters.
Avoid "num_", "no_", and "n_".
- Avoid abbreviations, except for those that are very well known.
- Browse the
standard annotation keys to see other conventions.
- Standard
- If the key is specific to a format that you contributed,
you can set Standard to "yes" to declare that the key is
an extension of the format. When the key is displayed, it will
appear with the format as its Definer. Otherwise, the
contributor (you) will be its Definer.
- Short description (required) (maximum length 128 characters)
- Keep in mind that a key's short description will appear in annotation key
search, and (in many browsers) as a tooltip when a user
hovers over a key's name on an object's detail page.
- Description (allows markup) (maximum length 4000 characters)
- Give a full definition of the key. If the key's type is numeric,
be sure to define the unit of measurement here. If the key allows
a limited set of values, enumerate or define that set here.
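The recommended Annotation Key naming conventions above are not enforced by IMDC, but a contributor could check names locally. The following Python sketch is one possible lint; the function name and the exact heuristics (for example, treating camelCase as a word-separation problem) are assumptions.

```python
import re

MAX_NAME_LENGTH = 128


def lint_annotation_key_name(name):
    """Return warnings for a key name that departs from the conventions."""
    if not name:
        return ["name is required"]
    warnings = []
    if len(name) > MAX_NAME_LENGTH:
        warnings.append("name exceeds 128 characters")
    # Words should be separated with "_", not spaces, hyphens, or camelCase.
    if re.search(r"[\s-]", name) or re.search(r"[a-z][A-Z]", name):
        warnings.append("use '_' to separate words")
    # The part after the last period is the key's own name within its group.
    base = name.rsplit(".", 1)[-1]
    if re.match(r"(num|no|n)_", base):
        warnings.append("prefer a '_count' suffix over num_, no_, or n_")
    return warnings


for key in ("cfg.max_packet_size", "num_packets", "packetCount"):
    print(key, lint_annotation_key_name(key))
```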
Markup
Some large text fields allow the contributor to include markup in the text.
There are two choices for type of markup:
- none
- Line breaks are preserved, and every other sequence of whitespace is
collapsed to a single space.
- HTML
- Not true HTML,
but a variation that allows a subset of standard HTML tags and attributes,
plus the following additional tags:
imdcnum
- Formats a number with thin spaces between sets of three digits
for improved readability (in browsers with standards-compliant CSS
support; in others, imdcnum has no effect).
This is preferred over the use of
commas (which look like decimal marks in many European languages)
or periods (which look like decimal marks in English).
For example,
<imdcnum>87654321.012345</imdcnum>
will be rendered with a thin space between each group of three digits
instead of as the unbroken 87654321.012345.
imdcref
- Creates a link to the detail page of an IMDC object.
Exactly one of the
self, handle,
or xid attributes must be specified.
When IMDC generates a page containing an <imdcref>,
it looks up the object's name at that time, and uses the name as the link text.
For example, if an object has a marked-up field containing
<imdcref handle="/contact/1-0002-H">
then the generated detail page will contain HTML that renders as
Ken Keys
and links to the detail page for /contact/1-0002-H.
Another example: to contribute a new object that mentions
its own handle in a marked-up field, despite the fact that you don't
know what its handle will be yet, you could write something like
My handle is <imdcref self asHandle>
End tag: none
Attributes:
self
- specifies that the link should refer to the object
to which this field belongs.
handle=handle
- specifies the IMDC Handle
of the IMDC object to which the link will refer.
xid=xid
- specifies the xid
of the IMDC object to which the link will refer.
The xid attribute is valid only in XML submission
files, and must refer to an xid defined before the end of the
current transaction.
asHandle
- display the object's full Handle instead of the object's
name.
imdcsearch
- Creates a link to a search result page. The objtype
attribute must be specified, along with one or more other attributes
to specify the search criteria. String search criteria allow
* and ? wildcards; numeric search
criteria require equality in order to match.
For example, if an object has a marked-up field containing
<imdcsearch objtype="data" grouptag="foo-2005-12">Data
in the foo-2005-12 group</imdcsearch>
then the generated detail page will contain HTML that renders as
Data in the foo-2005-12 group
and links to a search result page as described.
Attributes marked as taking objref values
in the list below refer to other objects. The value of these
attributes may begin with "handle:" or "name:"
to indicate that the rest of the value specifies the other object
by handle or name. If neither of these prefixes is used,
the entire value is treated as a name.
For example,
<imdcsearch objtype="data"
contributor="handle:/contact/1-001W-1">Data
contributed by CAIDA</imdcsearch>
will generate HTML that renders as
Data contributed by CAIDA
and links to a search result page as described.
End tag: required
Attributes:
objtype=type
- The type of object to search for:
contact,
dformat (data format),
pformat (package format),
data,
collection,
package,
location,
annkey, or
annotation.
name=string
- search by object's name
contributor=objref
- search for objects with the specified contributor
creator=objref
- search for objects with the specified creator
short_desc=string
- search by object's short description
desc=string
- search by object's description
email=string
- for contact only: search by contact's email
cc=string
- for contact only: search by contact's 2-letter ISO country code
org=string
- for contact only: search by contact's organization
type=type
- for contact only:
person or role
- for formats only:
text, binary,
or mixed
suffix=string
- for formats only: search by format's file suffix
format=objref
- for data and package only: search for objects with the
specified format
size=number
- for data and package only: search by object's size in bytes
geoloc=string
- for data and location only: search by object's geographic
location
logloc=string
- for data and location only: search by object's logistic
location
netloc=string
- for data only: search by object's network location
collection=objref
- for data only: search for data that belong to the
specified collection (but an imdcref
pointing to the collection is usually better than an
imdcsearch pointing to the contents of
the collection).
grouptags=string[,string]...
- for data only:
search for objects that have a group tag matching all of
the strings in the comma-separated list
keywords=string[,string]...
- for data, package, format, and collection only:
search for objects that have a keyword matching all of
the strings in the comma-separated list
valtype=type
- for annkey only:
string, integer,
real, or boolean
showform=1
- link to a search form instead of a search result page.
The form will be pre-filled-in with values according to
the other imdcsearch attributes.
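Two of the imdcsearch rules above, the "handle:" and "name:" prefixes on objref values and the * and ? wildcards in string criteria, can be sketched in Python as follows. Both function names are hypothetical, and treating a wildcard pattern as case-sensitive and as matching the whole string is an assumption; the catalog's real matching rules may differ.

```python
import re


def parse_objref(value):
    """Split an objref value into (kind, identifier).

    A value without a "handle:" or "name:" prefix is treated as a name.
    """
    for prefix in ("handle:", "name:"):
        if value.startswith(prefix):
            return prefix[:-1], value[len(prefix):]
    return "name", value


def wildcard_match(pattern, text):
    """Match text against a pattern where * is any run and ? is one character."""
    regex = "".join(
        ".*" if char == "*" else "." if char == "?" else re.escape(char)
        for char in pattern
    )
    return re.fullmatch(regex, text, flags=re.DOTALL) is not None


print(parse_objref("handle:/contact/1-001W-1"))
print(parse_objref("CAIDA"))
print(wildcard_match("foo-2005-*", "foo-2005-12"))
```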
The accepted standard HTML tags are:
a,
b,
big,
blockquote,
br,
cite,
code,
dd,
dfn,
dl,
dt,
em,
h4,
h5,
h6,
hr,
i,
kbd,
li,
ol,
p,
pre,
q,
samp,
small,
strong,
sub,
sup,
table,
td,
th,
tr,
tt,
ul,
var.
The accepted standard HTML attributes are:
align,
border,
colspan,
href,
rowspan,
valign.
Particularly bad cases of invalid markup will be rejected at
contribution time.
Less severe cases will be silently and automatically cleaned up.
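To catch unaccepted tags or attributes before contributing, a marked-up field can be checked locally against the lists above. This Python sketch uses the standard html.parser module; the function and class names are hypothetical, and it only reports what falls outside the accepted lists. It does not reproduce which cases DatCat rejects and which it cleans up.

```python
from html.parser import HTMLParser

ACCEPTED_TAGS = {
    "a", "b", "big", "blockquote", "br", "cite", "code", "dd", "dfn", "dl",
    "dt", "em", "h4", "h5", "h6", "hr", "i", "kbd", "li", "ol", "p", "pre",
    "q", "samp", "small", "strong", "sub", "sup", "table", "td", "th", "tr",
    "tt", "ul", "var",
}
ACCEPTED_ATTRS = {"align", "border", "colspan", "href", "rowspan", "valign"}
IMDC_TAGS = {"imdcnum", "imdcref", "imdcsearch"}  # have their own attributes


class _MarkupChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag in IMDC_TAGS:
            return
        if tag not in ACCEPTED_TAGS:
            self.problems.append(f"tag <{tag}> is not accepted")
            return
        for attr_name, _value in attrs:
            if attr_name not in ACCEPTED_ATTRS:
                self.problems.append(
                    f"attribute '{attr_name}' on <{tag}> is not accepted"
                )


def find_unaccepted_markup(text):
    """Return a list of tags and attributes outside the accepted lists."""
    checker = _MarkupChecker()
    checker.feed(text)
    checker.close()
    return checker.problems


print(find_unaccepted_markup('<p style="color:red">See <div>this</div></p>'))
```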