IS562 -- Metadata in Theory & Practice
Fall 2019
Version 1.0
December 9th, 2019
By Sam Walkow (swalkow2) and Nikolaus Parulian (nnp2)
Scientific Software Citation Data Dictionary
Publication Keywords (About)
Extensibility/Future Directions
Entity Semantics Units Tables
Test Corpus / Project Proposal
This document is a data dictionary that provides instructions for creating metadata records for scientific software citation. The schema is specifically designed to capture metadata about the software applications that contribute to academic publications and how they contribute, and to give credit to the maintainers and contributors of those applications. The types of metadata captured for this schema include:
The Scientific Software Citation schema is an RDF-structured schema that defines entities (people, publications, software, etc.) and properties of those entities (names, identifiers, keywords, etc.), and that links relevant properties together. This schema uses several entities defined by schema.org and introduces new entities that are unique to this schema.
This schema is intended to illustrate how software supports academic publications by linking software usage and intended purpose directly to the outcomes shown in a publication, such as figures, units of analysis, and discussion or conclusions, in an effort to link quantitative results with the supporting software. Additionally, the schema describes the people and entities involved, including the publication authors, software contributors and maintainers, funders, and publishers. The schema aims to structure this metadata in a way that makes academic publication queries more sophisticated, so that publications can be searched by the software used and vice versa. The purpose is twofold:
This data dictionary was written with the intention that authors would create metadata records for their own publications. However, the instructions should be comprehensive enough that anyone could create a scientific software citation metadata record.
This schema is intended to capture the metadata relevant at the time of publication. This means that while some values are subject to change, the schema is meant to preserve items such as the version of the software used at that time and the contributors who were involved at that time. In this way, outcomes from software applications can be reproduced, because the metadata accurately describes the necessary technical and preservation needs.
The Scientific Software Citation model is based on the RDF conceptual model and is designed to link publications with one or more authors, software applications, and contributors. It is also designed to link one or more software applications with one or more outcomes (a scientific product that came from the software, such as a chart or a type of analysis) and with one or more types of usage or intended usage (such as scientific visualization, numerical analysis, statistics, etc.).
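As a rough illustration of the linking structure described above, the following Python sketch models a publication that points to its authors and to each software application together with its outcomes and usage types. The class, field, and function names (ArticleRecord, SoftwareUse, articles_using) and the software identifiers are hypothetical and are not part of the schema; the sketch only shows how keeping the publication at the center makes it possible to search publications by the software they record.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SoftwareUse:
    """One software application as recorded for one publication (hypothetical name)."""
    software_id: str
    outcomes: List[str] = field(default_factory=list)
    usage_types: List[str] = field(default_factory=list)


@dataclass
class ArticleRecord:
    """A publication at the center of the model (hypothetical name)."""
    identifier: str
    author_ids: List[str] = field(default_factory=list)
    software: List[SoftwareUse] = field(default_factory=list)


def articles_using(articles, software_id):
    """Return the identifiers of the publications that record the given software."""
    return [
        article.identifier
        for article in articles
        if any(use.software_id == software_id for use in article.software)
    ]


# Illustrative records; the software identifiers are placeholders.
records = [
    ArticleRecord(
        "doi:/10.1088/0004-637X/726/1/55",
        ["author/#author1"],
        [SoftwareUse("software/yt", ["Figure 1"], ["visualization"])],
    ),
    ArticleRecord(
        "doi:/10.1086/673276",
        ["author/#colinAllen"],
        [SoftwareUse("software/inpho", [], ["development"])],
    ),
]
print(articles_using(records, "software/yt"))
```

The same traversal could be reversed (software looked up from a publication) because each use is stored on the publication record.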
The schema is designed with flexibility for the author in mind, so that the authors of a publication can record the usage and products/outcomes of the software as they used it. We realize this could mean that the same software application is recorded with different uses and different outcomes depending on the publication, and this is intended. What should remain constant is the publication as the center of the model and the unique identifiers for persons (authors or contributors) and their work (publication or software), so that items can be searched by those entities and credit can be given.
The schema is a combination of items from schema.org and newly defined entities to describe the software linked to publications. Each entity type has several entity properties attached to it. Entity properties are meant to hold values that describe the publication and software relationship. Value types include text, numbers, date types, URLs, links to previous entities and blank nodes. There are more details in the definitions section.
Our custom schema entities are denoted with the prefix scs: (Software Citation Schema). This is a customized schema that contains classes and properties we created as a complement to the schema.org schema.
For schema.org, we provide an Application Profile that describes how we use the schema.org classes and properties and fit them to the proposed Software Citation model.
As shown in the figure, there are five major entities we want to capture. We do so by providing a custom Application Profile for schema.org and by introducing a new schema called Software Citation Schema (scs), which is meant to complement schema.org and capture any parameters or properties that it does not originally support. The explanation of each entity can be found in the Entity Semantic Unit List and Entity Semantic Unit Tables sections.
For each entity, we also prepare a corpus target folder that is meant to separate different entities as follows:
http://softwarecitation.web.illinois.edu/corpus/author/
http://softwarecitation.web.illinois.edu/corpus/software/
http://softwarecitation.web.illinois.edu/corpus/repository/
http://softwarecitation.web.illinois.edu/corpus/developer/
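To show how the folders above could be combined with a local name, here is a minimal Python sketch. The base URL and folder names come from the list above; the helper name corpus_uri and the "#" fragment convention (borrowed from the relative identifiers used in the complete examples, such as author/#colinAllen) are assumptions and not a documented rule of the corpus.

```python
from urllib.parse import quote

# Base URL taken from the corpus folder list above.
CORPUS_BASE = "http://softwarecitation.web.illinois.edu/corpus/"


def corpus_uri(folder: str, local_name: str) -> str:
    """Build a full entity URI from a corpus folder and a local name.

    Assumption: the local name is appended as a '#' fragment.
    """
    return f"{CORPUS_BASE}{quote(folder)}/#{quote(local_name)}"


print(corpus_uri("author", "colinAllen"))
```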
A number of controlled vocabularies are used in the following entities, with links to the entity table with further details:
These terms describe the different platforms and formats an academic work can be published from.
These terms include the keywords used to tag a scholarly work, as seen in a publication’s keywords section.
These terms include the names of different funding bodies behind academic publications and software.
These terms include the keywords used to tag software repositories, as seen in a GitHub tags section.
These terms include the names of different legal licenses behind software applications.
These terms describe the various purposes researchers might have for software in their research or for the publication.
In this section we cover a high-level view of our data structure. We developed two semantic structures following the RDF schema:
Entity types. Ex. scs:Author
Entity properties. Ex. schema:givenName
Below is a definition of each row in the entities table:
While this schema is intended to give as much credit as possible to everyone involved in a publication, we had to compromise on the software side and have included instructions to record only the lead contributor or maintainer of a software application. Creators of the metadata records can repeat that entity and include as many code contributors as they want. However, some open source software applications have hundreds of contributors at different levels of the project with different roles. It may be too arduous to include every contributor, and this does limit the credit given to contributors, although automation may help with this in the future.
Along those lines, there isn’t an easy way to give credit to all the dependencies that a software application may be built on. For example, we use an open source visualization and analysis library called ‘yt’ as an example throughout this document, but we do not specify that yt depends on several other software packages such as NumPy and Matplotlib. There is space in this schema structure to specify as many interdependent software applications as needed, but the number of dependencies is often too large to enter manually. We hope automation in this area could make that addition to the schema possible.
Additional limitations arise from one-way links between some entities in our RDF structure. For example, schema:SoftwareSourceCode has space for a link to schema:SoftwareApplication, but not the other way around. This could limit the impact this schema has on searchability.
1. scs:Author
1.1 schema:givenName
1.2 schema:additionalName
1.3 schema:familyName
1.4 schema:affiliation
1.5 schema:identifier
1.6 schema:email
2. scs:ScholarlyArticle
2.1 schema:identifier
2.2 schema:name
2.3 schema:about
2.4 scs:author
2.4.1 scs:AuthorOrder
2.4.1.1 scs:author
2.4.1.2 scs:order
2.5 schema:abstract
2.6 schema:publisher
2.7 schema:datePublished
2.8 schema:publicationType
2.9 scs:useSoftware
2.9.1 scs:SoftwareOutcome
2.9.1.1 schema:identifier
2.9.1.2 scs:outcome
2.9.1.2.1 scs:Outcome
2.9.1.2.1.1 schema:articleBody
2.9.1.2.1.2 schema:pageStart
2.9.1.2.1.3 scs:usageType
2.10 schema:url
3. schema:SoftwareApplication
3.1 schema:name
3.2 schema:alternate
3.3 schema:description
3.4 schema:about
3.5 schema:relatedLink
3.6 schema:funder
3.7 schema:version
3.8 schema:license
3.9 schema:isBasedOn
4. schema:SoftwareSourceCode
4.1 schema:codeRepository
4.2 schema:isBasedOn
4.3 schema:targetProduct
4.4 schema:description
4.5 schema:contributor
5. scs:CodeMaintainer
5.1 schema:identifier
5.2 schema:name
5.3 schema:sameAs
Entity types | 1. scs:Author |
Entity properties | 1.1 schema:givenName 1.2 schema:additionalName 1.3 schema:familyName 1.4 schema:affiliation 1.5 schema:identifier 1.6 schema:email |
Subclass of | schema:Person |
Definition | This class is an application profile for scs:Author. This entity defines the person who will be credited with working on or contributing to the work that makes up a publication. An Author class should be used to define an author, or a person involved in the publication or the software that supported it. This entity should be a person who is mentioned in the publication. |
Rationale | An author is included in scholarly publications. An author is an entity which is intended to be credited with the work of writing a publication. It can be found on the first page of a publication. |
Complete example | <author/#colinAllen> a scs:Author ; schema:identifier "orcid:/0000-0003-4497-1725"^^xsd:string ; schema:givenName "Colin"^^xsd:string ; schema:familyName "Allen"^^xsd:string ; schema:affiliation "Department of History and Philosophy of Science and Program in Cognitive Science, Indiana University"^^xsd:string ; schema:affiliation "Indiana University"^^xsd:string ; schema:email "colallen@indiana.edu" ; . |
Entity properties | 1.1 schema:givenName |
Definition | First or given name of the author of a publication |
Rationale | An author’s first name, which in the United States or western world is the word that appears first in the name and is part of how a person is identified. |
Data constraint | Text |
Obligation | Optional |
Repeatable | Yes |
Usage notes | This entity can be repeated for multiple given names. Depending on the publication name format, often only the first initial or several initials are included; these can be recorded in givenName. Include hyphenated names as one name. Example (from the first page of a publication):
|
Entity properties | 1.2 schema:additionalName |
Definition | Middle name or initial of the author of a publication |
Rationale | An author’s middle name, which in the United States or western world is the word that appears second in a name. |
Data constraint | Text |
Obligation | Optional |
Repeatable | Yes |
Usage notes | Include hyphenated names as one name. This entity can be repeated for multiple additional names. Some publications only include the first initial or several initials, which can also be included in givenName. Example: Use the same source as 1.1
schema:additionalName "J." ;
schema:additionalName "H." ; |
Entity properties | 1.3 schema:familyName |
Definition | Last name of the author of a publication |
Rationale | An author’s last name, which in the United States or western world is the word that appears last in a name, is often how authors are identified and how they receive credit for their work. |
Data constraint | Text |
Obligation | Mandatory |
Repeatable | No |
Usage notes | Include hyphenated names as one name. This entity cannot be repeated to indicate multiple family names (since this is rare). Since this entity is how many authors are known, this field is mandatory so credit can be given to the author. If an author name is unclear or not formatted in the western style, include the entire name in this entity. Example: Use the same source as 1.1
a scs:Author ; schema:givenName "Matthew" ; schema:additionalName "J." ; schema:familyName "Turk" ;
schema:familyName "Clark" ;
schema:familyName "Glover" ;
schema:additionalName "H." ; schema:familyName "Grief" ; |
Entity properties | 1.4 schema:affiliation |
Definition | An author’s affiliation to an institute, company, entity or group |
Rationale | An author’s affiliation is often included in publications and is an important note about the author’s professional identity. |
Data constraint | Text |
Obligation | Optional |
Repeatable | Yes |
Usage notes | Multiple affiliations can be accommodated with repeat entries, which can be text or a hyperlink. Record the affiliation as it appears on the publication, in that format. We strongly recommend recording only the name of the affiliation rather than the name and address, although the address can be included. Example: Using the same source as 1.1
schema:additionalName "J." ; schema:familyName "Turk" ; schema:affiliation "Center for Astrophysics and Space Science" ; schema:affiliation "University of California-San Diego" ;
schema:familyName "Clark" ; schema:affiliation "Zentrum fur Astronomie der Universitat Heidelberg" ; schema:affiliation "Institut fur Theoretische Astrophysik" ; |
Entity properties | 1.5 schema:identifier |
Definition | An author’s unique identification method, often an ORCID or some other unique value assigned to an author. |
Rationale | An author’s unique identification method so their work can be credited to the correct person. We strongly recommend the use of an ORCID. |
Data constraint | Text |
Obligation | Mandatory |
Repeatable | Yes |
Usage notes | We strongly recommend using an ORCID number; however, any identifier can be used as long as it uniquely describes the author. Be sure to specify the type of identifier you are using before entering the identifier itself. See the example below. The identifier value must follow this formatting specification: <name_of_identifier>:/<value_of_identifier> Controlled vocabulary for name_of_identifier:
An ORCID can be found at https://orcid.org/orcid-search/search. Multiple identifiers can be accommodated with repeat entries, which can be text or a hyperlink. Example:
schema:familyName "Walkow" ; schema:identifier "orcid:/0000-0001-7329-1863" ; If we are using email: schema:givenName "Samantha" ; schema:familyName "Walkow" ; schema:identifier "local:/swalkow" ; |
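The `<name_of_identifier>:/<value_of_identifier>` format lends itself to a simple check before a record is saved. The following Python sketch splits such a value into its two parts and rejects strings that do not follow the pattern. The function name parse_prefixed_value is hypothetical, and the sketch does not check the name against the controlled vocabulary.

```python
def parse_prefixed_value(raw: str) -> tuple:
    """Split a '<name>:/<value>' string into (name, value).

    Raises ValueError when the ':/' separator, the name, or the value is
    missing, or when the input looks like a URL ('scheme://...').
    """
    name, separator, value = raw.strip().partition(":/")
    if not separator or not name or not value or value.startswith("/"):
        raise ValueError(f"expected '<name>:/<value>', got {raw!r}")
    return name, value


for raw in ("orcid:/0000-0001-7329-1863", "local:/swalkow"):
    print(parse_prefixed_value(raw))
```

The same split applies to publication identifiers such as "doi:/10.1088/0004-637X/726/1/55", because only the first ":/" is treated as the separator.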
Entity properties | 1.6 schema:email |
Definition | Email address is a point of contact for an author so they can be reached and further credited with their work. |
Rationale | Email addresses are often, but not always, included on a publication, although sometimes they are difficult to retrieve. An author can be contacted or identified this way. |
Data constraint | Text |
Obligation | Optional |
Repeatable | Yes |
Usage notes | Include the email that is on the publication, which can usually be found on the front page of the publication or webpage. Other ways to contact an author can be included here, but please specify what the contact is for. Example:
|
Entity types | 2. scs:ScholarlyArticle |
Entity properties | 2.1 schema:identifier 2.2 schema:name 2.3 schema:about 2.4 scs:author 2.4.1 scs:AuthorOrder 2.4.1.1 scs:author 2.4.1.2 scs:order 2.5 schema:abstract 2.6 schema:publisher 2.7 schema:datePublished 2.8 schema:publicationType 2.9 scs:useSoftware 2.9.1 scs:SoftwareOutcome 2.9.1.1 schema:identifier 2.9.1.2 scs:outcome 2.9.1.2.1 scs:Outcome 2.9.1.2.1.1 schema:articleBody 2.9.1.2.1.2 schema:pageStart 2.9.1.2.1.3 scs:usageType |
Subclass of | schema:ScholarlyArticle |
Definition | A scholarly article is a written academic work, such as a paper, journal article, chapter, or conference paper, in either physical or digital form. The publication class is mainly derived from the class schema:ScholarlyArticle. This entity defines the paper, or the item that relates to the actual work, which is later linked to authors and software applications. This is the center of the data model. One scholarly article entity should represent only one actual publication in the real world. |
Rationale | ScholarlyArticle entities and the SoftwareApplication entities that support them are the two entities between which we are trying to establish a relationship, and whose relationship we want to characterize. |
Complete Example | <article/#article1> a scs:ScholarlyArticle ; schema:identifier "doi:/10.1086/673276" ; schema:name "Cross-Cutting Categorization Schemes in the Digital Humanities"^^xsd:string ; schema:about "local:/digital humanities"^^xsd:string ; scs:author [ scs:Author <author/#colinAllen> ; scs:order 1 ; ] ; schema:abstract """Digital access to large amounts of scholarly text presents both challenges ….."""^^xsd:string ; schema:publisher "University of Chicago Press"^^xsd:string ; schema:datePublished "2013" ; schema:publicationType "journal" ; scs:useSoftware [ a scs:SoftwareOutcome ; schema:identifier <software/#inPho> ; scs:outcome [ a scs:Outcome ; schema:articleBody """AT THE INDIANA PHILOSOPHY ONTOLOGY PROJECT (InPhO) we have developed and are continuing to develop methods for categorizing and linking philosophical ideas and thinkers."""^^xsd:string ; schema:pageStart "1" ; scs:usageType "development" ; ] ] . |
Entity properties | 2.1 schema:identifier |
Definition | A unique identifier for the individual publication. We strongly recommend using the DOI. |
Rationale | A unique identifier will ensure that individual publications can be found when searched and can be linked to other entities. |
Data constraint | Text |
Obligation | Mandatory |
Repeatable | Yes |
Usage notes | The identifier value must follow this formatting specification: <name_of_identifier>:/<value_of_identifier> Controlled vocabulary for name_of_identifier:
We strongly recommend using the DOI, but other identifiers can be used as long as they are unique. If that is the case, please indicate what type of identifier you are using. Example:
schema:identifier "doi:/10.1088/0004-637X/726/1/55" ; |
Entity properties | 2.2 schema:name |
Definition | The name of the publication, as it appears on the publication |
Rationale | Names or titles are how publications are identified. They also describe the subject matter and are often how people search for publications. |
Data constraint | Text |
Obligation | Mandatory |
Repeatable | No |
Usage notes | Include the name that is on the publication, which can usually be found on the front page of the publication or webpage. Example:
|
Entity properties | 2.3 schema:about |
Definition | Keywords describing the content of the publication |
Rationale | Keywords are often used as search terms and can identify a paper as part of a particular domain or area |
Data constraint | Text |
Obligation | Optional |
Repeatable | Yes |
Usage notes | We recommend that you use a controlled vocabulary such as MeSH (https://meshb.nlm.nih.gov/search) or ACM (https://dl.acm.org/ccs/ccs_flat.cfm), or use the keywords as shown on the front page of the publication. If the keywords are not from a controlled vocabulary, you can specify that they are ‘local’ values. Provide your about values (keywords) as granularly as possible. The ‘about’ value must follow this formatting specification: <name_of_keyword>:/<value_of_the_keyword> Controlled vocabulary for the name_of_keyword:
Example: Following the 2.1 example:
schema:about "local:/cosmology: theory" ; schema:about "local:/galaxies: formation" ; schema:about "local:/stars: formation" ; schema:about "local:/regions" ; Another example, from ACM:
schema:about “acm:/Concurrent Programming” ; schema:about “acm:/Processors—compilers” ; schema:about “acm:/Automatic Programming-program transformation” ; schema:about “local:/compilation” ; schema:about “local:/dependence analysis” ; schema:about “local:/vectorization” ; |
Entity properties | 2.4 scs:author |
Definition | The order of the authors, if applicable, as they appear on the publication |
Rationale | Author order can be an indication of the amount of work contributed, seniority, or other socially important factors and should be included in the record to help describe record accuracy. This may also be a factor in search terms. |
Data constraint | scs:Author; scs:AuthorOrder; |
Obligation | Optional |
Repeatable | Yes |
Usage notes | Details: scs:Author refers to a predefined entity of the class Author. scs:AuthorOrder is a blank node entity that provides flexibility in preserving the order of the authors for a given ScholarlyArticle entity. Record the authors by name in a list, indicating the order in which they appear on the publication. Example:
schema:name "EFFECTS OF VARYING THE THREE-BODY MOLECULAR HYDROGEN FORMATION RATE IN PRIMORDIAL STAR FORMATION" ; scs:author <#author1> ;
schema:name "EFFECTS OF VARYING THE THREE-BODY MOLECULAR HYDROGEN FORMATION RATE IN PRIMORDIAL STAR FORMATION" ; scs:author [ a scs:AuthorOrder ; scs:author <#author1> ; scs:order 1 ; ] ; scs:author [ a scs:AuthorOrder ; scs:author <#author2> ; scs:order 2 ; ] ; |
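To illustrate why the order is stored on each scs:AuthorOrder node rather than implied by position, here is a small Python sketch that recovers the byline order from unordered entries. The dictionary keys and the function name authors_in_order are hypothetical stand-ins for the scs:author and scs:order values of each blank node.

```python
def authors_in_order(entries):
    """Return author references sorted by their recorded order value.

    Each entry is a dict with an 'author' reference and an integer 'order',
    mirroring one scs:AuthorOrder blank node.
    """
    orders = [entry["order"] for entry in entries]
    if len(set(orders)) != len(orders):
        raise ValueError("each order value should appear only once")
    return [entry["author"] for entry in sorted(entries, key=lambda e: e["order"])]


# RDF statements carry no inherent sequence, so entries may arrive in any order.
entries = [
    {"author": "#author2", "order": 2},
    {"author": "#author1", "order": 1},
]
print(authors_in_order(entries))
```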
Entity properties | 2.5 schema:abstract |
Definition | The abstract of the publication, which is a brief summary of the publication contents at the beginning of the publication. |
Rationale | Abstracts are used to provide a brief overview of the publication and often appear in the online searches which help users determine if they will read further. |
Data constraint | Text |
Obligation | Optional |
Repeatable | No |
Usage notes | Use the abstract as it appears on the publication. Example:
schema:abstract "The transformation of atomic hydrogen to molecular hydrogen through three-body reactions is a crucial stage in the collapse of primordial, metal-free halos, where the first generation of stars (Population III stars) in the universe is formed. However, in the published literature, the rate coefficient for this reaction is uncertain by nearly an order of magnitude. We report on the results of both adaptive mesh refinement and smoothed particle hydrodynamics simulations of the collapse of metal-free halos as a function of the value of this rate coefficient. For each simulation method, we have simulated a single halo three times, using three different values of the rate coefficient. We find that while variation between halo realizations may be greater than that caused by the three-body rate coefficient being used, both the accretion physics onto Population III protostars as well as the long-term stability of the disk and any potential fragmentation may depend strongly on this rate coefficient." |
Entity properties | 2.6 schema:publisher |
Definition | The name of the publisher such as the journal or book publisher |
Rationale | Often users search by publisher when looking for scholarly articles, which can also indicate the subject matter. |
Data constraint | schema:Organization ; schema:Person ; |
Obligation | Optional |
Repeatable | Yes |
Usage notes | The publisher information should be included at the top of the front page of the publication, or the website it is hosted on. We are only expecting the journal name, but you could include the page and volume details as they appeared on the publication. Example:
schema:publisher "The Astrophysical Journal" ; |
Entity properties | 2.7 schema:datePublished |
Definition | The date the publication was published |
Rationale | Publication dates can indicate relevance, and are also used as search parameters by users. |
Data constraint | Date |
Obligation | Optional |
Repeatable | No |
Usage notes | The date published information should be included at the top of the front page of the publication, or the website it is hosted on. We recommend that you follow the format “year-month-day” (YYYY-MM-DD) if the value is recorded as text. Example:
schema:datePublished "2011-01-05" ; |
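A record creator could check the recommended YYYY-MM-DD format before saving a value. The Python sketch below is one strict way to do that; the function name parse_date_published is hypothetical. Note that it rejects year-only values such as "2013", which the schema itself does not forbid, so it reflects the recommendation rather than a requirement.

```python
import re
from datetime import date

# Exactly four digits, two digits, two digits, separated by hyphens.
_DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_date_published(value: str) -> date:
    """Parse a 'YYYY-MM-DD' string into a date, raising ValueError otherwise."""
    if not _DATE_PATTERN.fullmatch(value):
        raise ValueError(f"expected YYYY-MM-DD, got {value!r}")
    year, month, day = (int(part) for part in value.split("-"))
    # date() itself raises ValueError for impossible dates such as 2011-02-30.
    return date(year, month, day)


print(parse_date_published("2011-01-05"))
```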
Entity properties | 2.8 schema:publicationType |
Definition | The type of publication, that is, the format in which the publication appeared when published. |
Rationale | Different types of publications can indicate the purpose, review process, and domain. |
Data constraint | Text |
Obligation | Optional |
Repeatable | Yes |
Usage notes | We strongly recommend using a term from the MeSH controlled vocabulary (https://www.nlm.nih.gov/mesh/pubtypes.html). For example:
Example:
schema:publicationType "journal article" ; |
Entity types | 2.9 scs:citeSoftware |
Definition | This property represents the citations of software from a publication to the SoftwareApplication entity. One publication can have many citations of, or links to, SoftwareApplication entities. Besides linking the software application to the citation entity, we can also provide outcomes that capture how the software was used and how that use is represented in the paper. This is important for determining the usage of the software in the paper. |
Rationale | Because we want to understand the relation between a publication and software, a publication entity must be created first. Each record of this software citation property must be attached to one SoftwareApplication or scs:SoftwareOutcome. For the purposes of this schema, a ScholarlyArticle entity must have at least one scs:citeSoftware property. The scs:SoftwareOutcome entity can carry information about the usage of the software in the article. The coder can obtain this value by looking at the article and the annotated figures or text that mention the usage of software in the paper. |
Data constraint | schema:SoftwareApplication; scs:SoftwareOutcome; |
Obligation | Required |
Repeatable | Yes |
Usage notes | Example:
schema:identifier "doi:/10.1088/0004-637X/726/1/55" ; scs:citeSoftware <http://software-citation.org/sw/yt> ;
schema:identifier "doi:/10.1088/0004-637X/726/1/55" ; scs:citeSoftware [ a scs:SoftwareOutcome ; schema:identifier <http://software-citation.org/sw/yt> ; scs:outcome [ a scs:Outcome ; schema:articleBody "Figure 1. Script that load" ; scs:usageType "visualization" ; ] ] Details about the SoftwareOutcome entity are explained in 2.9.1 scs:SoftwareOutcome. |
Entity types | 2.9.1 scs:SoftwareOutcome |
Entity properties | 2.9.1.1 schema:identifier 2.9.1.2 scs:outcome 2.9.1.2.1 scs:Outcome |
Subclass of | schema:Thing; |
Definition | A class definition for the values of the scs:citeSoftware property of the Software Citation type. This class provides a schema that gives a more detailed explanation of the cited software and its outcome as stated in the paper/publication. |
Rationale | This class adds flexibility in defining the outcome of the software usage that is presented in the paper/publication. This class is meant to be used as a blank node and to link to only one software citation on the ScholarlyArticle entity. |
Entity properties | 2.9.1.1 schema:identifier |
Definition | Link to the cited software application. |
Rationale | At least one publication must have a SoftwareApplication citation to build the software citation corpus. This identifier must refer to the predefined SoftwareApplication entity. |
Data constraint | schema:SoftwareApplication; |
Obligation | Required |
Repeatable | No |
Usage notes | Example:
schema:identifier <http://software-citation.org/sw/yt> |
Entity properties | 2.9.1.2 scs:outcome |
Definition | More detailed explanation about the usage of software in the article |
Rationale | The scs:SoftwareOutcome entity can carry information about the usage of software in the article. To create a record, one can obtain this value by looking at the article and the annotated figures or text that mention the usage of software in the paper. |
Data constraint | schema:Article; scs:Outcome; |
Obligation | Required |
Repeatable | Yes |
Usage notes | Example:
schema:SoftwareApplication [ a scs:SoftwareOutcome ; schema:identifier <http://software-citation.org/sw/yt> ; scs:outcome [ a schema:Article ; schema:articleBody "Figure 1. Script that load" ] ]
schema:SoftwareApplication [ a scs:SoftwareOutcome ; schema:identifier <http://software-citation.org/sw/yt> ; scs:outcome [ a scs:Outcome ; schema:articleBody "Figure 1. Script that load" ; scs:usageType "visualization" ] ] More details about the scs:Outcome class are explained in 2.9.1.2.1 scs:Outcome. |
Entity types | 2.9.1.2.1 scs:Outcome |
Entity properties | 2.9.1.2.1.1 schema:articleBody 2.9.1.2.1.2 schema:pageStart 2.9.1.2.1.3 scs:usageType |
Subclass of | schema:Article |
Definition | A sub-entity of scs:SoftwareOutcome. This class is a subclass of schema:Article, which can define the body of the article and the section and page where the citation occurs. In addition, the scs:usageType property helps define how the cited software application is used in the publication. |
Rationale | This class allows outcomes to be accurately described and found within a scholarly article. This is also where outcomes can be assigned a type, which could be used as a search term later on. |
Entity properties | 2.9.1.2.1.1 schema:articleBody |
Definition | The part of the article that mentions, uses, or cites the software |
Rationale | The text recorded with this property can serve as a supporting statement that the publication cites the SoftwareApplication. |
Data constraint | schema:Text |
Obligation | Required |
Repeatable | Yes |
Usage notes | Example: schema:SoftwareApplication [ a scs:SoftwareOutcome ; schema:identifier <http://software-citation.org/sw/yt> ; scs:outcome [ a scs:Outcome ; schema:articleBody "Figure 1. Script that load" ; scs:usageType "visualization" ; ] ] |
Entity properties | 2.9.1.2.1.2 schema:pageStart |
Definition | The page of the article where the software is mentioned, used, or cited |
Rationale | One outcome can have one pageStart, which defines the page location of the outcome. The page number is relative to the article (not to the whole volume). |
Data constraint | schema:Numeric |
Obligation | Optional |
Repeatable | Yes |
Usage notes | Example: schema:SoftwareApplication [ a scs:SoftwareOutcome ; schema:identifier <http://software-citation.org/sw/yt> ; scs:outcome [ a scs:Outcome ; schema:articleBody "Figure 1. Script that load" ; schema:pageStart 2 ; scs:usageType "visualization" ; ] ] |
Entity properties | 2.9.1.2.1.3 scs:usageType |
Definition | If a user links the software citation to the scs:Outcome class, the entity must have at least the usageType to distinguish the usage of this class from the schema:SoftwareApplication class |
Rationale | We want to understand how the software is used in the publication: for visualization, computation, workflow, etc. We strongly recommend using a controlled vocabulary. |
Data constraint | schema:Text ; Use the controlled vocabulary:
|
Obligation | Optional |
Repeatable | Yes |
Usage notes | Example: schema:SoftwareApplication [ a scs:SoftwareOutcome ; schema:identifier <http://softwarecitation.web.illinois.edu/sw/yt> ; scs:outcome [ a scs:Outcome ; schema:articleBody "Figure 1. Script that load" ; schema:pageStart 2 ; scs:usageType "visualization" ; ] ] |
Entity properties | 2.10 schema:url |
Definition | URL or web page where we can find a digital copy of the actual scholarly article we are recording in this entity. |
Rationale | In the future, we may want to check the actual work (article) annotated in this entity. This URL provides limited provenance for the origin of an article. |
Data constraint | schema:URL |
Obligation | Optional |
Repeatable | Yes |
Usage notes | Example:
schema:url <https://www.journals.uchicago.edu/doi/abs/10.1086/673276> ; |
Entity types | 3. schema:SoftwareApplication |
Entity properties | 3.1 schema:name 3.2 schema:alternate 3.3 schema:description 3.4 schema:about 3.5 schema:relatedLink 3.6 schema:funder 3.7 schema:version 3.8 schema:license 3.9 schema:isBasedOn 3.10 schema:softwareRequirements |
Subclass of | schema:SoftwareApplication |
Definition | The software application class is mainly derived from the class schema:SoftwareApplication. This entity defines the software that will later be attached to the ScholarlyArticle entity that uses it. The software can be a library, an open source code package, or an application. |
Rationale | The purpose of the SoftwareApplication entity in this metadata is to explain the usage of particular open source software in the publication. We strongly recommend using this entity for open source software only. |
Complete Example | <software/#inPho> a schema:SoftwareApplication ; schema:name "The InPhO Project"^^xsd:string ; schema:alternate "Internet Philosophy Ontology (InPhO) project"^^xsd:string ; schema:alternate "Indiana Philosophy Ontology"^^xsd:string ; schema:description """Indiana Philosophy Ontology (InPhO) project, which uses a combination of automated methods and expert feedback to create a dynamic computational ontology for the discipline of philosophy"""^^xsd:string ; schema:about "local:/data mining"^^xsd:string ; schema:about "local:/natural language processing"^^xsd:string ; schema:about "local:/Expert Feedback"^^xsd:string ; schema:about "local:/Machine Reasoning"^^xsd:string ; schema:relatedLink <https://www.inphoproject.org> ; schema:funder "NEH"^^xsd:string ; schema:license "CC BY-NC-SA 3.0"^^xsd:string ; schema:softwareRequirements "browser" ; . |
Entity properties | 3.1 schema:name |
Definition | Name of the software, application, or package. |
Rationale | Software must have a name in order to be known by the public and to exist in a code repository. Software application names are typically unique. |
Data constraint | Text |
Obligation | Mandatory |
Repeatable | No |
Usage notes | It is strongly recommended to use the package name, the common name of the software, or the name given in the repository.
For other repositories, please specify the name of the repository and the URL in the download link. Example:
|
Entity properties | 3.2 schema:alternate |
Definition | Alternate name that is known by the public. It can also be the long name if the package name is an abbreviation, or any other name that provides more information and refers to the same SoftwareApplication. |
Rationale | Sometimes the package name is not meaningful or distinctive enough for some users. An alternate name is helpful when the software is better known to the public under another name. A package name can also be an abbreviation, in which case the alternate name provides additional information about the library or software. |
Data constraint | Text |
Obligation | Optional |
Repeatable | Yes |
Usage notes | Each schema:alternate property must define only one alternate name. If the software has multiple alternate names, repeat the property once per name. The order of the properties is not preserved. Example:
schema:alternate "yours truly" ;
schema:alternate "SciPy Toolkits" ; |
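The one-name-per-property rule above can be sketched as a small helper that emits one line per alternate name. This is an illustrative sketch only: `turtle_literal` and `repeated_property` are hypothetical helpers, and the escaping covers just backslashes and double quotes in single-line strings.

```python
def turtle_literal(text):
    """Quote text as a single-line Turtle string, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def repeated_property(prop, values):
    """Emit one 'prop "value" ;' line per distinct value, in first-seen order."""
    seen = []
    for value in values:
        if value not in seen:
            seen.append(value)
    return [f"{prop} {turtle_literal(v)} ;" for v in seen]


for line in repeated_property("schema:alternate", ["yours truly", "SciPy Toolkits"]):
    print(line)
```

Dropping duplicates is a design choice of this sketch; since the property does not preserve order, repeating an identical name would add no information.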
Entity properties | 3.3 schema:description |
Definition | A free-form sentence describing the software application. |
Rationale | This descriptive property explains or provides more details about the software application. The information can be derived from any source. However, if one exists, we strongly recommend using the description from the package repository. |
Data constraint | Text |
Obligation | Optional |
Repeatable | Yes |
Usage notes | For a Python library, the description can be found on the PyPI page. In the future, this value could ideally be filled in automatically if the URL for the package is provided. Example: this is how you can derive the description from the PyPI page
schema:description "yt is an open-source, permissively-licensed python package for analyzing and visualizing volumetric data. Yt supports…." ; |
Entity properties | 3.4 schema:about |
Definition | Keywords that describe the usage or objective of the library. |
Rationale | A software application might have been developed for a specific purpose. This property gives more information about the particular criteria under which this SoftwareApplication entity can be useful. |
Data constraint | Text |
Obligation | Optional |
Repeatable | Yes |
Usage notes | It is strongly recommended to use a controlled vocabulary, or the keywords in the package repository if they exist. For example, in the PyPI repository for Python, the controlled vocabulary can be found in the Topic column. This can also be automated if the URL is given. This is how we can get this value from the Python package manager:
Example:
schema:about "Scientific/Engineering :: Astronomy" ; schema:about "Scientific/Engineering :: Physics" ; schema:about "Scientific/Engineering :: Visualization" ; |
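The automation mentioned in the usage notes could look roughly like the sketch below. It assumes the package metadata has already been fetched from the PyPI JSON endpoint (https://pypi.org/pypi/yt/json), whose `info.classifiers` field lists the trove classifiers. `SAMPLE_INFO` is a trimmed, illustrative stand-in rather than a live response, and `about_from_classifiers` is a hypothetical helper.

```python
# Trimmed, illustrative stand-in for the "info" object of a PyPI JSON
# response; the values are placeholders, not a live response.
SAMPLE_INFO = {
    "name": "yt",
    "classifiers": [
        "Programming Language :: Python :: 3",
        "Topic :: Scientific/Engineering :: Astronomy",
        "Topic :: Scientific/Engineering :: Physics",
        "Topic :: Scientific/Engineering :: Visualization",
    ],
}


def about_from_classifiers(classifiers):
    """Keep only 'Topic :: ...' classifiers and drop the 'Topic :: ' prefix."""
    prefix = "Topic :: "
    return [c[len(prefix):] for c in classifiers if c.startswith(prefix)]


for topic in about_from_classifiers(SAMPLE_INFO["classifiers"]):
    print(f'schema:about "{topic}" ;')
```

Working on an already-downloaded payload keeps the sketch runnable offline; fetching the JSON is left to whichever HTTP client the record creator prefers.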
Entity properties | 3.5 schema:relatedLink |
Definition | Homepage of the library or software, or the page where the software can be found and downloaded. |
Rationale | The provided URL can be used to derive information that can be scraped automatically. For example, the schema:about and schema:description information can be captured automatically if the URL of the PyPI package is provided. The automated script is not provided with this schema. |
Data constraint | schema:URL |
Obligation | Optional |
Repeatable | Yes |
Usage notes | For a Python package, this is the link to the PyPI page; the name can be found on http://pypi.org as explained in section 3.4. For an R package, this is the link to the R repository page; the name can be found at https://cran.r-project.org/web/packages/available_packages_by_name.html Example:
schema:relatedLink <https://pypi.org/project/yt/> ;
schema:alternate "glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models" ; schema:relatedLink <https://cran.r-project.org/web/packages/glmnet/index.html> ; |
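Because both example links follow a fixed pattern, the schema:relatedLink value can be built from the package name. The sketch below assumes the two URL patterns shown in the examples above hold for other packages as well; `PATTERNS` and `related_link` are hypothetical names.

```python
from urllib.parse import quote

# URL patterns copied from the two examples in the usage notes.
PATTERNS = {
    "pypi": "https://pypi.org/project/{name}/",
    "cran": "https://cran.r-project.org/web/packages/{name}/index.html",
}


def related_link(name, repository):
    """Return a 'schema:relatedLink <...> ;' line for a package name."""
    try:
        pattern = PATTERNS[repository.lower()]
    except KeyError:
        raise ValueError(f"unsupported repository: {repository!r}") from None
    # Percent-encode the name so that the result is always a valid IRI.
    return f"schema:relatedLink <{pattern.format(name=quote(name, safe=''))}> ;"


print(related_link("yt", "pypi"))
print(related_link("glmnet", "CRAN"))
```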
Entity properties | 3.7 schema:version |
Definition | Version of the software application. This optional element provides more detail when the software used in the publication refers to a specific version. |
Rationale | Software applications often have multiple versions. A publication might use a specific version for a reason, or simply because it was the version available when the paper was written or published. Recording the version can help with reproducibility of an outcome. |
Data constraint | Text. We strongly suggest following the Python or R versioning format if the software is a Python or R library. |
Obligation | Optional |
Repeatable | No |
Usage notes | This is how you can get the version number for R or Python.
The circled number on this yt PyPI page example (https://pypi.org/project/yt/) represents the latest version available in the repository. To see other versions of this particular PyPI package, click the release history menu on the left. It shows each version and the date on which it was released.
schema:relatedLink <https://pypi.org/project/yt/> ; schema:version "3.5.1" ; For R: the version can be found in the highlighted section on the R package page.
schema:relatedLink <https://cran.r-project.org/web/packages/glmnet/> ; schema:version "3.6-1" ; |
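The suggestion to follow the Python or R versioning format can be approximated with a simple shape check. The patterns below are deliberate simplifications and assumptions of this sketch: the Python pattern accepts only dotted integers such as 3.5.1 (not the pre-, post-, or development releases that PEP 440 also allows), and the R pattern accepts at least two integers separated by "." or "-". `looks_like_version` is a hypothetical helper.

```python
import re

# Simplified patterns (assumptions of this sketch, not full specifications).
PYTHON_RELEASE = re.compile(r"\d+(\.\d+)*")   # e.g. "3.5.1"
R_VERSION = re.compile(r"\d+([.-]\d+)+")      # e.g. "3.6-1"


def looks_like_version(value, language):
    """Return True if value has the expected shape for 'python' or 'r'."""
    pattern = {"python": PYTHON_RELEASE, "r": R_VERSION}[language]
    return pattern.fullmatch(value) is not None


print(looks_like_version("3.5.1", "python"))
print(looks_like_version("3.6-1", "r"))
```

A failed check is best treated as a prompt to re-read the package page rather than as proof that the value is wrong.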
Entity properties | 3.8 schema:license |
Definition | Defines the open source license that applies to the software application. A single software application can have multiple open source licenses. |
Rationale | The open source license is an important part of open source development and use because it determines how the software can be published, reused, or modified by users. Some licenses do not allow commercial use; others are copyleft, such as the GPL, which requires a published application that uses the GPL software to be released under the GPL as well. Capturing this information is also important if we want to compare the distribution of open source licenses in the open source community, once we have a large enough collection or are able to automate this process. |
Data constraint | Text. We strongly suggest using only values from the list of open source licenses: https://opensource.org/licenses/alphabetical |
Obligation | Optional |
Repeatable | Yes |
Usage notes | To get this value from the PyPI package manager:
The circled part on this yt PyPI page example (https://pypi.org/project/yt/) shows the license attached to this package. For most packages that include license information, it can be found in the Meta section of the left menu.
schema:relatedLink <https://pypi.org/project/yt/> ; schema:version "3.5.1" ; schema:license "BSD License (BSD 3-Clause)" ; For R: the license can be found in the highlighted section on the R package page.
schema:relatedLink <https://cran.r-project.org/web/packages/glmnet/> ; schema:version "3.6-1" ; schema:license "GPL-2" ; |
Entity properties | 3.9 schema:isBasedOn |
Definition | If a software application is derived from or refers to a specific version, or if the application is a submodule of a bigger project, this property preserves that relationship and provides a link to the work or project associated with this SoftwareApplication entity. |
Rationale | A SoftwareApplication project or library might have submodules, or be part of a larger work that is related to this application. |
Data constraint | schema:SoftwareApplication |
Obligation | Optional |
Repeatable | Yes |
Usage notes | Use only a SoftwareApplication entity that is already defined in the dataset. Example:
schema:isBasedOn <http://software-citation.org/sw/yt> ; |
Entity properties | 3.10 schema:softwareRequirements |
Definition | A library requires a programming language to run, such as Java, R, or Python; this property records that requirement. |
Rationale | This property explains which components the SoftwareApplication entity needs in order to run or operate properly. |
Data constraint | schema:Text : use a controlled vocabulary
schema:SoftwareApplication : predefined SoftwareApplication entity |
Obligation | Optional |
Repeatable | Yes |
Usage notes | Example:
schema:softwareRequirements "python" ;
schema:softwareRequirements "r" ; |
Entity types | 4. schema:SoftwareSourceCode |
Entity properties | 4.1 schema:codeRepository 4.2 schema:isBasedOn 4.3 schema:targetProduct 4.4 schema:description 4.5 schema:contributor |
Subclass of | schema:SoftwareSourceCode |
Definition | The SoftwareSourceCode class is mainly derived from the class schema:SoftwareSourceCode. This entity defines the location of the source code on the internet or in an open source repository such as GitHub, GitLab, or SVN. |
Rationale | This entity preserves information about the source code and can later be used to track the source code by version of the SoftwareApplication. In addition, this entity is the parent of the contributor property, through which we can see the people who contributed to the application's development based on the source code. |
Complete Example | <repository/#inPho> a schema:SoftwareSourceCode ; schema:codeRepository <https://github.com/inpho/> ; schema:targetProduct <software/#inPho> ; schema:description "Internet Philosophy Ontology (InPhO) Project main repository. Contains several sub items related to the InPho project" ; schema:contributor <developer/#jaimieMurdock> ; schema:contributor <developer/#colinAllen> ; schema:contributor <developer/#kirtanSakariya> ; schema:contributor <developer/#sriramIyer> ; . |
Entity properties | 4.1 schema:codeRepository |
Definition | URL of the repository where the source code of a software application can be found. |
Rationale | In the open source community, everyone can contribute to the development of the software. Bug fixing and feature improvement are part of the development lifecycle. The repository URL provides the information needed to measure contributions to the project. |
Data constraint | schema:URL |
Obligation | Required |
Repeatable | No |
Usage notes | Example:
schema:codeRepository <https://github.com/yt-project/yt> ; |
Entity properties | 4.2 schema:isBasedOn |
Definition | Other software or a publication that inspired, or is related to, the development of this software repository. |
Rationale | For software developed on the basis of a publication, this property can link to the predefined publication. For a group of developers who continue or fork another open source project, it can point to the SoftwareApplication or to the URL of the other codeRepository. |
Data constraint | schema:SoftwareApplication, schema:Publication, schema:SoftwareSourceCode, schema:URL We strongly recommend using the predefined SoftwareApplication, Publication or SoftwareSourceCode entities on the dataset that are strongly related to this source code or repository. |
Obligation | Optional |
Repeatable | Yes |
Usage notes | Example:
schema:codeRepository <https://github.com/yt-project/unyt> ; schema:isBasedOn <https://github.com/yt-project/yt> ; This entity describes the repository unyt, a work that is based on yt. |
Entity properties | 4.3 schema:targetProduct |
Definition | The target library, package, or binary that this source code is used for or compiled into. |
Rationale | For software developed on the basis of a publication, this property can link to the predefined publication. For a group of developers who continue or fork another open source project, it can point to the SoftwareApplication or to the URL of the other codeRepository. |
Data constraint | schema:SoftwareApplication |
Obligation | Required |
Repeatable | No |
Usage notes | This property must link to the predefined SoftwareApplication entity.
schema:codeRepository <https://github.com/yt-project/yt> ; schema:targetProduct <http://software-citation.org/sw/yt> ; |
Entity properties | 4.4 schema:description |
Definition | Additional information about the repository. Most of the time we can use the text that describes the repository, such as the repository title or description. |
Rationale | A summary of the content of the software repository. Often this value can be taken from the front page of the repository URL or from the README.md file in the repository. |
Data constraint | schema:Text |
Obligation | Optional |
Repeatable | Yes |
Usage notes | Example:
schema:description "A toolkit for analysis and visualization of volumetric data …." ; |
Entity properties | 4.5 schema:contributor |
Definition | The people who contribute to the development of this software by participating in code development, future enhancements, or bug fixing. |
Rationale | Source code in the open source community usually has more than one contributor. Sometimes a contributor is also an author, if the software is based on a publication, but most of the time the coders have no association with the publication. This property links to the scs:CodeMaintainer entities that contribute to the source code, to give proper attribution to software maintainers, bug fixers, and developers. |
Data constraint | scs:CodeMaintainer |
Obligation | Optional |
Repeatable | Yes |
Usage notes | For now, this property does not preserve ordinality, so it can be repeated without maintaining the order of the contributions. This property must be attached to a predefined scs:CodeMaintainer entity. Example:
schema:contributor <http://software-citation.org/cm/mathewTurk> ; |
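The Obligation and Repeatable rows of sections 4.1 to 4.5 can be checked mechanically before a record is published. The following is a minimal sketch that assumes a record is held as a Python dictionary mapping each property to a list of values; `RULES` transcribes the tables above, while `validate` and the dictionary layout are hypothetical.

```python
# (obligation, repeatable) per property, transcribed from sections 4.1-4.5.
RULES = {
    "schema:codeRepository": ("Required", False),
    "schema:isBasedOn": ("Optional", True),
    "schema:targetProduct": ("Required", False),
    "schema:description": ("Optional", True),
    "schema:contributor": ("Optional", True),
}


def validate(record):
    """Return a list of problems for a record given as {property: [values]}."""
    problems = []
    for prop, (obligation, repeatable) in RULES.items():
        values = record.get(prop, [])
        if obligation == "Required" and not values:
            problems.append(f"{prop} is required")
        if not repeatable and len(values) > 1:
            problems.append(f"{prop} is not repeatable")
    for prop in record:
        if prop not in RULES:
            problems.append(f"{prop} is not defined for this entity")
    return problems


record = {
    "schema:codeRepository": ["https://github.com/yt-project/yt"],
    "schema:targetProduct": ["http://software-citation.org/sw/yt"],
}
print(validate(record))
```

The same table-driven approach would extend to the other entity types by adding one rule dictionary per entity.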
Entity types | 5. scs:CodeMaintainer |
Entity properties | 5.1 schema:identifier 5.2 schema:name 5.3 schema:sameAs |
Subclass of | scs:Author |
Definition | This entity represents a contributor to a particular source code. |
Rationale | Sometimes a coder or developer of an open source application is not related to the publication's authors. This entity represents people who work on the application's code development but have no relation to the publication's authorship; we call this entity CodeMaintainer. |
Complete Example | <developer/#jaimieMurdock> a scs:CodeMaintainer ; schema:identifier <https://github.com/JaimieMurdock> ; schema:name "Jaimie Murdock" ; schema:sameAs <author/#jaimieMurdock> ; . |
Entity properties | 5.1 schema:identifier |
Definition | Identifier of the code maintainer. |
Rationale | A code maintainer must have an identifier related to the work they have done on the software. Because we are working with open source software, the code maintainer can easily be derived from the repository where the code is hosted. For better preservation, we strongly recommend using the code maintainer's URL from the repository. In the future, this value could be derived automatically if the repository provides an API call or function that lists the code maintainers. |
Data constraint | schema:URL |
Obligation | Required |
Repeatable | No |
Usage notes | This is how you can get the CodeMaintainer value from the application repository.
From the project page, click the contributors tab at the top right.
You will get the detailed contributors page, where you can click the link on a name.
The name contains a URL that redirects the browser to the user page. Use this URL as the contributor ID.
Example: <cm/#JamieMurdock> a scs:CodeMaintainer ; schema:identifier <https://github.com/JaimieMurdock> ; |
Entity properties | 5.2 schema:name |
Definition | Name of the code maintainer |
Rationale | A code maintainer can have a different name on their repository user page. We strongly recommend using this property for the display name on the user page if there is no information about the real name. |
Data constraint | schema:Text |
Obligation | Required |
Repeatable | Yes |
Usage notes | As with the identifier, this value should be derived from the user page. Example: <cm/#JamieMurdock> a scs:CodeMaintainer ; schema:identifier <https://github.com/JaimieMurdock> ; schema:name "Jaimie Murdock" ; |
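For repositories hosted on GitHub, the identifier and a fallback name could be derived from the contributors listing of the GitHub REST API (GET /repos/{owner}/{repo}/contributors), whose items carry `login` and `html_url` fields. The sketch below works on an illustrative sample rather than a live response. Using the login as schema:name is an assumption of this sketch for cases where no display name is known, and `maintainer_record` is a hypothetical helper.

```python
# Illustrative sample shaped like two items of a GitHub contributors
# response; the logins are placeholders, not real contributors.
SAMPLE_CONTRIBUTORS = [
    {"login": "example-dev", "html_url": "https://github.com/example-dev"},
    {"login": "another-dev", "html_url": "https://github.com/another-dev"},
]


def maintainer_record(contributor):
    """Render one scs:CodeMaintainer entity; the login stands in for the name."""
    login = contributor["login"]
    return "\n".join([
        f"<cm/#{login}> a scs:CodeMaintainer ;",
        f"    schema:identifier <{contributor['html_url']}> ;",
        f'    schema:name "{login}" ;',
        "    .",
    ])


for item in SAMPLE_CONTRIBUTORS:
    print(maintainer_record(item))
```

A record creator would normally replace the login with the display name shown on the user page, as recommended in the rationale for schema:name.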
Entity properties | 5.3 schema:sameAs |
Definition | If this developer is also an author of a publication, use this property to link the contributor to the Author entity. |
Rationale | When open source code is based on a publication, we should have a link that connects the application developer to the authorship of the publication. This property preserves that information so that such intersections can be found when querying the data. |
Data constraint | scs:Author |
Obligation | Optional |
Repeatable | Yes |
Usage notes | Example: <cm/#JamieMurdock> a scs:CodeMaintainer ; schema:identifier <https://github.com/JaimieMurdock> ; schema:name "Jaimie Murdock" ; schema:sameAs <ra/#JamieMurdock> ; |
The first publication about InPhO discusses why and how this software was developed (purpose and goal).
Example: http://softwarecitation.web.illinois.edu/corpus/article/#inphoarticle1
Combining the HathiTrust Research Center (HTRC) Data Capsule and the InPhO software is a good use case for associating two different open source packages in one publication.
Example: http://softwarecitation.web.illinois.edu/corpus/article/#inphoarticle2
https://www.researchgate.net/publication/267806761_InPhO_for_All_Why_APIs_Matter
This paper talks about the usage of one particular function in the open-source library.
Example: http://softwarecitation.web.illinois.edu/corpus/article/#inphoarticle3
This paper talks about the development community of this project.
Example: http://softwarecitation.web.illinois.edu/corpus/article/#inphoarticle4
This data dictionary was created to add support for scientific software citations with the intention of crediting all humans, entities and software involved. We aimed to create an RDF structured schema that would connect all components accurately.