The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations. NIF consists of specifications, ontologies and software (overview), which are combined under the version identifier "NIF 2.0", but are versioned individually.
This document specifies the core of NIF, consisting of:
nif:wasConvertedFrom
provenance link.
NIF is primarily designed to store and transfer text and text annotations.
In order to enter the NIF and RDF world, two things are required: (1) the text, also called the primary data, must be converted to an RDF literal as the object of the nif:isString property, and (2) there must be a way to programmatically mint URIs so that annotations can be attached to the text. In the example below, annotations can be added to the <SubjectURI>, which serves as the context, i.e. a representative for the string in nif:isString.
<SubjectURI> nif:isString "Your text, e.g. a single sentence or the content of a whole document; basically any sequence of characters." .
curl --data-urlencode input="My favourite actress is Natalie Portman." -d informat=text "http://nlp2rdf.lod2.eu/nif-ws.php"
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .

<http://nlp2rdf.lod2.eu/nif-ws.php#char=0,40>
    rdf:type nif:RFC5147String , nif:Context ;
    nif:beginIndex "0" ;
    nif:endIndex "40" ;
    nif:isString "My favourite actress is Natalie Portman." .
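The two steps described above (wrapping the text in a nif:isString literal and minting a URI for it) can be sketched in a few lines of Python. This is a minimal illustration, not the web service's implementation: the helper name nif_context_turtle is hypothetical, the prefix declarations are omitted, and the example prefix http://example.org/doc# is an assumption.

```python
def nif_context_turtle(prefix: str, text: str) -> str:
    """Serialize `text` as a nif:Context in Turtle (hypothetical helper).

    Assumes `prefix` already ends in a separator such as '#'.
    """
    end = len(text)  # in Python 3, len() on str counts Unicode code points
    uri = f"{prefix}char=0,{end}"
    # Escape characters that would break a short Turtle string literal.
    literal = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return "\n".join(
        [
            f"<{uri}>",
            "    rdf:type nif:RFC5147String , nif:Context ;",
            '    nif:beginIndex "0" ;',
            f'    nif:endIndex "{end}" ;',
            f'    nif:isString "{literal}" .',
        ]
    )


if __name__ == "__main__":
    print(nif_context_turtle("http://example.org/doc#",
                             "My favourite actress is Natalie Portman."))
```

For the 40-character example sentence, the minted subject ends in char=0,40, matching the web service output shown above.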
http://persistence.uni-leipzig.org/nlp2rdf/specification/example/david_lynch_dune_quoteid_124
as non-information resource URI (a global identifier independent of the data and representation).
A web server such as Apache can now be configured to return various information resources via content negotiation (HTTP "Accept:" header) and "303 - See Other" redirects, as is common practice in Linked Data:
text/plain 303-redirects to david_lynch_dune_quoteid_124.txt
text/html 303-redirects to an HTML visualization: david_lynch_dune_quoteid_124.php
text/turtle 303-redirects to RDF in Turtle: david_lynch_dune_quoteid_124.ttl
application/ld+json or application/json 303-redirects to RDF in JSON-LD: david_lynch_dune_quoteid_124.json
application/rdf+xml 303-redirects to RDF in RDF/XML: david_lynch_dune_quoteid_124.owl

<http://persistence.uni-leipzig.org/nlp2rdf/specification/example/david_lynch_dune_quoteid_124#char=0,600>
    rdf:type nif:RFC5147String , nif:Context ;
    nif:sourceUrl <http://persistence.uni-leipzig.org/nlp2rdf/specification/example/david_lynch_dune_quoteid_124.txt> ;
    nif:beginIndex "0" ;
    nif:endIndex "600" ;
    nif:isString """# Quote 124 from David Lynch's Dune ...
http://persistence.uni-leipzig.org/nlp2rdf/specification/example/david_lynch_dune_quoteid_124.ttl
curl --data-urlencode input@david_lynch_dune_quoteid_124.txt --data-urlencode prefix="http://persistence.uni-leipzig.org/nlp2rdf/specification/example/david_lynch_dune_quoteid_124#" -d informat=text "http://nlp2rdf.lod2.eu/nif-ws.php" > david_lynch_dune_quoteid_124.ttl
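The content negotiation described earlier for the non-information resource URI can be illustrated with a small Python sketch that maps an Accept header to the target of the 303 redirect. It is only an illustration of the mapping, not an Apache configuration: the function name redirect_target is hypothetical, and falling back to text/html when no listed media type matches is an assumption that the specification does not state.

```python
# Media types and file extensions as listed for the example resource.
REPRESENTATIONS = {
    "text/plain": ".txt",
    "text/html": ".php",
    "text/turtle": ".ttl",
    "application/ld+json": ".json",
    "application/json": ".json",
    "application/rdf+xml": ".owl",
}


def redirect_target(resource_uri: str, accept_header: str,
                    default: str = "text/html") -> str:
    """Return the URI a 303 redirect would point to (hypothetical helper)."""
    candidates = []
    for position, part in enumerate(accept_header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        quality = 1.0
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        if media_type in REPRESENTATIONS and quality > 0:
            # Highest quality first; ties are broken by order in the header.
            candidates.append((-quality, position, media_type))
    chosen = min(candidates)[2] if candidates else default  # assumed fallback
    return resource_uri + REPRESENTATIONS[chosen]


if __name__ == "__main__":
    uri = ("http://persistence.uni-leipzig.org/nlp2rdf/specification/"
           "example/david_lynch_dune_quoteid_124")
    print(redirect_target(uri, "text/turtle"))
```

A client sending "Accept: text/turtle" would thus be redirected to the .ttl file produced by the curl command above.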
According to the RDF 1.1 specification (3.3 Literals), RDF literals are Unicode strings, which should be in Normal Form C (NFC). In NIF, we will follow this recommendation in general. There are, however, circumstances which require the use of Normal Form D (NFD) or even NFKC or NFKD. Therefore NIF allows NFD, NFKC and NFKD, if the use case justifies the usage.
One such use case arises when a linguistic annotator needs to annotate individual diacritics or parts of precomposed characters and syllables. For linguists with this use case, or for the applicable languages, using NFD is the obvious and well-justified choice. We only give examples here and refer the interested reader to these three documents: Gernot Katzer's page about the Korean writing system, the Wikipedia article about Korean Hangul, and the Unicode Normal Form specification.
Example 1 (taken from the Unicode Normal Form spec):
Composed: the ligature ﬁ (U+FB01, left unchanged by both NFC and NFD) or ñ (NFC)
Decomposed: f, i (NFKC and NFKD) or n, ~ (NFD)
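The behaviour of the four normal forms on these two characters can be checked with Python's standard unicodedata module. The following is a small sketch. The helper names normal_forms and code_points are hypothetical, and the characters are written as escapes so that the source file itself cannot be normalized by accident.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normal_forms(s: str) -> dict:
    """Return the string in all four Unicode normal forms."""
    return {form: unicodedata.normalize(form, s) for form in FORMS}


def code_points(s: str) -> str:
    """Render a string as a list of code points, e.g. 'U+006E U+0303'."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


if __name__ == "__main__":
    samples = (("n with tilde", "\u00f1"), ("fi ligature", "\ufb01"))
    for label, sample in samples:
        for form, value in normal_forms(sample).items():
            print(f"{label:13} {form:5} {code_points(value)}")
```

The canonical forms (NFC, NFD) leave the ligature intact and differ only for ñ, whereas the compatibility forms (NFKC, NFKD) replace the ligature with the two letters f and i.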
Precomposed Hangul 훯, three conjoining Jamo (H+WEO+LH) 훯, the same three Jamo enclosed in some markup to prevent their joining 훠 ᆶ and three Compatibility Jamo ㅎㅝㅀ. Ideally, only the first two should render identically as compound Hangul.
Counting characters:
Java: "ä".length() == 1
PHP: strlen(utf8_decode("ä")) === 1
Python 2: len("ä".decode("UTF-8")) == 1
Counting bytes instead of characters:
Unix: echo -n "ä" | wc -c is 2
PHP: strlen("ä") === 2
Python 2: len("ä") == 2
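The Python snippets above are Python 2. As a sketch of the same distinction in Python 3, where str is a sequence of Unicode code points, the following compares character counts with UTF-8 byte counts and shows how the chosen normal form changes both. The helper name lengths is hypothetical.

```python
import unicodedata


def lengths(s: str) -> tuple:
    """Return (number of code points, number of UTF-8 bytes) for `s`."""
    return len(s), len(s.encode("utf-8"))


if __name__ == "__main__":
    a_umlaut = "\u00e4"   # precomposed a with diaeresis
    hangul = "\ud6ef"     # precomposed syllable H+WEO+LH
    for label, sample in (("a-umlaut", a_umlaut), ("hangul", hangul)):
        for form in ("NFC", "NFD"):
            chars, octets = lengths(unicodedata.normalize(form, sample))
            print(f"{label:9} {form}: {chars} code point(s), {octets} byte(s)")
```

Because the decomposed forms contain more code points than the precomposed ones, offsets computed on an NFD string differ from offsets computed on the NFC version of the same text, so producer and consumer have to agree on the normal form.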
nif:Context always refer to the content of the nif:isString property.
One of the topics discussed during the creation of the RDF specification was whether to allow literals as subjects in RDF statements (Discussion summary).
The discussion concluded that in principle, there were no predominant technical reasons to deem this approach infeasible.
Notation 3 even permits literals as subjects of statements.
Therefore instances of nif:Context could be considered as:
<http://example.com/demo?cid=83848#char=0,40> owl:sameAs "My favourite actress is Natalie Portman." .
"My favourite actress is Natalie Portman." rdf:type nif:Context .
nif:sourceUrl, which is a subproperty of prov:hadPrimarySource, to link nif:Context to documents.
<http://persistence.uni-leipzig.org/nlp2rdf/specification/example/david_lynch_dune_quoteid_124#char=0,600>
    rdf:type nif:RFC5147String , nif:Context ;
    nif:beginIndex "0" ;
    nif:endIndex "600" ;
    nif:sourceUrl <http://persistence.uni-leipzig.org/nlp2rdf/specification/example/david_lynch_dune_quoteid_124.txt> ;
    nif:isString "# Quote 124 from David Lynch's Dune ...
nif:wasConvertedFrom, which is a subproperty of prov:wasDerivedFrom.
For each nif:Context extracted from another nif:Context, implementers must provide a nif:wasConvertedFrom provenance link between these contexts.
Note the change of the prefix in the following example.
<http://persistence.uni-leipzig.org/nlp2rdf/specification/example/david_lynch_dune_quoteid_124_sentence1#char=0,44>
    rdf:type nif:RFC5147String , nif:Context ;
    nif:beginIndex "0" ;
    nif:endIndex "44" ;
    nif:wasConvertedFrom <http://persistence.uni-leipzig.org/nlp2rdf/specification/example/david_lynch_dune_quoteid_124#char=47,91> ;
    nif:isString "It is by will alone I set my mind in motion." .
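How a derived context relates to its parent can be sketched as follows. The new context starts counting at 0 under its own prefix, while the nif:wasConvertedFrom link points back to the offsets in the parent context. The function name derive_context and the example.org prefixes are hypothetical and used only for illustration.

```python
def derive_context(parent_prefix: str, new_prefix: str,
                   parent_text: str, begin: int, end: int) -> str:
    """Build Turtle for a nif:Context cut out of a parent context (sketch)."""
    part = parent_text[begin:end]
    parent_uri = f"{parent_prefix}char={begin},{end}"
    new_uri = f"{new_prefix}char=0,{len(part)}"  # offsets restart at 0
    literal = (
        part.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return "\n".join(
        [
            f"<{new_uri}>",
            "    rdf:type nif:RFC5147String , nif:Context ;",
            '    nif:beginIndex "0" ;',
            f'    nif:endIndex "{len(part)}" ;',
            f"    nif:wasConvertedFrom <{parent_uri}> ;",
            f'    nif:isString "{literal}" .',
        ]
    )


if __name__ == "__main__":
    text = "First. Second sentence."
    print(derive_context("http://example.org/doc#",
                         "http://example.org/doc_sentence2#",
                         text, 7, 23))
```

The length of the derived string equals end minus begin in the parent, which mirrors the example above, where char=47,91 in the parent becomes char=0,44 in the new context.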
prefix part of the URI scheme and the remainder (i.e. "char=717,729") will be called the identifier part.
NIF recommends that the prefix end with a slash ('/'), a hash ('#') or a query component ('?').
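The split of a URI into prefix and identifier can be sketched in Python as follows. The helper names are hypothetical, only the char=begin,end identifier form used in this document is handled, and the check of the prefix ending is kept separate from minting because the ending is a recommendation.

```python
import re

# Identifier part of the form "char=<begin>,<end>" at the end of the URI.
IDENTIFIER = re.compile(r"char=(\d+),(\d+)$")


def has_recommended_ending(prefix: str) -> bool:
    """Check whether the prefix ends with '/', '#' or '?'."""
    return prefix.endswith(("/", "#", "?"))


def mint_uri(prefix: str, begin: int, end: int) -> str:
    """Concatenate the prefix and the identifier part."""
    return f"{prefix}char={begin},{end}"


def split_uri(uri: str) -> tuple:
    """Split a URI into (prefix, begin, end)."""
    match = IDENTIFIER.search(uri)
    if match is None:
        raise ValueError(f"no char=begin,end identifier in {uri!r}")
    return uri[:match.start()], int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    text = "My favourite actress is Natalie Portman."
    name = "Natalie Portman"
    begin = text.index(name)
    uri = mint_uri("http://example.org/doc#", begin, begin + len(name))
    print(uri, split_uri(uri), has_recommended_ending("http://example.org/doc#"))
```

Keeping the prefix free of offsets means that all annotations on one text share the same prefix and differ only in the identifier part.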
In order to improve conformance with this specification, we provide a validator that helps implementers to systematically detect errors in their NIF output. An up-to-date version can be downloaded at http://persistence.uni-leipzig.org/nlp2rdf/specification/validate.jar (we plan to host an online web service soon). Using the validator is an important step towards an interoperable NIF implementation. Implementers MUST validate their tool output with the validator. The validator itself is a valid NIF implementation and follows this specification as well as the Public API Specification.
./validate.jar -v or ./validate.jar -h
cat file.ttl | ./validate.jar -i -
# or
./validate.jar -i file.ttl -t file
curl --data-urlencode input="My favourite actress is Natalie Portman." -d informat=text "http://nlp2rdf.lod2.eu/nif-ws.php" |\
./validate.jar -i - --outformat text
-o text or --outformat text gives you a human-readable answer, while the default output is RDF using the RLOG - RDF Logging Ontology. More technical information is documented in the README. The SPARQL queries used can be found here: http://persistence.uni-leipzig.org/nlp2rdf/ontologies/testcase/lib/nif-2.0-suite.ttl