The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations. NIF consists of specifications, ontologies and software (overview), which are combined under the version identifier "NIF 2.0", but are versioned individually.
This document specifies the core of NIF, consisting of:
NIF is primarily designed to store and transfer text and text annotations.
In order to enter the NIF and RDF world, two things are needed: (1) the text, also called the primary data, must be converted to an RDF literal as the object of the
nif:isString property, and (2) there must be a way to programmatically mint URIs so that annotations can be added to the text. In the example below annotations can be added to the <SubjectURI>, which serves as the context, i.e. a representative for the string in
<SubjectURI> nif:isString "Your text, e.g. a single sentence or the content of a whole document; basically any sequence of characters." .
curl --data-urlencode input="My favourite actress is Natalie Portman." -d informat=text "http://nlp2rdf.lod2.eu/nif-ws.php"
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
<http://nlp2rdf.lod2.eu/nif-ws.php#char=0,40>
    rdf:type nif:RFC5147String , nif:Context ;
    nif:beginIndex "0" ;
    nif:endIndex "40" ;
    nif:isString "My favourite actress is Natalie Portman." .
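The two steps (wrapping the text in a literal and minting a URI for it) can be sketched in Python. This is a minimal illustration, not the web service's implementation: `mint_context` and `turtle_literal` are hypothetical helper names, the `char=begin,end` identifier follows the output shown above, and offsets are assumed to be counted in Unicode code points.

```python
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"

def turtle_literal(text):
    """Quote a string as a single-line Turtle literal (basic escapes only)."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n")
                   .replace("\r", "\\r"))
    return '"' + escaped + '"'

def mint_context(prefix, text):
    """Return (context URI, Turtle document) for the given primary data.

    Assumption: the context always spans the whole string, so it starts
    at offset 0 and ends at the number of code points in the text.
    """
    end = len(text)  # Python's len() counts code points
    uri = f"{prefix}char=0,{end}"
    turtle = "\n".join([
        f"@prefix nif: <{NIF}> .",
        f"<{uri}>",
        "    a nif:RFC5147String , nif:Context ;",
        '    nif:beginIndex "0" ;',
        f'    nif:endIndex "{end}" ;',
        f"    nif:isString {turtle_literal(text)} .",
    ])
    return uri, turtle

uri, turtle = mint_context("http://nlp2rdf.lod2.eu/nif-ws.php#",
                           "My favourite actress is Natalie Portman.")
print(turtle)
```

The minted URI can then be used as the subject of further annotations.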
http://persistence.uni-leipzig.org/nlp2rdf/specification/example/david_lynch_dune_quoteid_124 as a non-information resource URI (a global identifier independent of the data and representation). A web server such as Apache can now be configured to return various information resources via content negotiation (HTTP "Accept:" header) and "303 - See Other" redirects, as is common practice in Linked Data:
text/plain 303-redirects to david_lynch_dune_quoteid_124.txt
text/html 303-redirects to an HTML visualization: david_lynch_dune_quoteid_124.php
text/turtle 303-redirects to RDF in Turtle: david_lynch_dune_quoteid_124.ttl
application/json 303-redirects to RDF in JSON-LD: david_lynch_dune_quoteid_124.json
application/rdf+xml 303-redirects to RDF in RDF/XML: david_lynch_dune_quoteid_124.owl
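The mapping above can be expressed as a small lookup. The sketch below is only an illustration of the redirect logic, not an Apache configuration: `REPRESENTATIONS` and `redirect_for` are hypothetical names, q-values in the Accept header are ignored for brevity, and falling back to the HTML visualization when nothing matches is an assumption.

```python
# Media type -> file extension of the information resource, as listed above.
REPRESENTATIONS = {
    "text/plain": ".txt",
    "text/html": ".php",
    "text/turtle": ".ttl",
    "application/json": ".json",
    "application/rdf+xml": ".owl",
}

def redirect_for(resource_uri, accept_header, default="text/html"):
    """Return (status, location) for a request to the non-information resource.

    Simplification: media types are tried in the order they appear in the
    header; quality values (q=...) are not evaluated.
    """
    for item in accept_header.split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type in REPRESENTATIONS:
            return 303, resource_uri + REPRESENTATIONS[media_type]
    # Assumed fallback: serve the HTML visualization.
    return 303, resource_uri + REPRESENTATIONS[default]

base = ("http://persistence.uni-leipzig.org/nlp2rdf/specification/"
        "example/david_lynch_dune_quoteid_124")
print(redirect_for(base, "text/turtle"))
```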
<http://persistence.uni-leipzig.org/nlp2rdf/specification/example/david_lynch_dune_quoteid_124#char=0,600>
    rdf:type nif:RFC5147String , nif:Context ;
    nif:sourceUrl <http://persistence.uni-leipzig.org/nlp2rdf/specification/example/david_lynch_dune_quoteid_124.txt> ;
    nif:beginIndex "0" ;
    nif:endIndex "600" ;
    nif:isString """# Quote 124 from David Lynch's Dune ...
curl --data-urlencode input@david_lynch_dune_quoteid_124.txt --data-urlencode prefix="http://persistence.uni-leipzig.org/nlp2rdf/specification/example/david_lynch_dune_quoteid_124#" -d informat=text "http://nlp2rdf.lod2.eu/nif-ws.php" > david_lynch_dune_quoteid_124.ttl
According to the RDF 1.1 specification (3.3 Literals), RDF literals are Unicode strings, which should be in Normal Form C (NFC). In NIF, we will follow this recommendation in general. There are, however, circumstances which require the use of Normal Form D (NFD) or even NFKC or NFKD. Therefore NIF allows NFD, NFKC and NFKD, if the use case justifies the usage.
One such use case arises when a linguistic annotator needs to annotate individual diacritics or parts of precomposed characters and syllables. For linguists with this use case, or for the applicable languages, using NFD is obvious and well justified. We only give examples here and refer the interested reader to these three documents: Gernot Katzer's page about the Korean writing system, the Wikipedia article about Korean Hangul, and the Unicode Normal Form specification.
Example 1 (taken from the Unicode Normal Form spec):
Composed: ﬁ (kept by NFC and NFD) or ñ (NFC and NFKC). Decomposed: f , i (NFKC and NFKD) or n , ~ (NFD and NFKD)
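The behaviour of the two characters in Example 1 can be checked with Python's standard `unicodedata` module. The sketch below prints the code points each normal form yields; `code_points` and `normal_forms` are hypothetical helper names. It shows that ñ decomposes canonically (NFD, NFKD), whereas the ligature ﬁ is only split by the compatibility forms (NFKC, NFKD).

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def code_points(s):
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

def normal_forms(s):
    """Map each Unicode normal form name to the normalized string."""
    return {form: unicodedata.normalize(form, s) for form in FORMS}

# U+00F1 is the precomposed n with tilde, U+FB01 the fi ligature.
for label, char in (("n with tilde", "\u00f1"), ("fi ligature", "\ufb01")):
    print(label)
    for form, value in normal_forms(char).items():
        print(f"  {form}: {code_points(value)}")
```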
Precomposed Hangul 훯, three conjoining Jamo (H+WEO+LH) 훯, the same three Jamo enclosed in some markup to prevent their joining 훠 ᆶ and three Compatibility Jamo ㅎㅝㅀ. Ideally, only the first two should render identically as compound Hangul.
"ä".length() == 1
echo -n "ä" | wc -c    # is 2
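The contrast above (one character, two UTF-8 bytes) matters for any offset arithmetic, and the chosen normal form changes the counts as well. The following sketch compares the three measures; `lengths` is a hypothetical helper, and Python's `len()` counts code points, which coincides with Java's `length()` only for characters in the Basic Multilingual Plane.

```python
import unicodedata

def lengths(text):
    """Count a string in NFC code points, NFD code points and UTF-8 bytes."""
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    return {
        "nfc_code_points": len(nfc),
        "nfd_code_points": len(nfd),
        "utf8_bytes_of_nfc": len(nfc.encode("utf-8")),
    }

# U+00E4 is a-umlaut, U+D6EF the precomposed Hangul syllable from Example 2.
for sample in ("\u00e4", "\ud6ef", "K\u00e4se"):
    print(repr(sample), lengths(sample))
```

An annotation offset computed on NFD text therefore does not address the same position in the NFC version of the same string.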
nif:Context always refer to the content of the nif:isString property. During the creation of the RDF specification, one of the topics discussed was whether to allow literals as subjects in RDF statements (Discussion summary). The discussion concluded that, in principle, there were no predominant technical reasons to deem this approach infeasible. Notation 3 even permits literals as subjects of statements. Therefore instances of nif:Context could be considered as:
<http://example.com/demo?cid=83848#char=0,40> owl:sameAs "My favourite actress is Natalie Portman." .
"My favourite actress is Natalie Portman." rdf:type nif:Context .
nif:sourceUrl, which is a subproperty of prov:hadPrimarySource, to link nif:Context to documents.
<http://persistence.uni-leipzig.org/nlp2rdf/specification/example/david_lynch_dune_quoteid_124#char=0,600>
    rdf:type nif:RFC5147String , nif:Context ;
    nif:beginIndex "0" ;
    nif:endIndex "600" ;
    nif:sourceUrl <http://persistence.uni-leipzig.org/nlp2rdf/specification/example/david_lynch_dune_quoteid_124.txt> ;
    nif:isString """# Quote 124 from David Lynch's Dune ...
nif:wasConvertedFrom, which is a subproperty of prov:wasDerivedFrom. For each nif:Context taken out of another nif:Context, implementers must provide a nif:wasConvertedFrom provenance link between these contexts. Note the change of the prefix in the following example.
<http://persistence.uni-leipzig.org/nlp2rdf/specification/example/david_lynch_dune_quoteid_124_sentence1#char=0,44>
    rdf:type nif:RFC5147String , nif:Context ;
    nif:beginIndex "0" ;
    nif:endIndex "44" ;
    nif:wasConvertedFrom <http://persistence.uni-leipzig.org/nlp2rdf/specification/example/david_lynch_dune_quoteid_124#char=47,91> ;
    nif:isString "It is by will alone I set my mind in motion." .
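Cutting a new context out of an existing one and recording the provenance link can be sketched as follows. This is an illustration under stated assumptions: `derive_context` and `turtle_literal` are hypothetical helpers, the example.org prefixes and the sample text are made up, and offsets are counted in code points. The new context restarts at offset 0 under its own prefix, while nif:wasConvertedFrom points at the span in the source context.

```python
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"

def turtle_literal(text):
    """Quote a string as a single-line Turtle literal (basic escapes only)."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n")
                   .replace("\r", "\\r"))
    return '"' + escaped + '"'

def derive_context(source_prefix, source_text, begin, end, new_prefix):
    """Build Turtle for a context taken out of source_text[begin:end]."""
    if not 0 <= begin <= end <= len(source_text):
        raise ValueError("offsets out of range")
    snippet = source_text[begin:end]
    source_uri = f"{source_prefix}char={begin},{end}"
    new_uri = f"{new_prefix}char=0,{len(snippet)}"
    return "\n".join([
        f"@prefix nif: <{NIF}> .",
        f"<{new_uri}>",
        "    a nif:RFC5147String , nif:Context ;",
        '    nif:beginIndex "0" ;',
        f'    nif:endIndex "{len(snippet)}" ;',
        f"    nif:wasConvertedFrom <{source_uri}> ;",
        f"    nif:isString {turtle_literal(snippet)} .",
    ])

# Made-up source document; only the middle sentence is taken from the example.
document = "Intro line.\nIt is by will alone I set my mind in motion.\nOutro."
begin = document.index("It is")
print(derive_context("http://example.org/doc#", document, begin, begin + 44,
                     "http://example.org/doc_sentence1#"))
```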
prefix part of the URI scheme and the remainder (i.e. "char=717,729") will be called the identifier part. NIF recommends that the prefix end on a slash ('/'), a hash ('#') or a query component ('?').
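Splitting a NIF URI into its two parts is a simple string operation. The sketch below handles only the `char=begin,end` identifier form used in the examples of this document; `split_nif_uri` and `prefix_is_recommended` are hypothetical names, and treating the recommendation as "the prefix ends with one of the three characters" is a literal reading rather than a normative check.

```python
import re

# prefix = everything before the trailing "char=<begin>,<end>" identifier
_NIF_URI = re.compile(
    r"^(?P<prefix>.*?)(?P<identifier>char=(?P<begin>\d+),(?P<end>\d+))$"
)

def split_nif_uri(uri):
    """Return (prefix, identifier, begin, end) for a char=... NIF URI."""
    match = _NIF_URI.match(uri)
    if match is None:
        raise ValueError(f"no char=begin,end identifier in {uri!r}")
    begin, end = int(match["begin"]), int(match["end"])
    if begin > end:
        raise ValueError("begin offset must not exceed end offset")
    return match["prefix"], match["identifier"], begin, end

def prefix_is_recommended(prefix):
    """True if the prefix ends on '/', '#' or '?'."""
    return prefix.endswith(("/", "#", "?"))

prefix, identifier, begin, end = split_nif_uri(
    "http://example.com/demo?cid=83848#char=0,40")
print(prefix, identifier, begin, end, prefix_is_recommended(prefix))
```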
In order to improve conformance with this specification, we provide a validator that helps implementers systematically detect errors in their NIF output. An up-to-date version can be downloaded at http://persistence.uni-leipzig.org/nlp2rdf/specification/validate.jar (we plan to host an online web service soon). The validator is one important step towards an interoperable NIF implementation. Implementers MUST validate their tool output with the validator. The validator itself is a valid NIF implementation and follows this specification as well as the Public API Specification.
./validate.jar -v or ./validate.jar -h
cat file.ttl | ./validate.jar -i - #or ./validate.jar -i file.ttl -t file
curl --data-urlencode input="My favourite actress is Natalie Portman." -d informat=text "http://nlp2rdf.lod2.eu/nif-ws.php" | \
  ./validate.jar -i - --outformat text
--outformat text gives you a human-readable answer, while the default output is RDF using the RLOG - RDF Logging Ontology. More technical information is documented in the README. The SPARQL queries used can be found here: http://persistence.uni-leipzig.org/nlp2rdf/ontologies/testcase/lib/nif-2.0-suite.ttl