Sometimes it is not possible, for practical or legal reasons, to move corpus data to a local machine. This vignette explains how to use CWB corpora that are hosted on an OpenCPU server.

## polmineR is throttled to use 2 cores as required by CRAN Repository Policy. To get full performance:
## * Use `n_cores <- parallel::detectCores()` to detect the number of cores available on your machine
## * Set number of cores using `options('polmineR.cores' = n_cores - 1)` and `data.table::setDTthreads(n_cores - 1)`

Remote Corpora

Publicly Available Corpora

The GermaParl corpus is hosted on an OpenCPU server with the IP (subject to change). To use the corpus, call the corpus() method as you would for a local corpus. The only difference is that you need to supply the server's IP address using the argument server.

gparl <- corpus("GERMAPARL", server = "")

The gparl object is an object of class remote_corpus.


Using polmineR core functionality

At this stage, polmineR exposes only a limited set of its functionality for remote corpora. Simple investigations of a remote corpus are possible, however.

Get corpus size


Get structural annotation (metadata)



Subsetting

gparl2006 <- subset(gparl, year == "2006")

The returned object has the class remote_subcorpus.


Simple count

count(gparl, query = "Integration")

The count()-method works for remote_subcorpus objects, too.

count(gparl2006, query = "Integration")


Keyword-in-context (KWIC)

kwic(gparl, query = "Islam", left = 15, right = 15, meta = c("speaker", "party", "date"))

The kwic() method works for remote_subcorpus objects, too.

kwic(gparl2006, query = "Islam", left = 15, right = 15, meta = c("speaker", "party", "date"))

Restricted Corpora

  1. Create a directory for files with credentials. These files follow the format of CWB registry files.

  2. Create a file with the credentials for your corpus in this directory.

Note: The filename is the corpus ID in lowercase.

## registry entry for corpus GERMAPARLSAMPLE

# long descriptive name for the corpus
NAME "GermaParlSample"
# corpus ID (must be lowercase in registry!)
ID   germaparlsample
# path to binary data files
HOME http://localhost:8005
# optional info file (displayed by ",info;" command in CQP)

# corpus properties provide additional information about the corpus:
##:: user = "YOUR_USER_NAME"
##:: password = "YOUR_PASSWORD"

  3. Set the environment variable “OPENCPU_REGISTRY” in .Renviron to the directory created in step 1.

  4. Find out the address of the server, then load the corpus with restricted = TRUE.

x <- corpus("MIGPRESS_FAZ", server = "YOURSERVER", restricted = TRUE)
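
The credential setup described in the steps above can be sketched in base R. This is a minimal illustration, not the package's own tooling: the directory location (a temporary directory) and the variable names registry_dir and registry_file are hypothetical, and the file contents simply repeat the example entry shown above. Sys.setenv() only affects the current R session. Whether polmineR picks up a value set this way, rather than one read from .Renviron at startup, is an assumption.

```r
# Step 1: create a directory for the credential files.
# Hypothetical location; any directory you can write to would do.
registry_dir <- file.path(tempdir(), "opencpu_registry")
dir.create(registry_dir, recursive = TRUE, showWarnings = FALSE)

# Step 2: write the credentials file. The filename is the corpus ID in lowercase.
corpus_id <- "GERMAPARLSAMPLE"
registry_file <- file.path(registry_dir, tolower(corpus_id))
writeLines(
  c(
    'NAME "GermaParlSample"',
    "ID   germaparlsample",
    "HOME http://localhost:8005",
    '##:: user = "YOUR_USER_NAME"',      # placeholder, replace with your user name
    '##:: password = "YOUR_PASSWORD"'    # placeholder, replace with your password
  ),
  con = registry_file
)

# Step 3: point OPENCPU_REGISTRY to the directory (current session only).
# For a permanent setting, add a line such as
#   OPENCPU_REGISTRY=/path/to/your/directory
# to your .Renviron file instead.
Sys.setenv(OPENCPU_REGISTRY = registry_dir)

# Read the file back to check that both credential properties are present.
credentials <- grep("^##::", readLines(registry_file), value = TRUE)
```

Keeping the credentials in a separate file, rather than in the script that calls corpus(), means that the user name and password never need to appear in analysis code that is shared with others.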

Next steps

What is described here is a simple proof of concept. Upcoming versions of polmineR will expose further functionality for remote corpora.