mirror of
https://github.com/opencloud-eu/opencloud.git
synced 2026-02-07 04:41:31 -05:00
# Search

The search service is responsible for metadata and content extraction.
The retrieved data is indexed and made searchable.

The search service runs out of the box with the shipped default `basic` configuration.
No further configuration is needed.

Note that as of now, the search service cannot be scaled.
Consider using dedicated hardware for this service if more resources are needed.

## Search backends

To store and query the indexed data, a search backend is needed.

As of now, the search service supports the following backends:

- [bleve](https://github.com/blevesearch/bleve) (default)
- [opensearch](https://opensearch.org/)

### Bleve

Bleve is a lightweight, embedded full-text search engine written in Go and is the default search backend.
It is straightforward to set up and requires no additional services to run.

The following optional settings can be set:

* `SEARCH_ENGINE_BLEVE_DATA_PATH=/path/to/bleve/index` (default: `$OC_BASE_DATA_PATH/search`): Path to store the bleve index.

### OpenSearch

OpenSearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases.
Additionally, it provides advanced features like clustering, replication, and sharding.

To enable OpenSearch as a backend, the following settings must be set:

* `SEARCH_ENGINE_TYPE=open-search`
* `SEARCH_ENGINE_OPEN_SEARCH_CLIENT_ADDRESSES=http://YOUR-OPENSEARCH.URL:9200` (comma-separated list of OpenSearch addresses)

Additionally, the following optional settings can be set:

* `SEARCH_ENGINE_OPEN_SEARCH_RESOURCE_INDEX_NAME=val` (default: `opencloud-resource`): Name of the OpenSearch index.
* `SEARCH_ENGINE_OPEN_SEARCH_CLIENT_USERNAME=val`: Username for HTTP Basic Authentication.
* `SEARCH_ENGINE_OPEN_SEARCH_CLIENT_PASSWORD=val`: Password for HTTP Basic Authentication.
* `SEARCH_ENGINE_OPEN_SEARCH_CLIENT_HEADER=val`: HTTP headers to include in requests.
* `SEARCH_ENGINE_OPEN_SEARCH_CLIENT_CA_CERT=val`: CA certificate for TLS connections.
* `SEARCH_ENGINE_OPEN_SEARCH_CLIENT_RETRY_ON_STATUS=val`: HTTP status codes that trigger a retry.
* `SEARCH_ENGINE_OPEN_SEARCH_CLIENT_DISABLE_RETRY=val`: Disable retries on errors.
* `SEARCH_ENGINE_OPEN_SEARCH_CLIENT_ENABLE_RETRY_ON_TIMEOUT=val`: Enable retries on timeout.
* `SEARCH_ENGINE_OPEN_SEARCH_CLIENT_MAX_RETRIES=val`: Maximum number of retries for requests.
* `SEARCH_ENGINE_OPEN_SEARCH_CLIENT_COMPRESS_REQUEST_BODY=val`: Compress request bodies.
* `SEARCH_ENGINE_OPEN_SEARCH_CLIENT_DISCOVER_NODES_ON_START=val`: Discover nodes on service start.
* `SEARCH_ENGINE_OPEN_SEARCH_CLIENT_DISCOVER_NODES_INTERVAL=val`: Interval for discovering nodes.
* `SEARCH_ENGINE_OPEN_SEARCH_CLIENT_ENABLE_METRICS=val`: Enable metrics collection.
* `SEARCH_ENGINE_OPEN_SEARCH_CLIENT_ENABLE_DEBUG_LOGGER=val`: Enable debug logging.
* `SEARCH_ENGINE_OPEN_SEARCH_CLIENT_INSECURE=val`: Skip TLS certificate verification.
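As a minimal sketch of how these settings fit together, the environment for a two-node OpenSearch cluster with Basic Authentication might look like the following. The host names, index name, and credentials are placeholders invented for illustration; only the variable names and the `open-search` value come from the lists above.

```shell
# Required: select the OpenSearch backend and list its nodes (comma-separated).
# The host names below are hypothetical placeholders.
export SEARCH_ENGINE_TYPE=open-search
export SEARCH_ENGINE_OPEN_SEARCH_CLIENT_ADDRESSES="http://opensearch-1.example.internal:9200,http://opensearch-2.example.internal:9200"

# Optional: override the index name (the default is opencloud-resource).
export SEARCH_ENGINE_OPEN_SEARCH_RESOURCE_INDEX_NAME="opencloud-resource-staging"

# Optional: HTTP Basic Authentication. Placeholder credentials; in a real
# deployment the password would come from a secret store, not a script.
export SEARCH_ENGINE_OPEN_SEARCH_CLIENT_USERNAME="opencloud"
export SEARCH_ENGINE_OPEN_SEARCH_CLIENT_PASSWORD="change-me"
```

These variables need to be present in the environment of the search service when it starts.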

## Query language

By default, [KQL](https://learn.microsoft.com/en-us/sharepoint/dev/general-development/keyword-query-language-kql-syntax-reference) is used as the query language.
For an overview of how to write KQL queries, please read the [Microsoft documentation](https://learn.microsoft.com/en-us/sharepoint/dev/general-development/keyword-query-language-kql-syntax-reference).

Not all parts of KQL are supported. The following parts are not implemented yet:

* Synonym operators
* Inclusion and exclusion operators
* Dynamic ranking operator
* ONEAR operator
* NEAR operator
* Date intervals

In [this ADR](https://github.com/owncloud/ocis/blob/docs/ocis/adr/0020-file-search-query-language.md) you can read why KQL was chosen.
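To illustrate query shapes that avoid the unimplemented operators listed above, here is a small sketch with three KQL strings. The property names `name` and `content` are assumptions made for this example and are not confirmed by this document; check which properties your OpenCloud version exposes before relying on them.

```shell
# KQL queries are plain strings. Single quotes keep the shell from
# interpreting the embedded double quotes and the * wildcard.

# Free-text keyword.
query_keyword='budget'

# Property restriction with a quoted phrase
# (the property name "name" is an assumption).
query_phrase='name:"annual report"'

# Prefix match combined with boolean operators
# (the property name "content" is an assumption).
query_boolean='name:report* AND NOT content:draft'

printf '%s\n' "$query_keyword" "$query_phrase" "$query_boolean"
```

The boolean operators are written in upper case, as in the KQL syntax reference linked above.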

## Content analysis / Extraction

The search service supports the following content extraction methods:

* `Basic`: enabled by default, only provides metadata extraction.
* `Tika`: needs to be installed and configured separately, provides content extraction for many file types.

Note that the file content has to be transferred to the search service internally for content extraction,
which is resource-intensive and can lead to delays with larger documents.

### Basic

This extractor is the simplest one and just uses the resource information provided by OpenCloud.
It does not do any further content analysis.

### Tika

The main difference from the `Basic` extractor is that this extractor can analyze and extract data from more advanced file types like PDF, DOCX, PPTX, etc.
However, [Apache Tika](https://tika.apache.org/) is required for this task.
Read the [Getting Started with Apache Tika](https://tika.apache.org/2.6.0/gettingstarted.html) guide on how to install and run Tika, or use a ready-to-run [Tika container](https://hub.docker.com/r/apache/tika).
See the [Tika container usage document](https://github.com/apache/tika-docker#usage) for a quickstart.

As soon as Tika is installed and configured, the search service needs to be told to use it.

The following settings must be set:

* `SEARCH_EXTRACTOR_TYPE=tika`
* `SEARCH_EXTRACTOR_TIKA_TIKA_URL=http://YOUR-TIKA.URL`

Additionally, the following optional settings can be set:

* `SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS=true` (default: `true`): ignore stop words like `I`, `you`, `the` during content extraction.
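A minimal sketch of the resulting configuration, assuming a Tika server is already running on the same host and listening on Tika's default port 9998 (for example, started from the Tika container linked above). The address is an assumption for this example; replace it with wherever your Tika instance is reachable.

```shell
# Assumption: a Tika server is reachable at localhost:9998, for example
# started with the container image described in the Tika container usage
# document:
#   docker run -d -p 9998:9998 apache/tika:<tag>

# Switch the search service from the default basic extractor to Tika.
export SEARCH_EXTRACTOR_TYPE=tika
export SEARCH_EXTRACTOR_TIKA_TIKA_URL="http://localhost:9998"

# Optional: this is already the default, shown here only for completeness.
export SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS=true
```

Before starting the search service, it can help to confirm that Tika answers at the configured address, for instance with `curl "$SEARCH_EXTRACTOR_TIKA_TIKA_URL/version"`.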

## Manually Trigger Re-Indexing a Space

The service includes a command-line interface to trigger re-indexing a space:

```shell
opencloud search index --space $SPACE_ID
```

It can also be used to re-index all spaces:

```shell
opencloud search index --all-spaces
```
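When several spaces, but not all of them, need re-indexing, the single-space command can be wrapped in a loop. The following is only a convenience sketch: the function name `reindex_spaces` and the `OPENCLOUD_BIN` variable are hypothetical helpers introduced here, and only `opencloud search index --space` comes from this document.

```shell
# Re-index each space ID passed as an argument, one after another.
# OPENCLOUD_BIN is a hypothetical override for the binary location;
# it defaults to "opencloud" on the PATH.
reindex_spaces() {
  bin="${OPENCLOUD_BIN:-opencloud}"
  for space_id in "$@"; do
    # Stop at the first failure so a broken space is not silently skipped.
    "$bin" search index --space "$space_id" || return 1
  done
}
```

It could then be called as `reindex_spaces "$SPACE_ID_A" "$SPACE_ID_B"`, where both variables hold space IDs you have looked up beforehand.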

## Metrics

The search service exposes the following Prometheus metrics at `<debug_endpoint>/metrics` (as configured using the `SEARCH_DEBUG_ADDR` env var):

| Metric Name | Type | Description | Labels |
| --- | --- | --- | --- |
| `opencloud_search_build_info` | Gauge | Build information | `version` |
| `opencloud_search_events_outstanding_acks` | Gauge | Number of outstanding acks for events | |
| `opencloud_search_events_unprocessed` | Gauge | Number of unprocessed events | |
| `opencloud_search_events_redelivered` | Gauge | Number of redelivered events | |
| `opencloud_search_search_duration_seconds` | Histogram | Duration of search operations in seconds | `status` |
| `opencloud_search_index_duration_seconds` | Histogram | Duration of indexing operations in seconds | `status` |
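For a quick manual check without a Prometheus server, the metrics endpoint can be fetched and narrowed down to the search service's own metrics. This is a sketch: the helper name `filter_search_metrics` is hypothetical, and the address `127.0.0.1:9224` in the usage comment is an assumed value for `SEARCH_DEBUG_ADDR`, not one stated in this document.

```shell
# Keep only the search service's own metric samples from a Prometheus text
# exposition read on stdin. HELP/TYPE comment lines and metrics from other
# namespaces are dropped because they do not start with the prefix.
filter_search_metrics() {
  grep '^opencloud_search_'
}

# Hypothetical usage, assuming SEARCH_DEBUG_ADDR is 127.0.0.1:9224:
#   curl -s "http://127.0.0.1:9224/metrics" | filter_search_metrics
```

Histogram metrics appear as several series with `_bucket`, `_sum`, and `_count` suffixes, all of which pass this filter.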