Sample dataset
Now that we've gotten a glimpse of the basics, let's try to work on a more realistic dataset. I've prepared a sample of fictitious JSON documents of customer bank account information. Each document has the following schema:
{
"firstname": "Ashley",
"lastname": "Clark",
"age": 24,
"gender": "female",
"phone": "+1 (964) 525-3462",
"company": "Pearlessa",
"address": "477 Branton Street",
"city": "Trexlertown",
"state": "Indiana",
"email": "ashley.clark@pearlessa.co.uk",
"eyeColor": "green",
"favoriteFruit": "strawberry",
"account_number": 360070,
"balance": "2550.21"
}
For the curious, this data was generated using JSON Generator, so please ignore the actual values and semantics of the data as these are all randomly generated.
Loading the Sample Dataset
You can download the sample dataset{:target="_blank"}. Extract it to our current directory and let's load it into our cluster as follows:
{% capture req %}
POST /bank/:restore?pretty
Content-Type: application/json
@accounts.json
{% endcapture %} {% include curl.html req=req %}
And then you can use :info
to get information about the new index:
{% capture req %}
GET /bank/:info?pretty
{% endcapture %} {% include curl.html req=req %}
Response should be something like:
{
"#database_info": {
"#uuid": "923a4470-7cdc-45ec-827c-fa85703fa8f6",
"#doc_count": 1000,
"#last_id": 1000,
"#doc_del": 0,
"#av_length": 22.432,
"#doc_len_lower": 22,
"#doc_len_upper": 25,
"#has_positions": false
}
}
Which means that we just successfully bulk indexed 1000 documents into the bank index.
The Search API
Now let's start with some simple searches. There are two basic ways to run searches: one is by sending search parameters through the REST request URI and the other by sending them through the REST request body. The request body method allows you to be more expressive and also to define your searches in a more readable JSON format. We'll try one example of the request URI method but for the remainder of this guide, we will exclusively be using the request body method.
The REST API for search is accessible from the :search
endpoint. This example
returns all documents in the bank index:
{% capture req %}
GET /bank/:search?q=*&sort=account_number&pretty
{% endcapture %} {% include curl.html req=req %}
Let's first dissect the search call. We are searching (:search
endpoint) in
the bank
index, and the q=*
parameter instructs Xapiand to match all
documents in the index. The sort=account_number
parameter indicates to
sort the results using the account_number
field of each document in an
ascending order. The pretty
parameter just tells Xapiand to return
pretty-printed JSON results, the same effect can be achieved by using the
Accept
header as in: Accept: application/json; indent: 4
.
And the response (partially shown):
{
"#query": {
"#total_count": 10,
"#matches_estimated": 1000,
"#hits": [
{
"city": "Fairview",
"gender": "female",
"balance": "1073.05",
"firstname": "Hester",
"lastname": "Blake",
"company": "Affluex",
"favoriteFruit": "strawberry",
"eyeColor": "brown",
"phone": "+1 (919) 400-3616",
"state": "Virgin Islands",
"account_number": 100123,
"address": "756 Strauss Street",
"age": 24,
"email": "hester.blake@affluex.net",
"_id": 233,
"#docid": 233,
"#rank": 0,
"#weight": 0.0,
"#percent": 100
}, ...
]
},
"#took": 21.49
}
As for the response, we see the following parts:
#query ➛ #total_count
- Total number of returned hits.#query ➛ #matches_estimated
- Number of estimated documents that match the query.#query ➛ #hits
- search results.#took
- time in milliseconds for Xapiand to execute the search.
Introducing the Query Language
Xapiand provides a JSON-style domain-specific language that you can use to execute queries. This is referred to as the Query DSL. The query language is quite comprehensive and can be intimidating at first glance but the best way to actually learn it is to start with a few basic examples.
{: .note} The Query DSL method for searching is much more efficient.
Going back to our last example, we executed a query to retrieve all documents
using q=*
. Here is the same exact search using the alternative request body
method:
{% capture req %}
GET /bank/:search?pretty
{
"_query": "*",
"_sort": "account_number"
}
{% endcapture %} {% include curl.html req=req %}
The difference here is that instead of passing q=*
in the URI, we POST a
JSON-style query request body to the :search
API.
Dissecting the above, the query part tells us what our query definition is and
the match_all
part is simply the type of query that we want to run. The
match_all
query is simply a search for all documents in the specified index.
In addition to the query parameter, we also can pass other parameters to
influence the search results. In the example in the section above we passed in
sort, here we pass in limit
:
{% capture req %}
GET /bank/:search?pretty
{
"_query": "*",
"_limit": 1
}
{% endcapture %} {% include curl.html req=req %}
Note that if limit
is not specified, it defaults to 10.
This example does a match_all
and returns documents 10 through 19:
{% capture req %}
GET /bank/:search?pretty
{
"_query": "*",
"_offset": 10,
"_limit": 10
}
{% endcapture %} {% include curl.html req=req %}
The offset
parameter (0-based) specifies which document index to start from
and the limit
parameter specifies how many documents to return starting at the
given offset
. This feature is useful when implementing paging of search
results. Note that if offset
is not specified, it defaults to 0.
This example does a match_all
and sorts the results by account balance in
descending order and returns the top 10 (default limit
) documents.
{% capture req %}
GET /bank/:search?pretty
{
"_query": "*",
"_sort": { "balance": { "_order": "desc" } }
}
{% endcapture %} {% include curl.html req=req %}
Executing Searches
Now that we have seen a few of the basic search parameters, let's dig in some more into the Query DSL. Let's first take a look at the returned document fields. By default, the full JSON document is returned as part of all searches. This is referred to as the source (_source field in the search hits). If we don't want the entire source document returned, we have the ability to request only a few fields from within source to be returned.
This example shows how to return two fields, account_number
and balance
(inside of _source), from the search:
{% capture req %}
GET /bank/:search?pretty
{
"_query": "*",
"_source": ["account_number", "balance"]
}
{% endcapture %} {% include curl.html req=req %}
{: .note .unreleased} TODO: Work in progress...
Executing Filters
{% capture req %}
GET /bank/:search?pretty
{
"_query": {
"bool": {
"must": "*",
"filter": {
"range": {
"balance": {
"gte": 20000,
"lte": 30000
}
}
}
}
}
}
{% endcapture %} {% include curl.html req=req %}
{: .note .unreleased} TODO: Work in progress...
Executing Aggregations
Aggregations provide the ability to group and extract statistics from your data. The easiest way to think about aggregations is by roughly equating it to the SQL GROUP BY and the SQL aggregate functions. In Xapiand, you have the ability to execute searches returning hits and at the same time return aggregated results separate from the hits all in one response. This is very powerful and efficient in the sense that you can run queries and multiple aggregations and get the results back of both (or either) operations in one shot avoiding network roundtrips using a concise and simplified API.
To start with, this example groups all the accounts by state, and then returns the top 10 (default) states sorted by count descending (also default):
{% capture req %}
GET /bank/:search?pretty
{
"_limit": 0,
"_aggs": {
"_group_by_state": {
"terms": {
"field": "state.keyword"
}
}
}
}
{% endcapture %} {% include curl.html req=req %}
In SQL, the above aggregation is similar in concept to:
SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC
And the response (partially shown):
{
...
}
{: .note .unreleased} TODO: Work in progress...
There are many other aggregations capabilities that we won't go into detail here. The Aggregations Reference Guide is a great starting point if you want to do further experimentation.