
Apr 30, 2025 - 23:07
Querying embedded arrays in JSON (PostgreSQL JSONB and MongoDB documents)

When working with document-based data structures, fields at the root can be indexed with simple expression indexes. However, when an array appears in the path, representing a one-to-many relationship, PostgreSQL requires a GIN index and JSON path operators, which are more efficient than standard SQL/JSON queries.

Example

I create the following table to store books. I decided to embed more information with a flexible schema and added a "data" column to store JSON data:

create table books (  
   primary key(book_id),  
   book_id bigint,  
   title text,  
   data jsonb  
);

I insert one book and add some reviews in my flexible schema document:

insert into books values (  
   8675309,  
   'Brave New World',  
   '{ "reviews":[  
      { "name": "John", "text": "Amazing!" },  
      { "name": "Jane", "text": "Incredible book!" }  
   ] }'
);

There’s no need for another table, as reviews are inherently linked to the books they discuss. A book review is never displayed without its book, making any separate table unnecessary. I know this looks like a violation of first normal form, but there's no update anomaly possible here because there's no duplication. From a normalization point of view, this is not very different from storing text, which is an array of characters, or embeddings, which are arrays of numbers.
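By contrast, a scalar field at the root of the document can be indexed with a simple expression index, as mentioned in the introduction. Here is a sketch (the published_year field is hypothetical, not part of the example document):

```sql
-- A root-level scalar field needs no GIN index: a regular B-Tree
-- expression index supports both equality and range predicates.
CREATE INDEX ON books (((data->>'published_year')::int));

-- A query with the same expression can use the index for a range scan:
SELECT title FROM books
WHERE (data->>'published_year')::int >= 1930;
```

The rest of this post is about the array case, where this simple approach no longer works.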

Inefficient query with SQL join

If you're comfortable with SQL, you might want to query this structure using SQL. Simply unnest the JSON document array and use it like a relational table:

SELECT DISTINCT title FROM books    
JOIN LATERAL jsonb_array_elements(books.data->'reviews') AS review 
ON review->>'name' = 'John'
;  

jsonb_array_elements expands a JSONB array into rows so it can be queried with SQL. The lateral join adds the book information, the ON or WHERE clause filters by reviewer name, and DISTINCT removes duplicate titles. This reads like a regular relational join, but no index can filter on the reviewer name before unnesting, so all rows and documents must be read:

                              QUERY PLAN                               
-----------------------------------------------------------------------
 Unique
   ->  Sort
         Sort Key: books.title
         ->  Nested Loop
               ->  Seq Scan on books
               ->  Function Scan on jsonb_array_elements review
                     Filter: ((value ->> 'name'::text) = 'John'::text)

While this is valid SQL syntax, and JSONB is a valid SQL datatype, they do not combine well, because a relational database is not a document database. When storing documents in PostgreSQL, you must learn how to query and index them.

Note that jsonb_array_elements is not part of the SQL standard, but PostgreSQL 17 introduced JSON_TABLE, which is. The query can be rewritten as:

SELECT books.title  
FROM books  
JOIN JSON_TABLE(  
  books.data->'reviews',  
  '$[*]' COLUMNS (  
    name TEXT PATH '$.name' 
  )  
) AS review  
ON review.name = 'John'
;  

This is the standard SQL/JSON way to query documents. Unfortunately, it is no more efficient: no index scan is possible. Keep in mind that indexes are not part of the SQL standard either.

Efficient query with JSON operators

To efficiently query JSONB data for reviews by a specific person, we need to utilize PostgreSQL's containment operator @> instead of relying on standard SQL:

SELECT title FROM books  
WHERE data->'reviews' @> '[{"name": "John"}]'
;

Now that I filter directly on the table without transforming the document, I can create an index. Since there can be multiple keys per table row, an inverted index is necessary:

CREATE INDEX ON books USING gin ((data->'reviews') jsonb_path_ops)
;

With an index for the JSON path operators, each key corresponds to an item in the array. This can be utilized when querying with an equality filter on the embedded array field:

                                   QUERY PLAN                                   
--------------------------------------------------------------------------------
 Bitmap Heap Scan on books
   Recheck Cond: ((data -> 'reviews'::text) @> '[{"name": "John"}]'::jsonb)
   ->  Bitmap Index Scan on books_expr_idx
         Index Cond: ((data -> 'reviews'::text) @> '[{"name": "John"}]'::jsonb)

GIN (Generalized Inverted Index) is designed for datatypes that contain multiple keys per row, such as array items, or the words, stems, or trigrams of a text. While powerful, GIN has limitations: it cannot serve range queries, optimize ORDER BY clauses, or cover projections (no Index Only Scan).
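To illustrate the range-query limitation, here is a sketch (plan details will vary): a prefix search on the reviewer name is a range-like predicate, so the jsonb_path_ops GIN index cannot serve it and PostgreSQL falls back to reading all rows:

```sql
-- The containment operator @> only supports equality on values,
-- so a prefix match needs a jsonpath predicate, which cannot use
-- the GIN index and results in a sequential scan:
SELECT title
FROM books
WHERE jsonb_path_exists(data, '$.reviews[*] ? (@.name starts with "Joh")');
```

This is exactly the kind of query where, as shown below, MongoDB can still use its regular index.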

Comparison with a document database

While PostgreSQL offers flexibility in storing and indexing JSON documents, it does not replace a document database, where documents are the native type. For instance, in MongoDB, fields within an array are used like any other fields. I insert a similar document:

db.books.insertOne({  
  book_id: 8675309,  
  title: "Brave New World",  
  reviews: [  
    { name: "John", text: "Amazing!" },  
    { name: "Jane", text: "Incredible book!" }  
  ]  
}); 

There is no need for special operators, and I can query the embedded field like any other field:

db.books.find(
  { "reviews.name": "John" } // filter
, { title: 1, _id: 0 }      // projection
);  

[ { title: 'Brave New World' } ]

There is no need for special index type, and I can index the embedded field like any other field:

db.books.createIndex({ "reviews.name": 1 })
;  

The execution plan confirms that the index is used to filter on "reviews.name":

db.books.find(
  { "reviews.name": "John" } // filter
, { title: 1, _id: 0 }      // projection
).explain().queryPlanner.winningPlan
; 

{
  isCached: false,
  stage: 'PROJECTION_SIMPLE',
  transformBy: { title: 1, _id: 0 },
  inputStage: {
    stage: 'FETCH',
    inputStage: {
      stage: 'IXSCAN',
      keyPattern: { 'reviews.name': 1 },
      indexName: 'reviews.name_1',
      isMultiKey: true,
      multiKeyPaths: { 'reviews.name': [ 'reviews' ] },
      isUnique: false,
      isSparse: false,
      isPartial: false,
      indexVersion: 2,
      direction: 'forward',
      indexBounds: { 'reviews.name': [ '["John", "John"]' ] }
    }
  }
}

It is a regular index; its only particularity is that it allows multi-key entries.

Unlike PostgreSQL's GIN index, which requires a Bitmap Scan that doesn't maintain entry order, MongoDB employs a regular index that supports range queries. For instance, if I only know the beginning of a name, I can use a regular expression to filter the results effectively:

db.books.find(  
  { "reviews.name": { $regex: "^Joh" } }, // filter using regex  
  { title: 1, _id: 0 }                    // projection  
).explain().queryPlanner.winningPlan
;  

{
  isCached: false,
  stage: 'PROJECTION_SIMPLE',
  transformBy: { title: 1, _id: 0 },
  inputStage: {
    stage: 'FETCH',
    inputStage: {
      stage: 'IXSCAN',
      keyPattern: { 'reviews.name': 1 },
      indexName: 'reviews.name_1',
      isMultiKey: true,
      multiKeyPaths: { 'reviews.name': [ 'reviews' ] },
      isUnique: false,
      isSparse: false,
      isPartial: false,
      indexVersion: 2,
      direction: 'forward',
      indexBounds: { 'reviews.name': [ '["Joh", "Joi")', '[/^Joh/, /^Joh/]' ] }
    }
  }
}

MongoDB utilized the index efficiently, as the query planner transformed the regular expression /^Joh/ into a range scan, specifically ["Joh", "Joi").

Conclusion

When comparing PostgreSQL and MongoDB, it is essential to understand their querying and indexing mechanisms, and not rely only on their ability to store JSON.
Like other RDBMS, PostgreSQL excels as a centralized, monolithic database utilized by multiple applications. With its specialized JSONB functions and GIN indexes, it adds some flexibility to the normalized tables.
MongoDB is ideal for development agility, particularly in microservices and domain-driven design, where access patterns are well-defined, but the application evolves with high velocity. Its document model aligns well with business objects.

Ultimately, the choice of a database should be based on your team's expertise: comfort with the database syntax, data modeling, optimal indexing, and access to new hires and educational resources. The best database for a specific workload will not perform as expected if there's no expertise to write efficient queries, read execution plans, and index the access paths. Another database may be good enough if it fits the development organization better, provides a better developer experience, and simplifies optimization.