Which open-source AI models are good for urbanism tasks and smart cities?

May 16, 2025 - 12:26

In this research, we tested several large language models (LLMs) on a dataset of urban-related questions.

AI assistants and models are used actively by city residents to look up information about city services, by tourists to check routes, and by the people who shape cities: architects, urbanists, environmental activists, politicians, and municipal government employees.

The Urban MCQ (Multiple-Choice Question) benchmark includes different types of questions on urban studies, urban design, the architecture of public spaces, and related topics.

Questions about best practices in urbanism were collected from well-known research such as the "World Cities Report 2024", from books, and from other publicly available information.

Here is what the questions look like:

How does the "network" differ from the "polynuclear field" in urban planning?

There are also questions about specific places like:

What is a key feature of the Piazza del Campo in Siena that makes it a successful public space?

```json
{
  "id": 5,
  "question": "What type of urban space fosters social interaction?",
  "choices": [
    "Narrow alleys",
    "Pedestrian plazas",
    "Underground malls",
    "Expressways"
  ],
  "answerIndex": 1
}
```
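To make the format concrete, here is a minimal Python sketch of how a record like this could be turned into a prompt and how a model's reply could be mapped back to `answerIndex`. The prompt wording and the helper names `format_prompt` and `letter_to_index` are hypothetical; the post does not describe how the questions were actually presented to the models.

```python
import json
import re
import string
from typing import Optional

# The example record above, as a standalone JSON object.
RECORD = """
{
  "id": 5,
  "question": "What type of urban space fosters social interaction?",
  "choices": ["Narrow alleys", "Pedestrian plazas", "Underground malls", "Expressways"],
  "answerIndex": 1
}
"""


def format_prompt(item: dict) -> str:
    """Render an MCQ item as the question followed by lettered choices (A, B, C, ...)."""
    lines = [item["question"]]
    for letter, choice in zip(string.ascii_uppercase, item["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer with a single letter.")  # hypothetical instruction
    return "\n".join(lines)


def letter_to_index(reply: str, n_choices: int) -> Optional[int]:
    """Map the first standalone valid choice letter in a reply to a 0-based index."""
    valid = string.ascii_uppercase[:n_choices]  # e.g. "ABCD" for four choices
    match = re.search(rf"\b([{valid}])\b", reply)
    return valid.index(match.group(1)) if match else None


item = json.loads(RECORD)
print(format_prompt(item))
# A reply of "B" maps to index 1, which matches answerIndex.
print(letter_to_index("B", len(item["choices"])) == item["answerIndex"])  # True
```

The letter parser is deliberately simple: a free-form reply such as "A pedestrian plaza" would be misread as choice A, which is one reason to ask the model for a single letter only.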

The question dataset and the LLM answer logs are available in this repository.
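Scoring a multiple-choice benchmark comes down to counting how many predicted indices match `answerIndex`. A minimal Python sketch, assuming the model answers have already been parsed into a mapping from question `id` to a 0-based choice index (this mapping is a hypothetical format; the actual logs in the repository may be structured differently):

```python
from typing import Dict, List, Optional


def accuracy(questions: List[dict], predictions: Dict[int, Optional[int]]) -> float:
    """Return the percentage of questions answered correctly, rounded to 2 decimals.

    `predictions` maps a question id to a predicted 0-based choice index
    (hypothetical format). Missing ids or None values count as wrong answers.
    """
    if not questions:
        raise ValueError("the question list is empty")
    correct = sum(1 for q in questions if predictions.get(q["id"]) == q["answerIndex"])
    return round(100 * correct / len(questions), 2)


# Usage with the example question (id 5) and a second, made-up item.
questions = [
    {"id": 5, "answerIndex": 1},
    {"id": 6, "answerIndex": 0},  # hypothetical item for illustration
]
print(accuracy(questions, {5: 1, 6: 2}))  # 50.0
```

Treating unparseable answers as wrong keeps the denominator fixed at the full question count, so every model is scored out of the same total.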

Open LLM/AI Model Performance

Comparison of multiple LLMs on 101 domain-specific questions related to planning, design, and smart cities

Open source matters for cities because it lets them use projects and models with few restrictions. City IT departments or contractors can deploy models on city infrastructure or in a certified cloud without being bound by commercial terms. Of the models compared here, only one has its weights published under an open-source license: Mixtral 8x22B.

Mixtral 8x22B is developed by Mistral AI, and its weights are released under the Apache 2.0 license. This permissive license allows users to use, modify, and distribute the model, even for commercial purposes, as long as they keep the license text and the original copyright notices.

Llama-4 Maverick 17B 128E is part of Meta's Llama model family. Meta releases Llama models under a custom community license that is more restrictive than open-source licenses. It allows research and most commercial use (services with more than 700 million monthly active users need a separate license from Meta), but it prohibits specific applications and requires users to register and accept the license before accessing the weights.

Cohere releases the Command R+ weights under the CC BY-NC license (Creative Commons Attribution-NonCommercial). This license allows others to use, share, and adapt the work, but only for non-commercial purposes. Commercial use requires a separate licensing agreement with Cohere.

Google's Gemma-3 27B belongs to the Gemma model family. Google releases these models under its own Gemma license, which is permissive in the spirit of Apache 2.0 but adds its own terms. The license allows commercial use with certain restrictions, particularly around safety, illegal content, and competing products.

Results for proprietary models with closed weights.

Key results for proprietary models:

  • Claude 3.7 Sonnet leads with the highest score: 96.04%
  • GPT-4.1 follows closely behind: 95.05%
  • Gemini 2.0 Flash, Cohere Command A, and Mistral Large trail with scores of 90.10%, 89.11%, and 88.12%, respectively
  • All five models performed strongly, scoring above 88%
  • A notable gap of roughly 5 to 8 percentage points separates the top two models from the rest
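Because the dataset has 101 questions, each reported percentage should correspond to a whole number of correct answers. The Python sketch below is our own consistency check (not part of the original evaluation code): it recovers those counts from the scores listed above and measures the gap between the top two models and the rest in questions.

```python
TOTAL_QUESTIONS = 101

# Scores for the proprietary models, as reported above (percent correct).
reported = {
    "Claude 3.7 Sonnet": 96.04,
    "GPT-4.1": 95.05,
    "Gemini 2.0 Flash": 90.10,
    "Cohere Command A": 89.11,
    "Mistral Large": 88.12,
}


def correct_count(score: float, total: int = TOTAL_QUESTIONS) -> int:
    """Recover the whole number of correct answers behind a rounded percentage."""
    n = round(score * total / 100)
    # The recovered count must reproduce the reported score exactly.
    if round(100 * n / total, 2) != round(score, 2):
        raise ValueError(f"{score}% does not match any whole count out of {total}")
    return n


counts = {model: correct_count(score) for model, score in reported.items()}
print(counts)  # {'Claude 3.7 Sonnet': 97, 'GPT-4.1': 96, 'Gemini 2.0 Flash': 91, 'Cohere Command A': 90, 'Mistral Large': 89}

# Each question is worth 100 / 101, i.e. about 0.99 percentage points.
print(round(100 / TOTAL_QUESTIONS, 2))  # 0.99

# Gap (in questions) between each of the top two models and each of the rest.
top_two = ["Claude 3.7 Sonnet", "GPT-4.1"]
rest = [m for m in counts if m not in top_two]
gaps = [counts[t] - counts[r] for t in top_two for r in rest]
print(min(gaps), max(gaps))  # 5 8
```

The first two models differ by a single question, so with a dataset of this size, differences of about one percentage point correspond to one answer.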

The dataset currently contains only 101 questions, but we tried to select and write questions of many different types.

If you are an architect, urbanist, or environmental activist working with cities, or if you have ideas for urban-related questions and datasets, please contribute to the project or reach out to me.