Before building a new house or apartment building, residents and developers must first ensure their project adheres to local zoning rules.
But zoning documents are often difficult for both applicants and local government staff to navigate. This can lead to extended back-and-forth between staff and applicants, slowing down the permit process at a time when most communities don’t have enough housing.
Some local leaders have expressed interest to Urban researchers in the promise of using generative AI to make zoning codes easier to understand, whether through screener tools that scan permitting applications to ensure they are complete, or chatbots that can answer development questions. However, leaders are hesitant to adopt tools that haven’t been properly vetted for the quality of the information they produce.
To test how well generative AI tools could interpret zoning codes, we ran a benchmarking exercise that evaluated the capabilities of various large language models (LLMs), building on previous work that explored whether machine learning could help automate the collection of standardized zoning data. To do this, we developed a set of zoning and permitting queries for Minneapolis—a city with a complex, 467-page zoning code and with zoning processes familiar to our team.
Building a retrieval-augmented generation (RAG) system for zoning
With the city of Minneapolis’ zoning code as the source of truth, we needed a method for serving only the relevant parts of the zoning code to the LLMs to answer user questions. We wanted to avoid putting all 467 pages into an LLM’s context window, because larger inputs could be cost prohibitive for local governments trying to build applications at scale.
To serve only relevant parts, we used retrieval-augmented generation (RAG). This technique augments an LLM’s response to a user’s query by retrieving relevant information from the zoning code document that aids the LLM in generating a full response. It allows an LLM to access information that is outside the scope of its training data and helps ground its output in factual and pertinent information.
As part of the RAG framework, we needed to first partition the zoning code into smaller chunks that could be stored in a searchable database. For chunking, we used the open-source LangChain platform, which provides several tools for splitting text. We experimented with multiple approaches, including a recursive splitter that tries to find common separator characters and splitters from the Unstructured library, which can preserve aspects of document structure like headings, paragraphs, and tables.
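The recursive splitting idea can be sketched in plain Python. The function below is a simplified stand-in for LangChain's recursive splitter, written only to show the mechanics; the parameter names, defaults, and greedy packing logic here are our illustrative assumptions, not LangChain's implementation.

```python
def recursive_split(text, separators=("\n\n", "\n", ". ", " "), chunk_size=500):
    """Greedily pack pieces split on the coarsest available separator,
    recursing into finer separators when a single piece is still too long."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        if sep in text:
            chunks, current = [], ""
            for part in text.split(sep):
                candidate = current + sep + part if current else part
                if len(candidate) <= chunk_size:
                    current = candidate
                elif len(part) > chunk_size:
                    # A single piece is still too big: recurse with finer separators.
                    if current:
                        chunks.append(current)
                    chunks.extend(recursive_split(part, separators, chunk_size))
                    current = ""
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            return chunks
    # No separator applies: fall back to fixed-width slices.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

The appeal of this approach for a legal document is that paragraph and sentence boundaries are respected whenever possible, so a chunk rarely cuts a zoning provision in half.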
To create this searchable database, we then generated embeddings—numeric representations of the zoning code represented in vector form—using the E5 family of open-source embedding models developed by Microsoft and available from HuggingFace. We stored the resulting embedding vectors in the open-source Facebook AI Similarity Search (FAISS) vector database, which allows the system to do an efficient similarity search and retrieve the text chunks most relevant to a user’s query.
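The embed-store-search loop can be sketched as follows. In our pipeline the embeddings came from the E5 models and the similarity search ran in FAISS; to keep this sketch self-contained, a simple bag-of-words embedding and a brute-force numpy search stand in for both.

```python
import numpy as np

def fit_vocab(chunks):
    """Build a token index. A real system would use a pretrained embedding
    model (e.g., the E5 family) instead of this bag-of-words stand-in."""
    vocab = {}
    for chunk in chunks:
        for tok in chunk.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def embed(texts, vocab):
    """Map texts to unit-normalized token-count vectors."""
    vecs = np.zeros((len(texts), len(vocab)))
    for i, text in enumerate(texts):
        for tok in text.lower().split():
            if tok in vocab:
                vecs[i, vocab[tok]] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.clip(norms, 1e-9, None)

def retrieve(query, chunks, chunk_vecs, vocab, k=3):
    """Return the k chunks most similar to the query. FAISS does this search
    efficiently at scale; plain matrix multiplication suffices for a sketch."""
    q = embed([query], vocab)[0]
    scores = chunk_vecs @ q  # cosine similarity: vectors are unit length
    return [chunks[i] for i in np.argsort(-scores)[:k]]
```

Swapping the stand-in `embed` for a real embedding model and the brute-force search for a FAISS index changes the quality and speed of retrieval, but not the shape of the pipeline.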
Finally, we selected several LLMs to handle the queries themselves, including a mix of open-weight models such as Meta Llama 3.2 1B Instruct and Mistral Pixtral Large 25.02, lightweight models like OpenAI’s GPT-5 mini, and a more powerful model, ChatGPT 5.1. We accessed the Meta and Mistral models through Amazon Bedrock’s managed services, GPT-5 mini through OpenAI’s API, and ChatGPT 5.1 through the graphical user interface on chatgpt.com. ChatGPT 5.1 allows for larger context windows, so we supplied the entire zoning code in context to have a non-RAG approach for comparison (though this would not be a cost-sustainable approach for a functional local chatbot).
At this point, we had a ready-to-test system that could retrieve specific document chunks from the Minneapolis zoning code and use them to augment LLMs’ responses to user queries.
But we still needed an evaluation method grounded in realistic scenarios local residents might present to genAI tools—something that would allow us to compare different technical configurations across different types of user questions.
Crafting an evaluation method
To make the benchmarking exercise realistic, we designed our evaluation across three dimensions: prompt audience, prompt complexity, and prompt repetition.
We defined the prompt audience by developing two user personas who might use a genAI-enabled zoning tool:
- User 1: A professional developer who wants to apply for a building permit for a new multi-family apartment building in Minneapolis.
- User 2: An owner of a single-family home in Minneapolis who wants to build an accessory dwelling unit (ADU) on their property.
We then developed a set of questions that each user might ask, recognizing that a professional developer is likely to be more familiar with zoning regulations than a typical homeowner.
This list included
- simple information extraction questions, like which zoning requirements apply to the user’s district,
- questions requiring comparison and synthesis, like assessing the legality of the user’s proposed new development, and
- questions requiring a specific format, like a checklist for a new building permitting application.
Importantly, we assumed users 1 and 2 knew which zoning district they wanted to build in.
Districts can be both complex and overlapping. Without access to a machine-readable zoning map, even LLMs with web search could not correctly answer what district an address was in. The interactive content on Minneapolis’ website made the information more accessible to human readers but was harder for the LLM to parse. For this reason, to ask more detailed questions about a given district, we had to supply the zoning district directly to the model in the prompt.
We defined various levels of prompt complexity for the same set of questions. Questions could be sparsely phrased using keywords similar to a Google search, simply phrased in a conversational tone, or more elaborately structured using prompt engineering. Finally, we iterated the same set of prompts for each user multiple times and tested the variations in LLM responses across runs.
The table below maps out a few examples broken down by user, prompt complexity, and task evaluated.
| Prompt | User Persona | Prompt Complexity | Question Type |
| --- | --- | --- | --- |
| “Summarize zoning ordinance for urban neighborhood 3 (UN3) zoning district in Minneapolis, MN.” | User 1: professional housing developer | Similar to a Google search | Summarization |
| “How is a single-family home located at [fictional address] zoned?” | User 2: single-family homeowner | Simply phrased in a conversational tone | Information extraction |
| “Aim: I am trying to fill a permit application for constructing a small multi-family building in urban neighborhood 2 (UN2) zoning district, Minneapolis, MN. Provide a checklist of all information and documentation I need to fill the permit application successfully per the zoning ordinance. Output: A checklist of requested information stated in the Aim.” | User 1: professional housing developer | Prompt-engineered | Request for response in a particular format |
For a deeper look, check out the full evaluation matrix and codebase for our benchmarking exercise.
Next, we created a system prompt, or a set of instructions that guides how the LLM generates responses. Our system prompt encouraged specific behaviors in the LLM responses and imposed instructional guardrails such as refraining from responding when the model deemed the retrieved context to be insufficient.
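A system prompt with that kind of guardrail might be wired up as follows. The prompt text and message format below are a hypothetical sketch of the shape, not our exact instructions; the `[Context N]` labels echo the context citations visible in the model responses quoted later.

```python
# Hypothetical system prompt illustrating the guardrails described above;
# not the exact instructions used in the benchmarking exercise.
SYSTEM_PROMPT = (
    "You are a zoning assistant for Minneapolis, MN. Answer the user's "
    "question using ONLY the numbered context passages provided. Cite the "
    "passages you rely on. If the context is insufficient to answer, say so "
    "and do not guess."
)

def build_messages(query, retrieved_chunks):
    """Assemble a chat-completion request from retrieved zoning-code chunks."""
    context = "\n\n".join(
        f"[Context {i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]
```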
For each prompt in our repertoire, we asked an Urban Institute zoning expert familiar with Minneapolis zoning regulations to provide a ground-truth response.
We then compared the LLM responses we received against these ground-truth answers and manually assessed each response using the following criteria:
- Accuracy: Is the response to the user’s question factually correct?
- Relevance: Does the response address the user’s question directly?
- Reference to context: Does the RAG system pull the correct sections from the zoning code to answer the question?
- Consistency: Does the model give a substantively similar answer when asked a question multiple times?
- Confidence: Does the model capture uncertainty in its response when there is no clear answer or it does not have adequate information?
Each criterion was scored as fully met, partially met, or not met.
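One way to aggregate such ratings is a simple numeric rubric. The 1.0/0.5/0.0 mapping below is an illustrative assumption on our part; the exercise itself recorded only the categorical ratings.

```python
# Hypothetical numeric mapping for the categorical ratings; the exercise
# recorded only "fully met" / "partially met" / "not met".
SCORE = {"fully met": 1.0, "partially met": 0.5, "not met": 0.0}
CRITERIA = [
    "accuracy",
    "relevance",
    "reference to context",
    "consistency",
    "confidence",
]

def score_response(ratings):
    """Average the rubric ratings for one LLM response across all criteria."""
    return sum(SCORE[ratings[c]] for c in CRITERIA) / len(CRITERIA)
```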
What we found
We found the usability and accuracy of the LLMs’ responses were heavily shaped by limitations in retrieval and the specificity of user queries.
Since we constrained responses to rely strictly upon retrieved context, poor retrieval performance directly resulted in noisy and ineffective responses. Zoning ordinances are inherently interconnected documents, with relevant information distributed across disparate sections, often making context retrieval challenging.
In cases where the relevant context was more narrowly constrained to a particular section, answers tended to be objective, specific, and closely aligned with exact keywords referenced in the question. Click on the box below to see an example of a question and response from our Mistral model that correctly covers primary uses, permitted structures, and even interaction with relevant overlay districts.
Question
“Aim: I am a professional developer and I want to build in urban neighborhood 3 (UN3) zoning district, Minneapolis, MN. I need to know everything relevant about building in this zoning district as it appears in the relevant zoning district and built form overlay district guidelines.
Output: Provide a summary for the Aim question along with references.”
Response
“Building in the Urban Neighborhood 3 (UN3) zoning district in Minneapolis, MN, involves specific regulations and guidelines tailored to residential and small-scale institutional or civic uses. Below is a summary of key points relevant to developing in this district, along with references to the zoning code:
- Primary Use: The UN3 district is designed to allow residential uses, small-scale institutional and civic uses, and select commercial uses through the adaptive reuse of existing structures (Article II, Section 530.100).
- Permitted Structures: Residential buildings with four (4) or more dwelling units are permitted on lots of record, provided they meet yard dimensions and other district requirements (excluding lot area or width) (Context 6).
- Institutional and Civic Uses:
- Community Centers: Allowed with a minimum lot size of 5,000 sq. ft. (Context 3).
- Educational Facilities: Permitted with a minimum lot size of 10,000 sq. ft. (Context 3).
- Parks and Public Open Spaces: Allowed with a minimum lot size of 20,000 sq. ft. (Context 3).
- Recreational Facilities: Permitted with a minimum lot size of 20,000 sq. ft. (Context 3).
- Compliance with Standards: All uses, including nonconforming ones, must maintain compliance with general performance standards, landscaping, screening, and curbing requirements applicable to the district (Section 545.1000).
- Built Form Overlay District: Properties in UN3 are also subject to regulations from a Built Form Overlay District. While the specific overlay district is not provided, these districts (e.g., BFI1, BFI2, BFC3) are described in Chapter 540 and may impose additional design and development standards (Section 530.20, Context 5).
- Conflicting Regulations: If there are conflicts between the primary zoning district (UN3) and the Built Form Overlay District, the overlay district regulations will govern (Section 540.30).
For detailed regulations, developers should consult Chapter 540 (Built Form Overlay Districts) and Chapter 550 (Development Standards) of the Minneapolis zoning code.”
We initially assumed that limiting the RAG pipeline to return only the few chunks of context most semantically similar to the query would lead to more precise responses. However, retrieving only limited chunks of the ordinance had both advantages and drawbacks.
On the one hand, we found that the context supplied to the LLM was generally correct and accurate. On the other hand, responses often lacked completeness because the right context was distributed across the document and the retrieval focus was narrow. Many relevant pieces of information never made it to the model, leading to cases where the context it did have was misapplied or misinterpreted toward an incorrect answer.
This challenge was compounded by subtle differences in zoning district names and semantically similar phrasing across sections, where keyword matches could be misleading (e.g., if the user asked about a specific zoning district called “Urban Neighborhood 3,” RAG might surface all occurrences of the keywords “urban neighborhoods”). Though this information wasn’t fundamentally incorrect, it did crowd out other text that was more relevant and specific to the user, place, or proposed development in the prompt.
We are currently working on RAG systems that experiment with the number of chunks returned to identify the tradeoffs of precise but limited context versus broader coverage of relevant context.
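A sweep over the number of retrieved chunks can be evaluated with a recall-at-k style check: for each benchmark question, does the top-k retrieved set contain the zoning-code sections the expert answer relies on? The similarity scores and relevant-chunk indices below are hypothetical, just to show the mechanics.

```python
import numpy as np

def recall_at_k(scores, relevant_ids, k):
    """Fraction of ground-truth chunks that appear in the top-k retrieved."""
    top = set(np.argsort(-scores)[:k].tolist())
    return len(top & set(relevant_ids)) / len(relevant_ids)

# Hypothetical similarity scores for 8 chunks; chunks 2 and 5 hold the answer.
scores = np.array([0.1, 0.4, 0.9, 0.2, 0.3, 0.7, 0.05, 0.6])
print(recall_at_k(scores, [2, 5], 2))  # 1.0: both relevant chunks in top 2
```

Running this check across a grid of k values makes the tradeoff concrete: recall rises with k, while the share of on-topic context in the prompt tends to fall.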
In our review of LLM responses, we found that unless the user was armed with procedural knowledge and could craft specific questions of the zoning ordinance, the responses were likely to be generic, hard to understand, and sometimes incorrect.
Take for example what happened when user 2 (our nonexpert persona) asked about the zoning constraints in a specific district, but didn’t specify the building requirements of interest.
We found the LLM responses were of limited utility and often missed vital context.
When asked about a “zoning requirement checklist for multi-family home,” our Mistral model reverted to guidance like, “Ensure that yard dimensions and all other requirements for the district are met, excluding lot area or lot width,” without specifying what those requirements are. Click on the box below to see the full question and response.
Question
“Zoning requirement checklist for multifamily home construction in urban neighborhood 3 (UN3) zoning district Minneapolis, MN”
Response
“Based on the provided context, here is a zoning requirement checklist for constructing a multi-family home in the Urban Neighborhood 3 (UN3) zoning district in Minneapolis, MN:
Zoning Requirement Checklist for Multi-Family Home Construction in UN3 District
- Lot Requirements:
- Lot of Record: The residential building with four or more dwelling units must be on a lot of record existing on the effective date of the ordinance.
- Yard Dimensions: Ensure that yard dimensions and all other requirements for the district are met, excluding lot area or lot width.
- Lot Merger: If the lot is nonconforming as to lot width or lot area and is located within the SH Shoreland Overlay District, it must be combined with adjacent lots under the same ownership to meet the full lot width and lot area requirements.
- Maximum Lot Area: Lots nonconforming as to maximum lot area shall have all the rights of a conforming lot and shall not be enlarged except as allowed in Chapter 540, Built Form Overlay Districts, Article VIII, Lot Dimensions.
- Floor Area Ratio (FAR):
- FAR for Multi-Family Dwellings: The FAR for multi-family dwellings is 1.4. Ensure the proposed building adheres to this ratio.
- Outdoor Children's Play Area:
- Requirement: If the development includes at least one dwelling with three or more bedrooms, an outdoor children's play area must be provided.
- Criteria: The lot area must be at least 20,000 square feet, located in the UN3 district, at least 1,500 feet from a public park, and outside the UA University Area Overlay District.
- Play Area Size: Minimum of 50 square feet for each unit containing three or more bedrooms, with a minimum of 300 square feet and a maximum of 2,000 square feet.
- Design: The play area must be secure, separated from parking and maneuvering areas, and designed to facilitate adult supervision.
- Affordable Housing:
- Inclusionary Housing: Comply with the city's inclusionary housing requirements by providing affordable units on-site.
- Unit Requirement: The development must include at least 20 dwelling units to qualify for the affordable housing bonus.
- Premium Value: The premium value shall be double if the affordable units are on-site and affordable at or below 30% of the area median income.
- Child Care Center (if included):
- Space Requirement: Not less than 2,000 square feet of enclosed space.
- Compliance: Meet the specific use standards for child care centers in Chapter 545, Use Regulations.
- Construction Type:
- Building Code: Not less than 60% of the floor area of the development must be within structures classified as Type IA, Type IB, Type IIA, or Type IV by the building code.
- Enclosed Parking:
- Location: All off-street parking must be within the building, entirely below grade, or in a parking garage of at least two levels.
- Drive Aisles: All drive aisles must be within the structure, except in the Interior 3 District, where drive aisles and vehicle maneuvering areas may be outdoors.
- Electric Vehicle Charging: Provide electric vehicle charging infrastructure as per section 555.320.
- Environmental Sustainability:
- Climate Resiliency: Achieve at least one of the following standards:
- Any performance standard (LEED, PHIUS, EGC, etc.) that meets the Minnesota Sustainable Building 2030 (SB 2030) 2010-2014 Energy Standard.
- Derive not less than 40% of electricity usage from renewable energy.
- Compliance with Standards:
- General Performance Standards: Maintain compliance with all general performance standards, landscaping, screening, and curbing requirements.
- Feasibility: The zoning administrator may require compliance with a portion of applicable standards if strict compliance is practically or economically infeasible.
- Site Plan Review:
- Design Standards: Comply with applicable regulations, including front entrance, window area, and walkway requirements.
- Points System: Obtain the minimum number of required site plan review points from Table 550-2, Single-, Two-, and Three-Family Dwellings.
This checklist summarizes the key zoning requirements for constructing a multi-family home in the UN3 district based on the provided context. Always consult the full zoning ordinance and local authorities for complete and detailed requirements.”
Because our benchmarking relied on single question-and-answer exchanges, the model didn’t ask users to provide information on any relevant overlay districts (i.e., sets of additional or updated rules layered on top of the rules for a base zoning district) that might apply to their address. These overlays affect requirements for things like floor area ratio or lot coverage. Having actual users test the model could surface more of these crucial inputs.
Looking ahead
Without solving the retrieval problem, local governments would likely gain little by using genAI to pull information from long, dense policy documents. Even when we varied aspects of the RAG pipeline (such as document chunking, embedding, and the LLM used for analysis), we found few substantive gains in response quality.
As genAI models evolve to support even larger context windows, it may be possible to retain entire documents in context. However, this approach involves trade-offs. As we observed with ChatGPT 5.1, providing the entire zoning code as reference tended to lead to generic, meandering responses rather than precise ones, perhaps because finding needles in a 467-page haystack was simply too difficult.
Still, we are interested in iterating with methods that retrieve more or longer document chunks, and we are actively testing metadata tagging and filtering of these chunks, which may make retrieval much easier.
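Metadata filtering of the kind we are testing might look like the sketch below, where chunks carry tags that narrow the candidate pool before semantic search runs. The tag names (`district`, `chapter`) are hypothetical placeholders, not the actual schema in our pipeline.

```python
# Hypothetical metadata schema; the field names are illustrative only.
chunks = [
    {"text": "FAR for multi-family dwellings is 1.4 ...", "district": "UN3", "chapter": "540"},
    {"text": "Community centers require 5,000 sq. ft. ...", "district": "UN3", "chapter": "545"},
    {"text": "Parking requirements for commercial uses ...", "district": "C1", "chapter": "555"},
]

def filter_chunks(chunks, **tags):
    """Keep only chunks whose metadata matches every requested tag,
    shrinking the pool before any embedding similarity search runs."""
    return [c for c in chunks if all(c.get(k) == v for k, v in tags.items())]
```

Pre-filtering this way could sidestep the keyword-confusion problem described earlier, where queries about one district surfaced text about similarly named districts.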
We also didn’t provide any instructions specific to Minneapolis’ zoning processes in the system prompts. Further information about the naming convention of zoning districts, Minneapolis’ use of both base and overlay districts, or other important local knowledge could help the LLM produce more complete, locally relevant answers.
We have also begun to test agentic models that can independently execute multistep processes. These could be better able to jump across documents, parse metadata to extract the context they need, and then synthesize relevant information.
Still, we don’t expect them to magically solve the complexity of zoning ordinances unless more work is done on the AI-readiness of data. More investment in machine-readable, well-structured text data is needed to provide reliable information to AI models—both for RAG and agent-based approaches.
Much of the key information in a zoning code sits in disparate places that could be better organized with strong metadata tagging, or within tables whose structure cannot be extracted in meaningful ways by traditional document chunking methods. We did not apply any preprocessing or tagging for this exercise, to more realistically approximate the amount of setup that would need to be done by local governments. It is clear that this negatively affected the results.
Even with a domain expert at our disposal, we also found evaluating the AI’s quality to be a subjective, time-consuming, and ad hoc process without proper frameworks and checks for inter-rater reliability—whether the raters are two humans or a human and an LLM.
Given that Urban routinely produces in-depth research and policy insights, we are continuing to explore comprehensive evaluation methods, tools, and frameworks for assessing where genAI tools can be useful, what pre-processing and analytical steps can make them more so, and how to measure this empirically and rigorously.
If you have thoughts or questions about this work, please reach out to Ridhi Purohit ([email protected]) or Judah Axelrod ([email protected]).