Real-Time High-Fidelity Data Infrastructure for LLMs in Financial Services
The financial industry is rapidly adopting large language models
(LLMs) and AI agents to automate research, customer service, and trading. Recent surveys show
that nearly three-quarters of wealth and asset managers are making “moderate to large investments” in generative AI in 2025 – up sharply from
the prior year (investmentnews.com).
In parallel, cloud-data leaders like Snowflake report thousands of weekly users of AI/ML tools
on their platform (ainvest.com).
This surge reflects a competitive imperative: firms hope to turn proprietary financial data into insights via LLMs, but doing so
demands cutting-edge, real-time data infrastructure. Banks, hedge funds and data providers now
require pipelines that continuously ingest market feeds, filings, news and internal records,
cleanse and vectorize them, and deliver context to LLMs for both training and inference.
Challenges in Sourcing and Engineering Financial Data
Financial data presents unique hurdles. Structured data (e.g.
time-series prices, transaction records, fundamentals) and unstructured text (news articles,
filings, research reports, call transcripts) must be combined and normalized for LLM processing.
The sheer volume and variety are overwhelming:
market data streams, international accounting reports, and alternative data sources all require
extensive cleaning and alignment (pacificdataintegrators.com, arxiv.org).
For example, building models on complex instruments (bonds, derivatives) often involves
resolving corporate actions and disparate formats (pacificdataintegrators.com).
Regulatory compliance adds overhead: data must meet strict governance rules before use, further
slowing pipelines. Importantly, financial markets move fast – using stale data quickly degrades
model accuracy. As one industry analysis notes, ingesting “continuous streams of financial information” (tick data, breaking news,
regulatory alerts) is critical to keep models
up-to-date (pacificdataintegrators.com).
Failing to do so can lead to “inaccurate insights.”
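To make the streaming requirement concrete, the sketch below consumes a hypothetical Kafka topic of tick events and normalizes each record before it enters the cleansing and vectorization stages. The topic name, broker address, and JSON fields are illustrative assumptions, not a real feed specification.

```python
# Minimal sketch: consume a hypothetical "market-ticks" Kafka topic and
# normalize each event before downstream embedding/feature pipelines.
# Topic name and JSON fields are illustrative assumptions, not a feed spec.
import json
from datetime import datetime, timezone

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "market-ticks",                      # assumed topic name
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

def normalize(event: dict) -> dict:
    """Align a raw tick to a common schema (UTC time, upper-case symbol)."""
    return {
        "symbol": event["symbol"].upper(),
        "price": float(event["price"]),
        "ts_utc": datetime.fromtimestamp(event["epoch_ms"] / 1000, tz=timezone.utc),
    }

for msg in consumer:
    tick = normalize(msg.value)
    # hand off to the cleansing/vectorization stage here
    print(tick)
```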
A second challenge is data quality and availability. Many critical datasets are proprietary
or behind paywalls. While filings and economic statistics are public, private trading and
customer data are often siloed by business-unit or geography. In practice there is abundant public financial data but a scarcity of private, high-quality data for model trainingarxiv.org.
Furthermore, much financial data is privacy-sensitive (credit scores, trading positions) and
subject to strict rulesarxiv.org.
Enterprises often resort to synthetic data generation to mitigate privacy, but synthetic sets
may lack the real-world nuances and correlations found in actual marketsarxiv.org.
This makes truly high-fidelity training difficult: models trained on limited or synthetic data
may miss subtle patterns in real markets.
Even when data is available, fine-tuning LLMs for finance is expensive and slow. High-quality
labeled examples (e.g. annotated financial documents or sentiment tags) are scarce. Research
shows that “availability of high quality human-labeled
data is critical” for fine-tuning, yet producing such labels in finance is hard due to
regulations and specialized expertise (arxiv.org).
As a result, fine-tuning open-source models can cost tens of thousands of dollars per iteration.
In response, open projects like FinGPT emphasize lightweight, incremental updates: FinGPT can
fine-tune an LLM on new financial data for under $300, enabling monthly or weekly refreshes (github.com).
This contrasts with proprietary efforts like BloombergGPT, which required 53 days and $3M to
train on a blend of financial data (github.com).
Nonetheless, the challenge remains: continual
retraining on fast-moving financial data strains both engineering teams and budgets.
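For a sense of how such low-cost refreshes work mechanically, here is a minimal sketch of parameter-efficient fine-tuning with LoRA adapters via Hugging Face’s peft library, in the spirit of FinGPT’s approach. This is not FinGPT’s actual training code; the base model, target modules, and hyperparameters are assumptions.

```python
# Minimal sketch of LoRA-style parameter-efficient fine-tuning, in the spirit
# of FinGPT's low-cost refreshes. Not FinGPT's actual code: the base model,
# target modules, and hyperparameters below are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model  # pip install peft

base = "meta-llama/Llama-2-7b-hf"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically <1% of weights are trainable

# From here, a standard transformers Trainer loop over fresh financial text
# updates only the adapter weights, keeping each refresh cheap and fast.
```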
Innovations in Real-Time Pipelines, Vector Stores, and RAG
To meet these challenges, vendors and open-source projects
are building new data infrastructure tailored for LLMs. Ingestion pipelines now often use
streaming and lakehouse architectures (e.g. Apache Kafka/Fluentd feeds into Delta Lake or
Snowflake Data Cloud), so that both tick data and document streams flow into a unified store.
Modern pipelines apply semantic indexing as data
arrives: text is chunked and embedded, then stored in a vector database (open-source or
built-in) for fast retrieval. This enables Retrieval-Augmented Generation (RAG) workflows. RAG
is a paradigm where the model first retrieves relevant documents or data from a corpus, then
generates an answer conditioned on that context (databricks.com).
In other words, instead of relying purely on parametric memory (which can hallucinate), the LLM
is grounded in factual data extracted from the company’s own datasets or the market.
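A minimal sketch of this retrieve-then-generate pattern follows, using sentence-transformers for embeddings and plain cosine similarity in place of a production vector database; the documents and model name are illustrative.

```python
# Minimal RAG sketch: embed a small document set, retrieve the closest
# passages for a query, and assemble a grounded prompt. The embedding model
# and documents are illustrative; production systems would use a vector DB.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

docs = [
    "Fed minutes signal two more rate hikes this year.",
    "Bank earnings beat estimates on higher net interest income.",
    "Oil prices fall as OPEC output rises.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
doc_vecs = encoder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q  # dot product of unit vectors = cosine similarity
    return [docs[i] for i in np.argsort(-scores)[:k]]

question = "How did recent Fed minutes affect banking stocks?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# `prompt` would now be sent to the LLM of choice for grounded generation.
```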
Figure 1: A typical
RAG-based data architecture for financial LLMs. Structured and unstructured data (market
feeds, filings, transcripts) flow into a unified storage layer. Precomputed embeddings
populate a vector index for semantic search. At inference time, LLM prompts first query the
vector store for relevant financial content, which is then used as context for generation. In
finance, connecting real-time data sources (price ticks, news) with RAG retrieval ensures the
model’s outputs reflect the latest market information (pacificdataintegrators.com, shakudo.io).
In Figure 1, we illustrate how an LLM-powered app might work
in finance. Real-time data (prices, trades, news) and static corpora (research reports,
historical filings) are ingested into the data lake. A vector-search service maintains
embeddings of all documents. During inference, a user question (e.g. “How did recent Fed minutes
affect banking stocks?”) triggers a semantic query to retrieve relevant paragraphs. The LLM then
generates a response conditioned on those retrieved passages. Leading platforms automate this
entire flow. For example, Snowflake’s Cortex offers a “Cortex Search” feature for in-warehouse semantic search, plus serverless
LLM functions to complete or summarize text on top of that data (medium.com, snowflake.com).
Databricks’ Mosaic AI similarly provides managed RAG tooling: it integrates vector search over
lakehouse data and LLM inference in one platform (databricks.com, techcrunch.com).
In practice, these innovations greatly reduce hallucinations and latency: embedding retrieval
ensures up-to-date, high-fidelity data is attached to each query.
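As an illustration of the in-warehouse pattern, the sketch below calls Snowflake’s documented SNOWFLAKE.CORTEX.COMPLETE function from Python so the text never leaves the governed environment. The connection details, table, column, and model name are placeholders.

```python
# Hedged sketch of Snowflake Cortex's in-warehouse pattern: the document text
# stays inside Snowflake while an LLM function runs over it in SQL.
# Credentials, table, column, and model name are placeholders; consult
# Snowflake's docs for the Cortex functions available in your region.
import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account="my_account",   # placeholder credentials
    user="my_user",
    password="...",
)
cur = conn.cursor()
cur.execute(
    """
    SELECT SNOWFLAKE.CORTEX.COMPLETE(
        'mistral-large',                -- model name: placeholder
        'Summarize: ' || filing_text    -- prompt built from governed data
    )
    FROM filings                        -- assumed table and column
    LIMIT 1
    """
)
print(cur.fetchone()[0])
```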
Other key innovations include vector databases and real-time feature stores. In the open-source world, projects like
Weaviate, Milvus or Qdrant specialize in storing high-dimensional embeddings and supporting
nearest-neighbor queries at scale (datacamp.com).
Enterprises also add embedding indices to SQL/NoSQL databases: Snowflake and Databricks now
offer built-in vector indexes as part of their AI stacks. Furthermore, modern MLOps tools track
data drift and automate re-indexing. For instance,
continuous monitoring services can detect when market regimes shift, triggering updates to the
embedding database or retraining of models. These technologies create an “AI-ready” data fabric
where new financial events can be quickly reflected in model inputs.
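One simple way to approximate such drift monitoring is to compare the centroid of recent document embeddings against a baseline. The sketch below is illustrative; the threshold is an arbitrary placeholder that would need calibration on historical data.

```python
# Illustrative drift check: compare the centroid of this week's document
# embeddings against a baseline centroid; a large cosine distance suggests a
# regime shift and can trigger re-indexing or retraining. The 0.15 threshold
# is an arbitrary placeholder, not a recommended value.
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def drift_detected(baseline: np.ndarray, recent: np.ndarray,
                   threshold: float = 0.15) -> bool:
    """Flag drift when the recent embeddings' centroid moves off baseline."""
    return cosine_distance(baseline.mean(axis=0), recent.mean(axis=0)) > threshold

# Example with random stand-in embeddings (real ones come from the vector store)
rng = np.random.default_rng(0)
baseline = rng.normal(size=(1000, 384))
recent = rng.normal(loc=0.3, size=(200, 384))  # shifted distribution
if drift_detected(baseline, recent):
    print("Regime shift suspected: schedule re-embedding / re-indexing.")
```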
Enterprise Platforms vs. Open-Source Tools
Financial institutions face a choice between vendor solutions
and open-source toolkits for their LLM infrastructure.
- Snowflake Cortex and Data Cloud: Snowflake has extended its data warehouse into a full AI
stack. Cortex provides built-in LLM functions (e.g. summarization, translation, sentiment)
that run serverlessly on GPU-accelerated nodes (snowflake.com).
Crucially, all data stays inside Snowflake’s governed environment – there is “no need to
move data” to an external model (medium.com, snowflake.com).
Snowflake also offers semantic search over tables of text, enabling RAG end-to-end in SQL.
New features expected to reach general availability by mid-2025 include Cortex AISQL, which
lets analysts embed generative prompts directly in SQL queries (ainvest.com), and SnowConvert
AI tools to assist cloud migrations. With 125+ AI capabilities delivered in Q1 2025 alone
(ainvest.com), Snowflake aims to provide a one-stop “AI Data Cloud” where data ingestion,
vectorization, and LLM inference are tightly integrated.
- Databricks Mosaic AI and Lakehouse: Databricks has similarly combined data and AI. After
acquiring MosaicML, it launched Mosaic AI – a suite of tools for LLM development and RAG.
Mosaic AI includes a Vector Search service on Delta tables, managed Agent Frameworks for
building multi-step LLM applications, and optimized pipelines for LLM fine-tuning
(techcrunch.com, databricks.com). In March 2025 Databricks announced a partnership with
Anthropic: its Mosaic platform will natively host Anthropic’s Claude models, letting
enterprises build private agents on their data (databricks.com).
The combined Databricks/Anthropic offering allows clients to run advanced LLMs (including
Claude 3.7) directly in the Databricks environment, with end-to-end data governance and
evaluation tools. Overall, Databricks emphasizes flexibility and governance: its platform is
used by companies that “make multiple calls to a model or multiple models” and augment
outputs with proprietary data for accuracy and safety (techcrunch.com).
This reflects the reality that enterprise LLM deployments often chain together diverse
models, tools, and retrieval systems in order to serve specialized business logic.
- Open-Source Projects (FinRL, FinGPT, Hugging Face, etc.): On the open side, the finance
community has created specialized toolkits. FinRL provides an open reinforcement-learning
framework for trading and portfolio tasks (which can integrate LLM outputs as features).
FinGPT is an open project that fine-tunes foundation LLMs on financial text; it emphasizes
low-cost periodic updates to keep models current (github.com).
More generally, the Hugging Face ecosystem offers models, datasets and inference tools that
many institutions use. For example, Capital Fund Management (an asset manager) used Hugging
Face’s Llama-3 models and Inference Endpoints to build a named-entity recognition (NER)
system for financial news. This LLM-assisted pipeline improved NER F1 accuracy by 6.4% over
earlier methods, while cutting costs by up to 80× via smaller models (huggingface.co).
Such cases illustrate how off-the-shelf open models and APIs (e.g. Hugging Face Transformers
pipelines, LangChain, or modular vector stores) can accelerate deployment; a minimal NER
sketch appears after this list. Compared to enterprise products, open-source stacks require
more integration work but offer flexibility and rapid innovation. Firms often mix both
approaches: for instance, an investment bank might run its data lake on Snowflake while
using open-source Python libraries (LangChain, LlamaIndex) to orchestrate RAG calls to a
Hugging Face or Anthropic model.
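For reference, here is a minimal sketch of an off-the-shelf NER pipeline in the spirit of the Capital Fund Management case, using a generic public checkpoint from the Hugging Face Hub rather than CFM’s actual models.

```python
# Minimal sketch of an off-the-shelf NER pipeline for financial headlines.
# The model below is a generic public NER checkpoint, not CFM's actual system.
from transformers import pipeline  # pip install transformers

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",    # assumed publicly available NER model
    aggregation_strategy="simple",  # merge sub-word tokens into entities
)

headline = "JPMorgan raises its price target on Apple after strong iPhone sales."
for entity in ner(headline):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
# A production setup would swap in a finance-tuned model and batch over a feed.
```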
Use Cases: Hedge Funds, Banks, and Data Providers
Hedge Funds and
Asset Managers. Quantitative funds are exploring LLMs for research and execution. For
example, generative models can parse and summarize thousands of news articles or earnings
transcripts to generate trading signals. Some hedge funds augment traditional quant strategies
with LLM-based sentiment or risk models, retraining them daily on live market and news data. In
algorithmic trading, reinforcement-learning frameworks such as FinRL can even incorporate textual
insights: e.g., an RL agent might use an LLM’s forecast of market-moving events as features in
its trading policy. More prosaically, funds use LLM chatbots to let analysts query databases of
internal research or fact-sheets in natural language. In all cases, low-latency data feeds and
rigorous workflow automation are critical: a minute-old news flash could be the difference in a
profitable trade.
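As a toy illustration of the sentiment-as-feature idea, the sketch below scores a day’s headlines with the public FinBERT model and collapses them into one signed feature; the scoring scheme and feature definition are assumptions, not a production signal.

```python
# Illustrative sketch: turn the day's headlines into a single sentiment
# feature that a quant or RL trading policy could consume. FinBERT is a real
# public model, but the scoring scheme here is an arbitrary assumption.
from transformers import pipeline  # pip install transformers

sentiment = pipeline("sentiment-analysis", model="ProsusAI/finbert")

headlines = [
    "Regional bank shares slide after surprise deposit outflows.",
    "Fed officials hint at pause in rate hikes.",
]

score_map = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}
results = sentiment(headlines)
daily_feature = sum(score_map[r["label"]] * r["score"] for r in results) / len(results)
print(f"daily_sentiment_feature = {daily_feature:+.3f}")
# An RL agent (e.g., in FinRL) could append this value to its state vector.
```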
Banks and
Financial Services. Large banks deploy LLMs to improve customer and employee
workflows. For instance, Deutsche Bank built a Vertex AI pipeline (using Google’s Gemini models)
to automate document processing: it now
processes thousands of internal documents daily with 97% accuracy, cutting handling time by
40% (blog.google).
Banks also use LLMs for compliance and risk analysis – for example, JPMorgan Chase developed
“DocLLM,” an LLM tailored to interpret complex loan and lease documents by incorporating layout
and domain knowledge (ankursnewsletter.com).
On the customer-facing side, institutions like CaixaBank are deploying generative chat
assistants (using partners like Salesforce Agentforce AI) to guide users through onboarding and
support, powered by domain-specific LLM agents (fintechfutures.com).
Behind the scenes, core banking systems are being rehosted on AI-ready clouds: Lloyds Banking
Group migrated dozens of ML and GenAI models to Google Cloud Vertex AI in Q1 2025, enabling
rapid iteration of new AI services (fintechfutures.com).
In each of these cases, the common need is a secure, well-architected data pipeline: from
transaction logs and client records to market data and knowledge bases – so that LLMs operate on
the latest, highest-quality information.
Data Providers
and Market Data Firms. Companies that sell financial data are also embedding AI
capabilities into their platforms. Thomson Reuters and Bloomberg, for example, are exploring
ways to offer LLM-powered analytics to clients (e.g. summarizing regulatory filings or
explaining earnings trends). Open initiatives like FinGPT are even attempting to democratize
these capabilities: FinGPT provides curated financial datasets and fine-tuned models so that
smaller firms can build custom financial assistants without requiring privileged data
access (github.com).
Meanwhile, third-party platforms specialize in integrating market data into AI pipelines. For
example, Shakudo offers a self-hosted “AI platform for funds” that connects to data sources like
EDGAR and CapIQ and builds ETL pipelines (using Airbyte/Dagster/Prefect) into a vectorized RAG
system (shakudo.io).
This lets a hedge fund quickly spin up LLM agents that can answer questions over both public
filings and proprietary research, all within a private environment.
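Here is a hedged sketch of what one orchestrated step of such an ETL-to-RAG pipeline might look like using Dagster’s software-defined assets (one of the orchestrators named above); the source, stubbed data, and placeholder embedding are illustrative, not Shakudo’s actual pipeline.

```python
# Hedged sketch of an orchestrated ETL step in the style described above,
# using Dagster software-defined assets. The EDGAR source is stubbed and the
# embedding is a placeholder; this is not Shakudo's actual pipeline.
from dagster import Definitions, asset  # pip install dagster

@asset
def edgar_filings() -> list[str]:
    """Pull the latest filings (stubbed here with static text)."""
    return ["10-K excerpt: revenue grew 12% year over year..."]

@asset
def filing_embeddings(edgar_filings: list[str]) -> list[list[float]]:
    """Embed each filing chunk for the vector store (placeholder vectors)."""
    return [[float(len(text))] for text in edgar_filings]

defs = Definitions(assets=[edgar_filings, filing_embeddings])
# `dagster dev` would materialize these assets on a schedule, feeding the
# vector index that the fund's RAG agents query.
```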
Recent Partnerships, Launches and Deals
The past quarter has seen brisk activity in AI data
infrastructure for finance. In March 2025,
Databricks announced a five-year partnership with Anthropic: Claude 3.7 (its latest LLM) will
be available natively on Databricks, integrated with the Mosaic AI stack (databricks.com).
This lets enterprises build Claude-based AI agents over their own data with Databricks’
governance and RAG tooling. In April 2025,
Lloyds Banking Group and Google Cloud jointly showcased Lloyds’ transition of hundreds of models
to Vertex AI – a move enabling large-scale AI workloads on the bank’s data (fintechfutures.com).
CaixaBank also announced a collaboration with Salesforce to deploy generative AI assistants in
its apps and call centers (fintechfutures.com).
Within Snowflake’s ecosystem, an analyst report highlighted that Snowflake delivered 125+ new AI
features in Q1 2025 (including the SnowConvert AI migration tool) and is expected to make Cortex
AISQL (SQL-driven genAI) generally available by the June Summit (ainvest.com).
While large partnerships grab headlines, fintech M&A
remains active too. A recent industry analysis found that AI startups (including LLM and data
intelligence vendors) are commanding very high valuation multiples (median EV/Revenue >25x)
in 2025 (finrofca.com).
Though not specific to finance, this underscores investor enthusiasm in any company helping
enterprises harness generative AI – including the fintech data stack. (For example, in 2023
Databricks’ $1.3B acquisition of MosaicML signaled the value of specialized LLM training tools
in the data space (techcrunch.com).)
On the regulatory front, exchanges and authorities are also preparing; for instance the UK’s
Financial Conduct Authority is planning “live testing” of AI systems in finance, reflecting the
industry push to operationalize these technologies.
Conclusion
In sum, financial firms face a new imperative: build
real-time, high-fidelity data pipelines to
feed LLMs safely and effectively. This requires combining best-of-breed data engineering (for
streams, lakes and feature stores) with LLM-specific components (vector indices, retrieval
layers, and prompt managers). Enterprise platforms like Snowflake Cortex and Databricks Mosaic
AI are emerging to provide turnkey solutions under one roof, while open-source ecosystems
(FinRL, FinGPT, Hugging Face, etc.) give teams flexibility to prototype and customize. Across
hedge funds, banks and data vendors, early use cases (from algorithmic trading to document
compliance to customer chatbots) are already demonstrating substantial ROI. The recent flurry of
partnerships and product launches shows that incumbents and startups alike recognize the
opportunity: the firms that can most rapidly integrate clean, live financial data with powerful
LLMs will gain a distinct advantage in research, risk, and client service.
Sources: Industry reports and vendor press
releases from Q4 2024–Q2 2025 (pacificdataintegrators.com, arxiv.org, shakudo.io, snowflake.com, databricks.com, investmentnews.com, fintechfutures.com),
among others, have been cited throughout. These include Snowflake and Databricks documentation,
case studies, and news articles detailing the above developments.