IBM Storage Ceph Positioned as the Ideal Foundation for Modern Data Lakehouses

T0@st · Feb 2, 2024

It's been one year since IBM integrated Red Hat storage product roadmaps and teams into IBM Storage. In that time, organizations have faced unprecedented data challenges in scaling AI, driven by the rapid growth of data in more locations and formats, but with poorer quality. Helping clients combat this problem has meant modernizing their infrastructure with cutting-edge solutions as a part of their digital transformations. Largely, this involves delivering consistent application and data storage across on-premises and cloud environments. Also, crucially, this includes helping clients adopt cloud-native architectures to realize the benefits of public cloud like cost, speed, and elasticity. IBM Storage Ceph (formerly Red Hat Ceph), a state-of-the-art open-source software-defined storage platform, is a keystone in this effort.

Software-defined storage (SDS) has emerged as a transformative force when it comes to data management, offering a host of advantages over traditional legacy storage arrays, including extreme flexibility and scalability that are well-suited to handle modern use cases like generative AI. With IBM Storage Ceph, storage resources are abstracted from the underlying hardware, allowing for dynamic allocation and efficient utilization of data storage. This flexibility not only simplifies management but also enhances agility in adapting to evolving business needs and scaling compute and capacity as new workloads are introduced. This self-healing and self-managing platform is designed to deliver unified file, block, and object storage services at scale on industry-standard hardware. Unified storage helps provide clients a bridge from legacy applications running on independent file or block storage to a common platform that includes those and object storage in a single appliance.

Ceph is optimized for large single and multisite deployments and can efficiently scale to support hundreds of petabytes of data and tens of billions of objects, which is key for traditional and newer generative AI workloads. The scalability, resiliency, and security of IBM Storage Ceph make it ideal to support data lakehouse and AI/ML open-source frameworks, in addition to more traditional workloads such as MySQL and MongoDB on Red Hat OpenShift or Red Hat OpenStack. It's one reason why 768 TiB raw capacity of IBM Storage Ceph is included in watsonx.data, IBM's open, governed, fit-for-purpose data lakehouse architecture optimized for data, analytics, and AI workloads.

The Right-fit Foundation for Compute-Intensive and Data-Intensive Workloads
The explosive growth of unstructured data and generative AI share a symbiotic relationship, each influencing and benefiting the other. In its Top Trends in Enterprise Data Storage 2023 report, Gartner states that "by 2028, large enterprises will triple their unstructured data capacity across their on premises, edge and public cloud locations, compared to mid-2023." The proliferation of unstructured data, such as text, images, and videos, provides a vast and diverse source for training generative AI models. In turn, generative AI assists in making sense of and extracting valuable insights from the ever-expanding pool of unstructured data. This synergy results in a feedback loop where generative AI thrives on the abundance of unstructured data, and the continuous generation of realistic data by AI further enriches and refines your understanding of unstructured datasets, fostering innovation and advancements.

"By 2028, 70% of file and object data will be deployed on a consolidated unstructured data storage platform, up from 35% in early 2023," according to the same Gartner report. Organizations, therefore, need a storage management solution capable of accelerated data ingest, data cleansing and classification, metadata management and augmentation, and cloud-scale capacity management and deployment, such as software-defined storage. IBM Storage Ceph scales out seamlessly to meet these growing data demands. Its self-managing capabilities ensure that the system continuously adapts to constantly changing conditions, making the solution hassle-free while easily maintaining data integrity.

To accelerate and scale the impact of data and AI across an organization, and ultimately improve business outcomes, companies must be hybrid by design. This includes the ability to consume storage services on-prem with a cloud-native operating model to address issues such as the need for enterprise feature sets unavailable on public cloud, data sovereignty considerations, and cost. The plug-and-play architecture of IBM Storage Ceph simplifies integration with existing infrastructures, including various platforms, cloud environments, hypervisors, open table and file formats like Apache Iceberg or Apache Parquet, and complete solution stacks such as watsonx.ai, watsonx.data, and others. New nodes or devices can be added to the cluster seamlessly, without disruption or service downtime. It delivers an easy and efficient way for clients to build a data lakehouse with watsonx.data and other next-generation AI workloads.

"At Snap, our requirement to store more and more data continues to expand, and we need a platform that can scale quickly, satisfy our performance KPIs, and be cost effective all at the same time. IBM Storage Ceph is the platform of choice with its simple scalable architecture, easy to manage interface, and cost-effective software-defined deployment. Having world-class expertise and support from IBM is another important part of our decision to use IBM Storage Ceph for such a critical component of our business." -- Snap Inc

Fast Data Access with NVMe over TCP
In the last year, IBM has introduced several important updates to Ceph, including, most recently, IBM Storage Ceph 7.0. This next-generation Ceph platform prepares for NVMe/TCP capabilities, which are designed to enable faster data transfer between storage devices, servers, and cloud platforms by retaining the low latency and high bandwidth characteristics of traditional NVMe. This makes it suitable for applications that demand ultra-fast storage access, such as databases, analytics, and content delivery, and it simplifies the infrastructure because it runs over the standard Ethernet and TCP/IP networks that organizations have already invested in. These benefits will help clients adopt a software-defined approach designed to deliver a cloud-like experience in terms of speed, agility, and economics.

NVMe/TCP can help Ceph bridge the gap for traditional block storage with scale-out architectures. With NVMe/TCP, Ceph will be designed to integrate with platforms like VMware to help enterprises replicate cloud architectures in their own data center, moving away from expensive and rigid SAN networks and monolithic storage arrays.

Additional new features included in Ceph 7.0:

SEC and FINRA compliance certification for WORM with object lock, enabling WORM compliance for object storage
NFS support for CephFS filesystem access for non-native Ceph clients

For more details on features, visit the IBM Storage community.

Cloud Economies of Scale with IBM Storage Ceph
Because IBM Storage Ceph stores data as objects within logical storage pools, a single cluster can have multiple pools, each tuned to different performance or capacity requirements. This allows clients to benefit from easier and faster access to data with content and context classifications, storage capacity limited only by the size of an organization's infrastructure, and cost reductions at scale by removing hardware restrictions compared to traditional and legacy storage array architectures.
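To make the idea of differently tuned pools concrete, here is a minimal sketch of how the data protection scheme chosen for a pool changes the usable capacity it gets out of raw disk. It is not IBM's or Ceph's API: the class names, pool names, and the split of raw capacity between pools are all hypothetical, and it ignores real-world factors such as fill limits and metadata overhead. It only assumes the standard arithmetic that a replicated pool keeping N copies yields 1/N of raw capacity, and an erasure-coded pool with k data chunks and m coding chunks yields k/(k+m).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicatedPool:
    """Hypothetical model of a pool that stores full copies of each object."""
    name: str
    raw_tib: float     # raw capacity notionally consumed by this pool (assumed)
    replicas: int = 3  # number of full copies kept

    def usable_tib(self) -> float:
        # Every object is stored `replicas` times, so usable = raw / replicas.
        return self.raw_tib / self.replicas


@dataclass(frozen=True)
class ErasureCodedPool:
    """Hypothetical model of a pool that splits objects into k data + m coding chunks."""
    name: str
    raw_tib: float
    k: int = 4  # data chunks
    m: int = 2  # coding chunks

    def usable_tib(self) -> float:
        # Only k of every k + m chunks hold user data.
        return self.raw_tib * self.k / (self.k + self.m)


# Illustrative split of 768 TiB raw between a performance pool and a capacity pool.
pools = [
    ReplicatedPool("fast-block", raw_tib=192.0),
    ErasureCodedPool("lakehouse-objects", raw_tib=576.0),
]

total_usable = sum(p.usable_tib() for p in pools)

for p in pools:
    print(f"{p.name}: {p.raw_tib:.0f} TiB raw, {p.usable_tib():.0f} TiB usable")
print(f"total usable: {total_usable:.0f} TiB")
```

In this made-up split, the replicated pool trades capacity for simpler, faster recovery, while the erasure-coded pool keeps more of the raw capacity usable for bulk object data. That is the kind of per-pool tuning the paragraph above describes.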

Faster Time to Value
IBM has also made deployment for Ceph easier than ever before. With IBM Storage Ready Nodes for Ceph, the platform can be deployed as a complete software and hardware solution and comes in a variety of different capacity configurations optimized for running IBM Storage Ceph workloads. We've taken all the guesswork out of configuration, making it easier to digest, configure, and administer.

The growth of IBM Storage Ceph is just another example of how IBM's storage hardware and software portfolio helps provide faster time to value with scaled capacity and performance to optimize costs for clients.


user556 · Feb 3, 2024

Cloud computing - How to get raped the most.

Ferrum Master · Feb 3, 2024

Prophet, we have a problem.

Philaphlous · Feb 3, 2024

In all my years in corporate and data analysis I've never heard it called a "data lakehouse"... everyone just refers to it as a "datalake"

Lomskij · Feb 3, 2024

Philaphlous said:
In all my years in corporate and data analysis I've never heard it called a "data lakehouse"... everyone just refers to it as a "datalake"

I think it's a rather cringy name for data lake + data warehouse. Considering that typical architecture would have a warehouse on top of the data lake to store the transformed data anyway, "lakehouse" seems to be just a marketing gimmick...

Dr. Dro · Feb 3, 2024

Ferrum Master said:
Prophet, we have a problem.

The same thing came to my mind when I read this last night lmao

Random_User · Feb 3, 2024

Dr. Dro said:
The same thing came to my mind when I read this last night lmao

Indeed. But also, for a moment it appeared for me as "Leakhouses". lol

Might sell the data to C.E.L.L. as well.

