File System Primer
Linux offers a number of file systems. This paper discusses these file systems, why there are so many, and which ones are best suited to which workloads and data.
Not all data is the same. Not all workloads are the same. Not all file systems are the same. Matching the file system to the data and workload allows customers to build efficient, scalable, and cost-effective solutions. The next section of this document describes four general workload areas. It is important to understand these different workloads and their requirements, as these drive requirements into file systems. This will also serve as a guide in comparing and contrasting the various file systems available in the market today.
IT organizations typically divide workloads into four areas:
It is important to understand the difference between File Systems and File Access Protocols. Both apply to the general concept of "File Systems", but for the purposes of this document, the distinction is as follows:
File Systems: Control the organization of data on storage media. File System software can be viewed as a filing cabinet which provides a structured container into which data is organized and stored. File Systems do NOT include File Access Protocols.
File Access Protocols: Control the semantics of allowing remote network access to data stored in file systems. File Access Protocols typically have dependencies on File System features (there is a match between File System Semantics and File Protocol Semantics).
It is extremely important to understand the priority of needs between each of these general workloads, as this drives the requirements for High Availability, File systems, File Access, and Volume Management Storage throughout the IT organization.
HA, File system, File Access and Storage requirements per workload:
There are three main reasons why there are so many File Systems on Linux:
Open source means anyone can contribute their value, and they have. This has made available about 20 different file systems for Linux, ranging from very rudimentary, simple file systems to extremely complex and rich file systems.
As storage needs have grown, there has been the need for increasing scalability in file systems. This second reason has led to file systems which claim to run faster, handle more files, scale to larger volumes, and handle more concurrent access to data.
Lastly, as mainframe and minicomputer systems have given way to less expensive Intel Architecture based commodity PC servers running Linux, and as users have moved from non-Linux PC operating systems to Linux, the need to preserve access to existing data stored on those other systems has resulted in additional file systems which understand that data and storage.
The following list describes the characteristics of each Linux file system and indicates when it is best used. This list is not exhaustive of all the file systems available in the world, but focuses on those which have appreciable market share or attention in the market today. A detailed comparison of file system features can be found at: http://en.wikipedia.org/wiki/Comparison_of_file_systems and Linux Data Management and High Availability Features
EXT2
The EXT2 file system is the predecessor to the EXT3 file system. EXT2 is not journaled, and hence is no longer recommended (customers should move to EXT3).
EXT3
The EXT3 file system is a journaled file system that has the greatest use in Linux today. It is the "Linux" file system. It is quite robust and quick, although it does not scale well to large volumes or to a great number of files. Recently a scalability feature called htrees was added, which significantly improved EXT3's scalability. However, even with htrees it is still not as scalable as some of the other file systems listed. With htrees, it scales similarly to NTFS. Without htrees, EXT3 does not handle more than about 5,000 files in a directory.
FAT32
FAT32 is the crudest of the file systems listed. Its popularity stems from its widespread use in the Windows desktop world and from having become the file system of flash memory devices (digital cameras, USB memory sticks, etc.). It has no built-in security access control, so it is small and works well in these portable and embedded applications. It scales the least of the file systems listed. Most systems have FAT32 compatibility support due to its ubiquity.
GFS
The Red Hat Global File System (from the Sistina acquisition) was open sourced in mid 2004. It is a symmetrical parallel cluster file system which allows multiple machines to access common data on a SAN (Storage Area Network). This is important for allowing multiple machines access to the same data to ease management (such as common configuration files between multiple web servers). It also allows applications and services which are written for direct disk access to be scaled out to multiple nodes. The practical limit, however, is 16 machines in a SAN cluster.
GPFS
The IBM General Parallel File System, like GFS, is a parallel cluster file system, and its characteristics are similar to those of GFS. Video editing is the sweet spot for GPFS. GPFS supports from 2 to thousands of nodes in a single cluster. GPFS also includes very rich management features, such as Hierarchical Storage Management.
JFS
The IBM Journaled File System is the file system used by IBM in AIX and OS/2. It is a feature-rich file system ported to Linux to allow for ease of migration of existing data. It has been shown to provide excellent overall performance across a variety of workloads.
NSS
The Novell Storage Services file system is used in NetWare 5.0 and above, and was most recently open sourced and included in Novell SUSE's SLES 9 SP1 Linux distribution and later (it is used in Novell's Open Enterprise Server Linux product). The NSS file system is unique in many ways, mostly in its ability to manage and support shared file services from several different file access protocols simultaneously. It is designed to manage access control in enterprise file sharing environments, using a unique model called the Trustee Model that scales to hundreds of thousands of different users accessing the same storage securely. It and its predecessor (NWFS) are the only file systems that can restrict the visibility of the directory tree based on the UserID accessing the file system. It and NWFS have built-in ACL rights inheritance. It includes mature and robust features tailored for the file sharing environment of the largest enterprises. The file system also scales to millions of files in a single directory. NSS supports multiple data streams and rich metadata (its features are a superset of existing file systems on the market for data stream, metadata, namespace, and attribute support).
NTFS
The Microsoft Windows file system for the Windows NT kernel (Windows NT, Windows 2000, Windows XP, and Windows 2003). The Linux Open Source version of this file system is capable only of read-only access to existing NTFS data. This allows for migration from Windows and access to Windows disks. NTFS includes an ACL model which is not POSIX. The NTFS ACL model is unique to Microsoft, but is a derivative of the Novell NetWare 2.x ACL model. NTFS is the default (and virtually the only option) on Windows servers. It includes rich metadata and attribute features. NTFS has also supported multiple data streams and ACL rights inheritance since its Windows 2000 implementation. In Windows 2003 R2, Microsoft included a feature called "Access Based Enumeration". This is similar to visibility in NSS and NWFS, but it is implemented not in the file system layer but as a feature of the CIFS protocol engine in Windows 2003 R2, so this feature is only available when accessing Windows 2003 via the CIFS protocol. See CIFS below.
NWFS
The NetWare [traditional] File System is used in NetWare 3.x through 5.x as the default file system, and is supported in NetWare 6.x for compatibility. It is one of the fastest file systems on the planet; however, it does not scale, nor is it journaled. An Open Source version of this file system is available on Linux to allow access to its file data. However, the OSS version lacks the identity management tie-ins, so it has found little utility. Customers of NWFS are encouraged to upgrade to NSS.
OCFS2
The Oracle Cluster File System v2 is a symmetrical parallel cluster file system specifically designed to support the Oracle Real Application Clusters (RAC) Database. While it supports general file access, it does not scale in number of files (like EXT3 without htrees). It is the first (and so far only) symmetrical parallel cluster file system to be accepted into the Linux Mainline Kernel (January 2006).
PolyServe Matrix Server
Matrix Server is a symmetrical parallel cluster file system for Linux (PolyServe has a version for Windows servers as well). Rooted in technology from Sequent Computers, Matrix Server is the premier parallel cluster file system on Linux today. It boasts an order-of-magnitude performance advantage over competing parallel cluster file systems (GFS, GPFS, OCFS2, etc.). It should be used when parallel cluster file system scaling is needed.
ReiserFS
The Reiser File System is the default file system in SUSE Linux distributions. ReiserFS was designed to remove the scalability and performance limitations that exist in the EXT2 and EXT3 file systems. It scales and performs extremely well on Linux, outscaling EXT3 with htrees. In addition, Reiser was designed to use disk space very efficiently. As a result, it is the best file system on Linux where there are a great number of small files in the file system. As collaboration (email) and many web serving applications have lots of small files, Reiser is best suited for these types of workloads.
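The disk space point can be made concrete with some arithmetic. The sketch below is an illustration under stated assumptions, not a measurement of ReiserFS or any other file system: it assumes a hypothetical 4096-byte allocation block and compares whole-block allocation with an idealized layout in which small files are packed end to end. The function names and the file sizes are invented for the example.

```python
import math

BLOCK_SIZE = 4096  # assumed allocation unit in bytes (hypothetical)


def whole_block_usage(file_sizes, block_size=BLOCK_SIZE):
    """Bytes consumed when every file occupies a whole number of blocks."""
    return sum(math.ceil(size / block_size) * block_size for size in file_sizes)


def packed_usage(file_sizes, block_size=BLOCK_SIZE):
    """Bytes consumed in an idealized layout where file data is packed end to end."""
    return math.ceil(sum(file_sizes) / block_size) * block_size


# 10,000 hypothetical mail messages of 600 bytes each
sizes = [600] * 10_000
print(whole_block_usage(sizes))  # 40960000
print(packed_usage(sizes))       # 6000640
```

Under these assumptions the whole-block layout spends almost seven times as much space on the same data, which is why space efficiency matters most when files are small and numerous.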
VxFS
The Veritas File System is closed source. The Veritas full storage suite is essentially the Veritas File System that is popular on Unix (including Solaris). Approximately 70% of Unix deployments in the world are on top of the Veritas File System. As a result, this file system is one of the best to use when data is to be directly migrated from Unix to Linux, and when training in volume and file system management is to be preserved within the IT staff. The Veritas File System has excellent scalability characteristics, just as it has on Unix systems. Veritas has recently ported their cluster version of VxFS to Linux. Their parallel cluster file system (cVxFS) is an asymmetric model, where one node is the master, and all other nodes are effectively read-only slaves (they can write through the master node).
XFS
The XFS file system is Open Source and included in major Linux distributions. It originated from SGI (Irix) and was designed specifically for large files and large volume scalability. Video and multimedia files are best handled by this file system. Scaling to petabyte volumes, it also handles very large amounts of data. It is one of the few file systems on Linux which supports Data Migration (SGI contributed the Hierarchical Storage Management interfaces into the Linux Kernel a number of years ago). SGI also offers a closed source parallel cluster version of XFS called cXFS which, like cVxFS, is an asymmetrical model. It has the unique feature, however, that its slave nodes can run on Unix, Linux and Windows, making it a cross-platform file system. Its master node must run on SGI hardware.
There are fewer file access protocols than file systems, and their capabilities vary more widely than those of file systems do. For the purposes of this discussion, only the popular file access protocols in production in the market will be discussed.
AFP
The Apple Filing Protocol. Specifically designed and developed by Apple for Macintosh networking (originally AppleTalk over phone wire hardware; since 1997, TCP/IP over any hardware medium that supports it). This protocol is the best for supporting Apple's MacOS desktop machines in a network. The specification for this protocol is openly available from Apple. The Netatalk modules in Linux implement the AFP protocol (and still implement the AppleTalk transport, even though Apple has ended support for the AppleTalk transport in favor of TCP/IP). The AFPD module in the Netatalk package can use either TCP/IP or AppleTalk as a transport.
CIFS
The term CIFS, meaning "Common Internet File System", was coined by Microsoft when it first introduced the workstation peer-to-peer file sharing protocol verbs to the open community. Subsequent protocol verbs have been held proprietary and include increased richness and management. CIFS (as implemented in Windows 2003) includes not only File Access verbs, but a whole suite of management verbs and other protocols that are used by Windows servers and client desktops. The CIFS protocol originally operated over the NetBEUI network protocol, and tunneling through TCP/IP was added in the early 1990s. In 2000, Microsoft introduced native TCP/IP support for CIFS. Microsoft recently introduced an option into Release 2 of Windows Server 2003 called "Access Based Enumeration". When enabled, this feature restricts sub-directory visibility: users can only see the subdirectories to which they have rights, and all others are hidden from them. This increases security. This feature is enabled per network Share on the Windows 2003 server. The full client desktop protocol suite specifications are available for a royalty license from Microsoft (the MCPP). For Linux, the Samba team has developed an OSS version of CIFS based on reverse engineering of the wire protocol of Microsoft Windows machines.
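The visibility behavior described above can be sketched as a filter applied to a directory listing before it is returned to the client. This is a toy model for illustration only: the function name, the share contents, and the rights table are all hypothetical, and it makes no claim about how Windows or any CIFS server actually implements the feature.

```python
def visible_entries(entries, user, rights):
    """Return only the subdirectories on which the user holds some right.

    entries: subdirectory names in one share (hypothetical data)
    rights:  dict mapping subdirectory name -> set of user names with access
    """
    return [name for name in entries if user in rights.get(name, set())]


share = ["engineering", "finance", "public"]
rights = {
    "engineering": {"alice"},
    "finance": {"bob"},
    "public": {"alice", "bob"},
}

print(visible_entries(share, "alice", rights))  # ['engineering', 'public']
print(visible_entries(share, "carol", rights))  # []
```

Because the filter sits in the listing path of the protocol server, a client that reaches the same storage by another route would not be subject to it, which is the limitation the text notes for a protocol-level implementation.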
FTP
File Transfer Protocol is one of the most common and widely used simple protocols on the internet today. Virtually all platforms and devices support FTP to some level. FTP is a very simple protocol allowing for uploading and downloading of files. There's no richness for sharing (locking, coordination, contention, etc.) in the protocol. FTP is used broadly for transferring files. The specification is openly available via the IETF.
HTTP
Hyper Text Transfer Protocol is the dominant protocol on the World Wide Web today, and is the one spoken by web browser clients and web servers. Like FTP, it is not rich, and is designed strictly for transfers of HTML (Hyper Text Markup Language). It also transports additional Markup Languages that have been invented, such as XML (eXtensible Markup Language). The specifications are all openly available via the IETF.
Lustre
Lustre is a unique distributed client server protocol. It specifically breaks the functions of a file system up at the protocol layer in order to gain huge scalability for great numbers of files and for very large files (like seismic data for petroleum exploration). Lustre is specifically tied to the Linux EXT3 file system for disk storage, but it effectively builds a very large virtual file system out of many nodes in the cluster. Some nodes are dedicated to holding metadata, others are dedicated to holding specific parts of the greater virtual file system. This is required by HPC clusters in order to allow performant access by thousands of compute nodes to up to petabytes of data simultaneously. Lustre is the dominant file system used in HPC clusters today. Cluster File Systems Inc. builds and maintains Lustre. Previously, they would open source only the older version and keep the current version closed source. Cluster File Systems Inc. is changing this approach, looking to put the most recent version into Open Source, and hopes to have it accepted into the Linux Mainline Kernel soon.
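One simple way to divide a large virtual file among data nodes is round-robin striping, sketched below. This is a minimal illustration under assumptions, not Lustre's actual layout logic or API: the stripe size, the node names, and the `locate` function are hypothetical. In a real deployment a metadata node would record which data nodes hold a given file; here that list is simply passed in.

```python
STRIPE_SIZE = 1 << 20  # assumed stripe unit of 1 MiB (hypothetical)


def locate(offset, data_nodes, stripe_size=STRIPE_SIZE):
    """Map a byte offset in the virtual file to (data node, offset on that node)."""
    stripe_index = offset // stripe_size
    node = data_nodes[stripe_index % len(data_nodes)]
    # Each full pass over all nodes advances the per-node offset by one stripe.
    local_offset = (stripe_index // len(data_nodes)) * stripe_size + offset % stripe_size
    return node, local_offset


nodes = ["data-node-0", "data-node-1", "data-node-2"]
print(locate(0, nodes))                    # ('data-node-0', 0)
print(locate(STRIPE_SIZE, nodes))          # ('data-node-1', 0)
print(locate(3 * STRIPE_SIZE + 5, nodes))  # ('data-node-0', 1048581)
```

With a layout like this, consecutive stripes of one file live on different nodes, so many compute nodes can read different parts of the same file without contending for a single server.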
NCP
The NetWare Core Protocol is the client server protocol developed by Novell for supporting DOS, Windows, OS/2, Macintosh, Unix (UnixWare), and Linux for shared file services over Novell's history. It is a very rich file protocol, as it supports the semantics of all of these native operating systems. In the new decade, Novell has reduced active support to Windows and Linux desktops with the NetWare client, as well as to the Xtier server for middle tier file access. Originally supported only over the IPX network protocol, in 1993 Novell tunneled NCP over IPX through TCP/IP. In 1998 Novell added native support for the TCP/IP protocol. Novell has added NCP support to Linux desktops in order to allow the new Novell Linux Desktop to interoperate with the installed base of NetWare servers, and to expose unique capabilities of NetWare to Linux desktops. As part of Open Enterprise Server, Novell is also supporting NCP on Linux servers to allow desktops running the Novell client to access data residing on Linux. The NCP Server on Linux includes emulation for the Trustee rights model and inheritance plus visibility when run over traditional POSIX file systems (such as EXT3, Reiser, etc.). When run over NSS on Linux, these capabilities are synchronized with the NSS file system. Visibility in this mode is implemented much like Microsoft's Windows 2003 R2 "Access Based Enumeration": in the file access protocol and not the file system. The specification for this protocol is openly available from Novell.
NFS v3
Network File System version 3 was introduced as a standard via the IETF by Sun Microsystems in the mid 1990s. NFS v3, unlike the other file access protocols, is an exported file system. This means that access and security are enforced at the NFS client, and not the NFS server. As a result, NFS is easily hacked if not on a dedicated secure network. NFS v3 is a stateless protocol like HTTP and FTP, so performance suffers, since it must assert current state with each operation (for example, it does not define Open and Close file, only Read and Write). File locking was added with sideband protocols, but is only advisory in nature (not hard enforced, meaning it can be circumvented on a network). NFS has found its niche as the distributed exported file system protocol used inside the confines of a physical data center, connecting application servers and databases to storage. It has also seen use in smaller Unix and Linux based workgroups where security between users is not an issue. Various RFCs in the IETF define NFS. Therefore, its specifications are freely available via the IETF.
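What "advisory" means can be shown on a local file. The sketch below is an illustration only: it uses `flock` from Python's `fcntl` module on a temporary local file on a Unix-like system, not the NFS sideband lock protocol, and the function name is invented. One descriptor takes an exclusive lock, and a second descriptor that never asks for the lock writes to the file anyway.

```python
import fcntl
import os
import tempfile


def write_despite_lock():
    """Hold an exclusive advisory lock, then write through a second
    descriptor that never requests the lock. Returns the file contents."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        path = tmp.name
    try:
        holder = os.open(path, os.O_RDWR)
        fcntl.flock(holder, fcntl.LOCK_EX)     # cooperating party takes the lock
        intruder = os.open(path, os.O_WRONLY)  # non-cooperating writer skips locking
        os.write(intruder, b"written without the lock")
        os.close(intruder)
        fcntl.flock(holder, fcntl.LOCK_UN)
        os.close(holder)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.unlink(path)


print(write_despite_lock())
```

The write succeeds because nothing checks the lock on the write path; the lock only protects parties that agree to request it before touching the file.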
NFS v4
In order to address the security issues of NFS v3, as well as define a network protocol specification that can handle future needs, the NFS v4 specification was proposed to the IETF. The effort was led by Sun and Network Appliance, with other vendors joining in. The specification was approved in late 2003, and then issues discovered during initial implementations resulted in updated RFCs bringing the specification effectively to v4.1. NFS v4 defines an extensible and rich set of file access verbs. The protocol is a shared file protocol, unlike NFS v3, so it is secure. It also specifies advanced features for Remote Direct Memory Access, Delegations (equivalent to opportunistic locking), extensible rich metadata, and access naming. NFS v4 is currently a work in development, as it is very new in the industry, but holds great promise. 2006 will see the first commercial Linux offerings of NFS v4. NFS v4 requires Kerberos v5 authentication, but will also support other authentication methods supported under GSSAPI RFCs. Authentication of some form is mandatory, as security and access control are enforced at the Server for NFS v4. In summary, NFS v4 is the next key file access protocol based on industry standards to come.
In reading this document, it should become apparent that there does not exist an overall general purpose file system and file access protocol. Picking the right file system for the data and the applications creating and accessing that data is what is important. This section lays out some guidelines for picking and building the right file system for a given workload.
GroupWise, Notes, Exchange and other email/collaboration solutions typically deal with lots of little files. Since only the application process is accessing the file system, the added overhead of rich ACL and file attributes found in NSS or NTFS is redundant. The characteristics needed are a file system whose performance remains relatively constant regardless of the number of files that are in the volume, and that performs well with small files.
Best bets would be ReiserFS, XFS, NSS and VxFS. File systems to stay away from for large systems (where you'd have more than 10,000 files in the system) would be EXT2/3, NWFS, and FAT32. If you are on a Windows system, you are pretty much stuck with NTFS. NTFS scales better than EXT2/3, NWFS, and FAT32, but not as well as the recommended list, so it works well with medium sized systems.
MySQL, Oracle, MS SQL Server, Progress, etc. typically deal with a very few, very large files which are left open almost all of the time. The best file systems for Databases are those which know how to "get out of the way". Virtually any file system with Direct IO capabilities (APIs that allow the database to directly manipulate the file buffers) can be used. Since Databases do not create many files, file systems which do not scale to many files, but still have Direct IO interfaces, will work fine. Essentially, the only file systems to stay away from are FAT32 and those whose support has been discontinued. Since Databases don't need the added access control features, NSS and NTFS don't have any inherent added benefits for them. VxFS, Reiser, EXT3, and XFS are all recommended file systems for Databases (your Database vendor may specify a file system they have tested with; if so, go with that one, since they will know how to support it). MS SQL Server is again tied to NTFS (NTFS does have Direct IO capabilities that MS SQL Server leverages).
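On Linux, direct I/O is commonly requested by passing the `O_DIRECT` flag to `open`, which asks the kernel to bypass its page cache. The sketch below is a hedged illustration, not how any particular database does it: the function name is invented, the 4096-byte alignment is an assumption (real devices may require a different value), and the function returns `None` when the platform or the file system does not accept `O_DIRECT`.

```python
import mmap
import os

BLOCK = 4096  # assumed buffer size and alignment (hypothetical)


def direct_read_first_block(path):
    """Read up to BLOCK bytes from the start of path using O_DIRECT.

    Returns the bytes read, or None if O_DIRECT is unavailable or refused.
    """
    flag = getattr(os, "O_DIRECT", None)
    if flag is None:
        return None  # this platform does not expose O_DIRECT
    try:
        fd = os.open(path, os.O_RDONLY | flag)
    except OSError:
        return None  # the file system rejected O_DIRECT
    try:
        buf = mmap.mmap(-1, BLOCK)  # anonymous mapping gives a page-aligned buffer
        try:
            count = os.readv(fd, [buf])
            return bytes(buf[:count])
        except OSError:
            return None  # alignment or file system restriction
        finally:
            buf.close()
    finally:
        os.close(fd)
```

The aligned buffer is the important detail: direct I/O generally requires the memory address, the file offset, and the transfer length to be multiples of the underlying block size, which is why an ordinary `bytes` object is not used here.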
Web services can encompass a broad set of workloads. For simple web services, one can use virtually any file system. Since these typically don't need rich access control file systems, you can avoid the extra overhead of NTFS or NSS to squeeze out a few more percentage points in performance. However, if the web services solution leverages identity and requires security between users for many people (more than 50 accounts), then the management advantages for access control and security begin to outweigh the small system performance gains, and NSS or NTFS begin to be better choices. Even complex web services solutions typically do not require the file system scalability that Collaboration applications require (unless it is a web services based collaboration package). Online merchandising sites typically utilize a relational database as the datastore, and in those cases, you would choose a file system to support your database.
Generally there are two types of NAS use cases: serving files to application servers in a tiered service oriented architecture (SOA), and serving files to end user desktops and workstations. The former has minimal access control requirements. The latter has quite heavy access control requirements. Typically for serving files to application servers (traditional NAS), one would choose a file system that is scalable and fast. Reiser, XFS, and VxFS come to mind for NFS file serving. For file serving to end user workstations, the access control and security management capabilities of the NSS and NTFS file systems with the CIFS and NCP file access protocols begin to become important. NSS's model does better than NTFS for very large numbers of users. These two file systems allow for security between users and at the same time allow for very fine-grained sharing between given users and groups. NSS includes a visibility feature implemented in the file system which prevents unauthorized users from even seeing subdirectory structures they don't have rights to. CIFS in Windows 2003 R2 includes a similar visibility feature called "Access Based Enumeration"; however, it is implemented in the file access protocol, not the NTFS file system, so it is only available when accessing the file system via CIFS (that is, through traditional Microsoft network Shares).
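The workload guidance above can be condensed into a small lookup table. The sketch below only restates the recommendations given in this section; the dictionary layout, the workload labels, and the `suggest` function are illustrative choices, not part of any tool. Simple web services are omitted because the text says virtually any file system will do.

```python
# Recommendations as stated in this section, keyed by workload.
RECOMMENDED = {
    "collaboration": ["ReiserFS", "XFS", "NSS", "VxFS"],
    "database": ["VxFS", "ReiserFS", "EXT3", "XFS"],
    "file serving to application servers": ["ReiserFS", "XFS", "VxFS"],
    "file serving to end users": ["NSS", "NTFS"],
}

# File systems the section advises against, where it names any.
AVOID = {
    "collaboration": ["EXT2", "EXT3", "NWFS", "FAT32"],
    "database": ["FAT32"],
}


def suggest(workload):
    """Return (recommended, avoid) lists for a workload label."""
    key = workload.lower()
    if key not in RECOMMENDED:
        raise KeyError(f"no guidance recorded for workload: {workload!r}")
    return RECOMMENDED[key], AVOID.get(key, [])


print(suggest("Database"))
# (['VxFS', 'ReiserFS', 'EXT3', 'XFS'], ['FAT32'])
```

A table like this is only a starting point; as the section notes for databases, a vendor's tested and supported file system should take precedence.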
Parallel Cluster File systems are relatively new in the market and offer the ability to scale out an application or service (increasing throughput). HOWEVER, it must be well understood that not all applications or services can take advantage of parallel cluster file systems for scale out. Applications and services which have been properly designed can be run simultaneously on 2 or more nodes accessing the same data in a parallel cluster file system. These are parallel cluster enabled. Others which are not parallel cluster enabled can only run on one node at a time in the cluster, even though their data is accessible by all nodes simultaneously. If they attempt to run on more than one node simultaneously, crashes or data corruption may occur. Your application or service vendor should know whether they support this or not. To assist in determining if an application is parallel cluster enabled, the following points are helpful: