Villaggio Informatico | Notes and summaries of Computer Science, Systems and Networks

Files

To manage data automatically using an application program (also called an app), the first thing that comes to mind is organizing the data in the form of a file. A data file, also called an archive, is a collection of many rows, each of which is called a record. Each record is made up of a predetermined set of elementary data, called attributes or fields.

You can think of an archive as a table, where the rows represent the records and the columns represent the attributes. Examples of archives useful for managing data are Excel files or CSV (Comma-Separated Values) files. In CSV files, the fields are separated by a delimiter, typically a comma or a semicolon. In any case, you can use any text file (TXT) whose organization is known to the application program that needs to use that data.
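As a minimal sketch of the table view described above, the following Python snippet reads a small semicolon-separated archive into a list of records. The sample data, the field names (id, name, phone) and the helper read_records are assumptions made up for illustration.

```python
import csv
import io

# Hypothetical sample archive: one header row plus two records,
# with fields separated by a semicolon.
SAMPLE = "id;name;phone\n1;Ada;555-0101\n2;Linus;555-0102\n"

def read_records(text, delimiter=";"):
    """Return the archive as a list of records (dicts keyed by field name)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)

records = read_records(SAMPLE)
for record in records:
    # Each row is a record; each key is an attribute (field).
    print(record["id"], record["name"], record["phone"])
```

The header row is what tells the program how the file is organized; without it, the application would have to know the field order in advance.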

Archive Operations

In general, the following operations must be possible on a single record, according to the aforementioned CRUD (Create, Read, Update, Delete) method:

creating a record;
reading a record;
modifying a record;
deleting a record.

An archive, however, since it is made up of a set of records, must also provide access functionality, that is, the ability to locate a specific record within the file. The logical operations that can usually be performed on an archive are therefore:

creating an empty archive;
accessing an appropriate record;
inserting a record in the appropriate position;
reading a record in the appropriate position;
modifying a certain record;
deleting a certain record;
deleting an archive.
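The operations listed above can be sketched on a CSV archive as follows. This is only one possible arrangement, under several assumptions: the record layout in FIELDS, the function names and the file name contacts.csv are hypothetical; each change rewrites the whole file; and new records are simply appended rather than placed in a specific position.

```python
import csv
import os
import tempfile

FIELDS = ["id", "name", "phone"]  # hypothetical record layout

def create_archive(path):
    """Create an empty archive containing only the header row."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow(FIELDS)

def _load(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def _save(path, records):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)

def insert_record(path, record):
    """Simplification: the new record is appended at the end."""
    records = _load(path)
    records.append(record)
    _save(path, records)

def read_record(path, key):
    """Access step: locate the record whose id equals key, or return None."""
    for record in _load(path):
        if record["id"] == key:
            return record
    return None

def update_record(path, key, **changes):
    records = _load(path)
    for record in records:
        if record["id"] == key:
            record.update(changes)
    _save(path, records)

def delete_record(path, key):
    _save(path, [r for r in _load(path) if r["id"] != key])

def delete_archive(path):
    os.remove(path)

# Usage: create, insert, modify, delete a record, then delete the archive.
path = os.path.join(tempfile.mkdtemp(), "contacts.csv")
create_archive(path)
insert_record(path, {"id": "1", "name": "Ada", "phone": "555-0101"})
insert_record(path, {"id": "2", "name": "Linus", "phone": "555-0102"})
update_record(path, "2", phone="555-0199")
delete_record(path, "1")
remaining = _load(path)
delete_archive(path)
```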

Access Methods

If we focus on the logical organization of a file, we can distinguish three types of organization, which differ in the access method for the various records. The access method determines, essentially, how a record is searched for within the file. The three access methods are sequential, direct, and indexed, as described below.

Sequential Access

Files organized with sequential access require each record to be inserted after the previous one, so to find a certain record you must scan the file from the beginning until you reach it. This is the easiest access method to manage but the least efficient in terms of search time.
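A sequential search can be sketched as below. The function sequential_search and the sample data are hypothetical; the sketch assumes one record per line with the key in the first field, and it also counts how many records were examined to show the cost of the scan.

```python
import io

def sequential_search(stream, key, delimiter=";"):
    """Scan records in order; return (record, number_of_records_examined)."""
    examined = 0
    for line in stream:
        examined += 1
        fields = line.rstrip("\n").split(delimiter)
        if fields[0] == key:
            return fields, examined
    return None, examined  # reached the end without finding the key

data = "1;Ada\n2;Linus\n3;Grace\n"
found, steps = sequential_search(io.StringIO(data), "3")
missing, scanned = sequential_search(io.StringIO(data), "9")
```

In the worst case (the last record, or a key that is not present) every record in the file is read.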

Direct Access

Even in files organized with direct access, records are usually inserted one after the other, but each record is associated with a number indicating its position. In this case, therefore, if you know the position of the record, you can access that record directly without scrolling through the entire file. The problem is that there is generally no correlation between the position of the record and its contents. This correlation must be implemented at the application level, for example with hashing algorithms.
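One common way to obtain direct access, shown here only as a sketch, is to give every record the same length, so that the byte offset of a record is its position multiplied by the record size. The layout (a 4-byte id followed by a 16-byte name) and the helpers write_at and read_at are assumptions chosen for illustration; an in-memory buffer stands in for the file.

```python
import io
import struct

# Hypothetical fixed-length layout: 4-byte unsigned id + 16-byte name.
RECORD = struct.Struct("<I16s")

def write_at(f, position, rec_id, name):
    """Store a record at the given position (0-based)."""
    f.seek(position * RECORD.size)
    f.write(RECORD.pack(rec_id, name.encode("utf-8")))

def read_at(f, position):
    """Jump straight to the record at the given position and decode it."""
    f.seek(position * RECORD.size)
    rec_id, raw = RECORD.unpack(f.read(RECORD.size))
    return rec_id, raw.rstrip(b"\x00").decode("utf-8")

archive = io.BytesIO()  # stands in for a binary file opened for read/write
for pos, (rec_id, name) in enumerate([(10, "Ada"), (20, "Linus"), (30, "Grace")]):
    write_at(archive, pos, rec_id, name)

third = read_at(archive, 2)  # no scan of positions 0 and 1 is needed
```

Note that the caller must already know that the wanted record sits at position 2; nothing in the record's contents says so, which is the limitation described above.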

Indexed Access

Files organized using indexed access are an evolution of direct access files. In indexed access files, a key attribute must be chosen that uniquely identifies each record. Based on this attribute, an index is built and paired with the actual file, which is organized using direct access. For each value of the key attribute, the index gives the position of the entire record in the file. Since the index is much smaller than the file itself, it can be kept ordered, which greatly speeds up the search operation.
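The idea can be sketched as follows, assuming the same kind of fixed-length records used for direct access and an index held in memory as a sorted list of (key, position) pairs. The names build_index and lookup, the record layout and the sample data are hypothetical; real systems typically store the index in its own file and use more elaborate structures.

```python
import bisect
import io
import struct

RECORD = struct.Struct("<I16s")  # hypothetical layout: 4-byte id + 16-byte name

def build_index(f):
    """Scan the file once and return a sorted list of (key, position) pairs."""
    f.seek(0)
    index = []
    position = 0
    while True:
        chunk = f.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break
        rec_id, _ = RECORD.unpack(chunk)
        index.append((rec_id, position))
        position += 1
    index.sort()
    return index

def lookup(f, index, key):
    """Binary-search the index, then jump straight to the record."""
    i = bisect.bisect_left(index, (key, 0))
    if i == len(index) or index[i][0] != key:
        return None
    f.seek(index[i][1] * RECORD.size)
    rec_id, raw = RECORD.unpack(f.read(RECORD.size))
    return rec_id, raw.rstrip(b"\x00").decode("utf-8")

# Records are stored in arrival order, not in key order.
archive = io.BytesIO()
for rec_id, name in [(30, "Grace"), (10, "Ada"), (20, "Linus")]:
    archive.write(RECORD.pack(rec_id, name.encode("utf-8")))

index = build_index(archive)
hit = lookup(archive, index, 20)
miss = lookup(archive, index, 99)
```

The data file itself is never sorted; only the small index is, and each lookup costs one binary search plus one direct read.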

Limitations

Archives are useful only when relatively little data needs to be managed via a single application. In fact, using archives makes it impossible to define logical relationships that link data to each other. Furthermore, although archives are independent of the application that uses them and can therefore be used by multiple applications, it is precisely in this case that all the limitations inherent in using archives become evident.

To summarize, we highlight the following limitations:

Data redundancy;
Data inconsistency;
Poor concurrency management.

The problem of data redundancy arises especially when, for confidentiality reasons, a given application is allowed to see only part of the data. In this case, that part of the data must be duplicated in a separate archive, which obviously wastes memory.

Furthermore, if the data needs to be updated, there is an obvious inconsistency problem: some applications may use outdated information. Similarly, if the archive structure is modified, the changes must be applied to all applications, otherwise inconsistency problems arise (for example, an application accesses the third field of a record assuming it is the phone number, but instead the third field becomes the email address).
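The phone number example can be reproduced with a short sketch. The two archive layouts and the function names are hypothetical: OLD is the structure the application was written for, NEW is the structure after an email field is inserted in third position.

```python
import csv
import io

# Hypothetical archive before and after a structure change.
OLD = "name;city;phone\nAda;London;555-0101\n"
NEW = "name;city;email;phone\nAda;London;ada@example.org;555-0101\n"

def phone_by_position(text):
    """Application that assumes the third field is the phone number."""
    rows = list(csv.reader(io.StringIO(text), delimiter=";"))
    return rows[1][2]  # rows[0] is the header, rows[1] the first record

def phone_by_name(text):
    """Application that looks the field up by its header name instead."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter=";"))
    return rows[0]["phone"]

before = phone_by_position(OLD)   # the phone number, as expected
after = phone_by_position(NEW)    # silently returns the email address
by_name = phone_by_name(NEW)      # still returns the phone number
```

Looking fields up by name softens this particular problem, but every application still has to agree on the field names, so the dependency on the archive structure does not disappear.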

Finally, to keep the archive in a usable state, while one application is accessing it, any other application that wants to access it must wait for the first to finish processing. This is an inefficient mechanism for managing concurrency between processes, especially in the case of archives with large amounts of data.
