Non-Processor IP Overview
Introduction

This page is an overview of existing RACORS non-processor IPs. For each IP the most important features and properties are listed. Some common properties of all RACORS non-processor IPs can be found here. For more detailed information please send an email to info@racors.com.

Most of the IP blocks on this page have been designed as peripherals for RACORS processors and have peripheral interfaces for glueless connection to a 16-bit or 32-bit RACORS peripheral bus. Memory master interfaces are designed to connect to multi-port memory arbiters as described in the SDRAM/DDR-SDRAM section of this page. Blocks with other standard or non-standard peripheral and memory interfaces can be derived on request.

IP Groups
Fast Ethernet MAC
Gigabit Ethernet MAC
Ethernet System IP
SD/SDHC/eMMC Host
SD/SDHC/eMMC System IP
DMA Controllers
SPI-FLASH
SDRAM/DDR-SDRAM Interface
Serial I/O
Display
Caches
Debug Modules
System Integration
MP3 Decoder
MPEG4-ASP Decoder

Fast Ethernet MAC

These MAC modules are optimized for low resource consumption. FPGA implementations are in the range of 500 LEs and below. The PHY management interface is implemented in software via dedicated GPIO signals of the MAC module. Multiple variants exist with different peripheral interfaces (16-bit or 32-bit), PHY interfaces (MII or RGMII) and packet-data interfaces (local memory DMA or external memory DMA).

Common Features
The fast Ethernet MAC modules are fully synchronous designs intended to run at the peripheral bus clock. With an MII PHY interface the receive and transmit clocks are inputs of the MAC module. All input signals driven by the PHY are synchronized to the system clock. Edge detection circuitry is used to synchronize input and output signals of the PHY interface to the receive and transmit clocks. With an RGMII interface the transmit clock is an output signal of the MAC module with 25MHz/2.5MHz for 100/10Mbit Ethernet. To keep the MAC module fully synchronous the system clock must be a multiple of 25MHz, e.g. 100MHz, so that the transmit clock can be derived from the system clock.

The typical use case for variants with local DMA for packet data is a small processor system with no external memory. Packet data are transferred to/from the processor's data RAM via dedicated DMA channels. This is more flexible than using packet buffer RAMs connected to the MAC module. With the local DMA concept, software determines the number and size of receive/transmit packet buffers and where they are located in the processor's data RAM.

Gigabit Ethernet MAC

As with the fast Ethernet MAC modules, the GBit MAC is optimized for low resource consumption. The PHY management interface is implemented in software via dedicated GPIO signals of the MAC module. At the moment a single implementation with a 32-bit peripheral interface and an RGMII PHY interface is available. Other variants can be derived on request.

Features
Due to the special clocks required for the RGMII PHY interface the GBit MAC cannot be fully synchronous. Most of the logic, including the peripheral interface and the memory interfaces, runs from the system clock. A separate 250MHz clock is required for the RGMII transmitter. The RGMII receiver is clocked by the 125MHz receive clock, either directly or phase corrected by a PLL (depending on the phase relation of the RGMII receiver input signals).

Ethernet System IP

These are intelligent IP blocks containing a processor and firmware. An Ethernet MAC, a DMA controller, and instruction and data RAMs are connected to the processor. Local firmware handles Ethernet packet transmission and reception with buffer management, optional packet filtering and local packet generation. With a small processor like the sf16bu the entire block is still rather small, e.g. < 2500 LEs for the examples described below. There are many applications for this type of intelligent IP, which can be used either as an intelligent peripheral of a processor system (example 1) or as an autonomous unit (example 2).

Example 1: Ethernet MAC with debug interface

In this example the IP block is a peripheral of a processor. In addition the block provides the debug interface for this processor (e.g. JTAG, ...). A single Ethernet port is used for network access and as debug interface. A dedicated UDP port is used for debug messages. Firmware on the local processor inside the IP block filters packets addressed to the debug UDP port and directs them to the debug interface. Data received from the debug interface is sent back to the debug Host using the same UDP port. All remaining receive/transmit packets are transferred to the main memory of the system for access by the main processor. The filtering of debug messages is completely transparent to the main processor. With ~2500 LEs + 13kBytes block RAM for 100Mbit Ethernet this block provides a very efficient Network+Debug solution for FPGAs.
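The packet filtering of Example 1 can be illustrated with a small behavioural model. The following is only a sketch in Python, not the firmware of the IP block: the port number `DEBUG_UDP_PORT`, the function name `classify_frame` and the restriction to untagged, unfragmented IPv4 frames are assumptions made for illustration.

```python
import struct

DEBUG_UDP_PORT = 50000   # hypothetical value; the source does not name the port
ETHERTYPE_IPV4 = 0x0800
IPPROTO_UDP = 17
ETH_HEADER = 14          # 6 bytes destination, 6 bytes source, 2 bytes EtherType


def classify_frame(frame: bytes) -> str:
    """Return "debug" for UDP datagrams sent to the debug port, else "main"."""
    if len(frame) < ETH_HEADER + 20:            # too short for an IPv4 header
        return "main"
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_IPV4:
        return "main"
    ihl = (frame[ETH_HEADER] & 0x0F) * 4        # IPv4 header length in bytes
    if ihl < 20 or frame[ETH_HEADER + 9] != IPPROTO_UDP:
        return "main"
    (flags_frag,) = struct.unpack_from("!H", frame, ETH_HEADER + 6)
    if flags_frag & 0x1FFF:                     # later fragments carry no UDP header
        return "main"
    udp = ETH_HEADER + ihl
    if len(frame) < udp + 8:                    # truncated UDP header
        return "main"
    (dst_port,) = struct.unpack_from("!H", frame, udp + 2)
    return "debug" if dst_port == DEBUG_UDP_PORT else "main"
```

Frames classified as "debug" would go to the debug interface, and everything else would be passed on to main memory, so the main processor never sees the debug traffic.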
Example 2: PCM audio data streaming to an audio DAC

This is an example of an autonomous IP block. It contains a processor with firmware, an Ethernet MAC, a DMA controller, and instruction and data RAMs. UDP packets with 1kBytes of PCM payload are sent to the block, which outputs the PCM data as a continuous stream to an audio DAC. A 6-deep receive packet buffer is implemented to enable uninterrupted audio. Depending on the network environment (load, latency of handshake packets) a larger packet buffer may be required. The firmware inside the IP block contains a partial protocol stack up to the UDP layer. To keep the instruction and data RAM small, only protocol functions that are required for the PCM streaming are implemented:
On an FPGA the example IP block takes ~2300 LEs and 13kBytes of block RAM. The same basic concept can be used for streaming or non-streaming data transfers to any standard or custom interface.

SD/SDHC/eMMC Host

These modules provide full SD/SDHC/eMMC Host controller functionality and require little attention from the system processor. Multiple variants are available with different peripheral interfaces (16-bit or 32-bit), different card interfaces (1-bit, 4-bit or 8-bit) and different data interfaces (local DMA or external memory DMA).

Common Features
The existing variants are fully synchronous and have been designed for systems with a 100MHz peripheral bus clock. Typical card clocks of 50MHz, 25MHz and lower can be generated from the 100MHz system clock by integer division. On request, derivatives with more flexible card clock generation can be created, e.g. with a separate 2 x card clock input that is independent of the system clock.

SD/SDHC/eMMC System IP

Intelligent IP blocks for accessing SD/SDHC/eMMC cards can be created by combining a small processor system with an SDHC Host controller and firmware. Data can be transferred directly between user interfaces and files on the SD/SDHC card, which leads to many interesting use cases and applications. Users don't have to be familiar with the SD card command protocol. Depending on the user interface(s) and the variant of the SD/SDHC/eMMC Host controller, total IP block sizes for FPGAs in the range of 2000-3000 LEs plus 10-20 block RAMs can be realized. The maximum read/write data rates that can be achieved depend on the card speed, the RAM available for buffering and the way data is stored on the card (file system or raw, sequential block read/write). With fast SDHC cards, sufficient buffering and sequential read/write, read rates of 20MBytes/s and write rates of 10MBytes/s can be achieved. With eMMC chips significantly higher speeds with read rates >100MBytes/s and write rates >50MBytes/s are possible.

DMA Controllers

Most systems based on RACORS processors have a DMA controller. In MCU-like FPGA systems, DMA to/from the processor's local (on-chip) data RAM is used for data transfers to/from peripherals and for transfers to/from external memory. For FPGAs a generic DMA controller architecture with a configurable number of channels and configurable channel parameters would waste resources. As a consequence many variants exist that are optimized for specific use cases.
For new projects the variant with the best matching features is used as a template and adapted to meet the project-specific requirements.

DMA to/from external memory instead of a data cache

These special purpose DMA channels are used to copy data from the processor's local on-chip data RAM to external memory (typically SDRAM or DDR-SDRAM) and vice versa. They are special not from a DMA controller point of view but from the point of view of the overall system concept: the channels are used as an alternative to a data cache. In I/O intensive systems with a moderate share of control software, using DMA instead of a data cache is in many cases more efficient. Because data from I/O channels is volatile (e.g. network packets) the hit rates of a data cache are low. With DMA, data can be pre-fetched from external memory and then accessed in the processor's local data memory with zero wait states. In a similar way output data is copied to external memory by DMA in the background while the processor works on other tasks. The concept is particularly useful with large data access units such as Ethernet packets, mass-storage data blocks or graphics line buffers. With larger access units the software overhead for setting up the DMA operations is small compared to the performance gain over a data cache solution.

SPI-FLASH

This memory type gets a separate section here because it is so important for FPGAs. Most of the high density FPGA families for digital designs need external non-volatile memory to load the FPGA hardware configuration at power-up time. SPI-FLASH is the most attractive option because of its low pin count and high speed. Besides startup configuration data, other data for different purposes can be stored in the same SPI-FLASH device:
Multiple variants of SPI-FLASH controllers are available that can be used e.g. to access the FPGA configuration device during normal operation. Debuggers for RACORS processors support erasing, programming and reading of the configuration SPI-FLASH via the debug interface.

SDRAM/DDR-SDRAM Interface

This functionality is divided into two separate modules:
Physical Interface Controller

These modules are optimized for high throughput and resource efficiency. Parameters like burst length, CAS latency and address segmentation (page, row, bank-select) are fixed for a particular memory type and use case. For best performance memory banks are kept open as long as possible. Bank precharge commands are issued only in case of a time-out (one timer per bank) or a page miss. The refresh controller uses periods of no memory access for refresh commands as much as possible. Only if there are un-refreshed rows left close to the end of the maximum refresh period do refresh commands take priority over memory access requests.

Multiple variants are available for different memory types (SDRAM or DDR-SDRAM), different memory data widths (16-bit or 32-bit) and sizes (address bits). The memory architecture determines the number of row address bits and the number of column address bits. In the overall address space as seen by the arbiter module the bank select bits (typically 2 bits for 4 banks) are placed in the middle between the page address bits (low bits) and the row address bits (high bits). This enables efficient linear access across page boundaries.

Physical Interface Common Properties
Physical controllers are fully synchronous designs except for a small part of DDR-SDRAM controllers that runs at twice the system clock. The clock output to the memory device(s) requires a separate PLL output with a special (tuned) phase relation to the system clock.

Multi-port Arbiter

Instead of a memory bus a multi-port arbiter is used with separate address and data channels for each client. This concept has the following advantages:
To further illustrate the concept here is an example of an arbiter from an existing FPGA design:
The debug interface in this case has a UART interface to the Host PC. It is given the highest priority because with its low bandwidth and short bursts it cannot negatively impact the available bandwidth or latency of the other clients. The SD card has the next highest priority because of its relatively low bandwidth compared to the remaining clients. Next are the Ethernet clients, followed by the DMA to/from the processor's local data RAM, which uses the left-over bandwidth. The fast clients have burst buffers of 2 x max burst-length size to de-couple the data rates of the memory and the client interfaces. All burst buffers of one direction (read or write) are mapped into a single 1kBytes block RAM, which is more efficient than using local buffers in the clients.

Serial I/O

Following a resource optimized concept for FPGAs, most serial I/O functions are rather small and simple designs. In many cases there is more effort in adapting the interfaces to the target system (DMA, interrupts, etc.) than in developing the actual I/O function. The following paragraphs provide some general properties of and information about the available blocks.

SPI

Probably the most popular of the 'old' serial interfaces because it is used for chip to chip connections and is not a highly standardized device interface. Recently SPI became even more popular again as a low-pin-count alternative to connect NOR-FLASH memories to SoCs. For higher data rates, dual and quad SPI versions with 2 or 4 data lines and clock rates > 100MHz are available. Resource efficient implementations optimized for a particular use case are either master or slave. To support high data rates, receive/transmit FIFOs or DMA are required. Many variants are available that differ in one or more of the following properties:
A special feature implemented in some variants with FIFOs is support for misaligned access to the FIFO. With protocols that contain e.g. a mixture of 8-bit, 16-bit and 32-bit data objects, these objects can be written to and read from the FIFOs with one processor access regardless of their order and alignment in the byte stream.

IIC

Slow but still popular as a low pin count (2-wire) chip to chip interface to access configuration or status registers. IIC is also the most common interface for EEPROMs. In most use cases today the clock line is a master output and slave input and not bidirectional as in the original specification. In many use cases software that emulates the protocol via GPIO pins is sufficient, especially if the interface is used only during initialization to set up configuration parameters. A hardware implementation makes sense e.g. to access EEPROMs in the background with little or no processor load. A special property of EEPROM IIC slaves is that they use the acknowledge mechanism to indicate when a program operation is finished. The master has to implement so-called acknowledge polling to sense when the EEPROM is ready for the next program or read operation.

Besides software implementations, two variants are available, mainly targeted at EEPROM slaves. The first variant is interrupt driven and transmits/receives one byte at a time. The second variant has block RAM based read/write FIFOs to program or read longer byte sequences autonomously. Both variants support acknowledge polling to efficiently handle EEPROM slaves.

UART

Not very popular anymore because of its low speed and because new PCs no longer have a UART interface. Still useful as a debugging interface for processor systems with small memory footprints. In most cases the protocol setting is 1 start bit, 8 data bits, no parity, 1 stop bit. Programmability of these parameters is not really necessary.
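The fixed 8N1 setting named above can be made concrete with a short behavioural model. This is a sketch in Python, not a description of the RACORS hardware; the function names are hypothetical. It relies only on the standard UART convention that the line idles high, the start bit is low, data bits are sent least significant bit first, and the stop bit is high.

```python
def frame_8n1(byte: int) -> list[int]:
    """Line levels for one 8N1 character: start bit, 8 data bits LSB first, stop bit."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in the range 0..255")
    data_bits = [(byte >> i) & 1 for i in range(8)]
    return [0] + data_bits + [1]


def leading_low_bits(byte: int) -> int:
    """Number of consecutive low bit times at the start of the frame."""
    # The stop bit is always 1, so index() always finds a match.
    return frame_8n1(byte).index(1)


def clocks_per_bit(low_pulse_clocks: int, byte: int = 0x80) -> int:
    """Estimate the bit period from the measured length of the first low pulse."""
    return round(low_pulse_clocks / leading_low_bits(byte))
```

Every character occupies 10 bit times on the line. A 0x80 character produces a single low pulse of 8 bit times (the start bit plus seven zero data bits), so dividing the measured pulse length by 8 yields the bit period in clock cycles.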
The low end debug modules for RACORS processor systems have a half-duplex UART interface to connect to the debug Host. A special feature of this UART is the automatic baud rate detection. After a reset the Host sends a 0x80 character. The UART module measures the length of the low period, which is 8 bit times (the start bit plus the seven low-order data bits, which are sent first), sets its baud rate generator accordingly and sends back a 0x47 character to indicate to the Host that the connection was successful.

Some variants of the available UART based debug modules allow the processor to use the UART as a communication peripheral. This is particularly useful in FPGA systems where resource efficiency is most important. A select input from a pin with a jumper or switch determines whether the UART is used as a debugging interface or as a terminal/console interface.

IIS

The most common interface for audio ADCs/DACs. The bandwidth is rather low, e.g. for 44.1kHz stereo in/out with 16-bit samples the bandwidth is 352.8 kBytes/s (176.4 kBytes/s per direction). Most audio ADCs/DACs have an additional I2C interface to access control, configuration and status registers. Multiple variants are available, most of them with small transmit/receive FIFOs and no DMA. The following example describes an IIS block with a 16-bit peripheral interface and simultaneous transmit/receive operation.
The concept is that interrupts are generated when 1 sample-pair is left in the transmit FIFO and when 4 samples are contained in the receive FIFO. This provides enough time for the processor to react and write the next 4 samples into the transmit FIFO or read the next 4 samples from the receive FIFO.

PS/2

For a long time this was the standard interface to connect mice and keyboards to PCs. It was replaced by USB quite a while ago. Still useful to connect mice and/or keyboards to low cost FPGA processor systems with no USB. PS/2 is much smaller than USB and the software driver is much simpler. Some RACORS processor systems have a PS/2 interface to connect a keyboard. Typically the PS/2 function is part of a system integration module that also contains a periodic interrupt timer and an interrupt controller. Because of the low speed the PS/2 function is interrupt driven and very small. The hardware is < 100 LEs and consists only of a clock pre-scaler, a shift register, synchronizers and a few status flags.

IR Remote

Because of the low bit rates almost no hardware resources are required for infrared remote control receivers. The hardware part consists only of an edge detector. Every edge of the input signal generates an interrupt. Software measures the time from edge to edge using a system timer (available anyway for other purposes). The sequences of time intervals are analyzed to determine the codes sent by the remote control.

Display

Display controllers cover a broad range regarding display type, complexity and resource consumption. Solutions for the following display types are available:
LEDs

Can be driven easily by GPIO pins. For larger numbers of LEDs multiplex schemes are used to limit the number of GPIO pins required. The multiplex rate can be low. E.g. a 16-channel mux scheme with 100Hz refresh rate requires switching at 1600Hz, which can be easily implemented with timer interrupts and software. In many cases dedicated hardware is not required. Higher mux rates are required to control the brightness of each LED within an LED matrix individually. This can be done by PWM schemes. Because of the strong non-linearity of brightness vs. average current, high PWM resolution is required. E.g. for a 16-channel mux, 100Hz refresh and 8-bit brightness with 12-bit PWM resolution a switch rate of ~6.5 MHz (16 x 100Hz x 4096 PWM steps) is required. This is too fast for an interrupt and software solution. A hardware block makes more sense.

Monochrome LCD Displays

Display units include a controller/driver chip with an MCU-like bus interface with 4-bit or 8-bit data width and a couple of control signals. Most popular are character based displays with a built-in character generator, but bitmap displays with individually addressable pixels are also available. The controller/driver chip includes the frame buffer RAM and performs the display refresh. The driving processor writes to the frame buffer RAM only to change the display content. The interface is slow and can be emulated with GPIO pins. But because of the slow timing, using waiting loops between signal transitions would take too much processor time. The implemented solution uses a software state machine and timer interrupts to drive the display with minimal processor load. The time between signal transitions is used by other software tasks.

TFT color LCD Displays

The standard for high resolution, high quality displays. They have meanwhile widely replaced CRT displays. The available modules have been developed for particular use cases and displays and have fixed video time base parameters. They differ in the following properties:
Frame buffers are typically mapped into the system's main memory pool, usually external SDRAM or DDR-SDRAM. The controllers have on-chip pre-fetch buffers with two transfer request priorities based on the fullness of the pre-fetch buffer. This is to make the display a nice memory client that generates high priority requests only when the pre-fetch buffer is almost empty and otherwise uses the system's 'left-over' bandwidth to fill its pre-fetch buffer.

Caches

The available modules are designed for RACORS processors and are optimized for particular use cases with fixed parameters regarding cache size and main memory size. Derivatives with other sizes can be generated with little effort. Two examples are given. The first is a read-only instruction cache and the second is a data cache.

Instruction-Cache Example
Data-Cache Example
Debug Modules

These are debug modules for the RACORS processors, which all have the same debug concept. The cores have a debug interface to stop and restart instruction execution and to inject and execute individual instructions. At a minimum the debug modules have interfaces to connect to the processor core and to the debug Host PC. Most of the available variants are for small FPGA processor systems with on-chip instruction and data memories. These variants have a UART interface to connect to the Host PC and an extra interface to the processor's instruction RAM to download program code, to set break points and for single-step instruction execution. Here is an example feature list of a module for sf32 processors:
To read or modify processor registers, data memory locations or peripheral registers, the processor must first be stopped by a command from the debug Host. The actual accesses to memory or registers are then done by instructions injected by the debug Host and executed by the core. Due to the dedicated interface the instruction RAM can be accessed (read and write) at any time, including while the processor is running.

Variants with similar features are available for all RACORS processors. Depending on the target system, variants of UART based debug modules have these additional features:
System Integration

These modules are peripherals of processor systems that contain some frequently used standard functions or functions that are too small and simple to become a separate peripheral module. They can also be considered glue logic of processor systems. Typical functions are:
A periodic interrupt timer is used as OS tick or to schedule other periodic system events. Typical tick intervals are in the range of 1ms to 10ms. The RACORS processors all have 16 interrupts and each interrupt has a separate, software defined start address. The system interrupt controller takes all interrupt signals as inputs and selects the next interrupt to be forwarded to the processor based on a fixed or programmable priority scheme.

MP3 Decoder

This is mainly software IP. Only a bit string reader hardware block is used to accelerate reading of arbitrary length bit strings from the compressed audio stream. The decoders are implemented as hand optimized assembly code for the eco32bl and eco32dl processors. More information can be found here.

MPEG4-ASP Decoder

This IP is a combination of hand optimized assembly code for the eco16il processor and a number of hardware acceleration blocks. More details can be found here.
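As a closing illustration, the interrupt selection described under System Integration can be modelled in a few lines. This is a behavioural sketch in Python under stated assumptions, not the RACORS interrupt controller: the function name `select_interrupt`, the enable mask, and the default rule that the lowest interrupt number has the highest priority are all hypothetical. Only the count of 16 interrupts comes from the text.

```python
NUM_INTERRUPTS = 16  # from the text: RACORS processors have 16 interrupts


def select_interrupt(pending, enabled=0xFFFF, priority_order=range(NUM_INTERRUPTS)):
    """Return the number of the next interrupt to forward, or None if none is active.

    pending and enabled are 16-bit masks with one bit per interrupt line.
    priority_order lists interrupt numbers from highest to lowest priority;
    the default (0 first) models a fixed scheme, any other sequence models
    a programmable one.
    """
    active = pending & enabled & 0xFFFF
    for irq in priority_order:
        if active & (1 << irq):
            return irq
    return None
```

For example, with interrupts 1 and 3 pending, the default order selects interrupt 1, while `priority_order=[3, 1]` selects interrupt 3.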