101: What is Latency?
Latency, n. The delay between the receipt of a stimulus and the response to it.
Network Latency
Whether local area, wide area, or metropolitan, owned or managed, lit or dark, your network is the piece that physically joins your components together, transporting bits from A to B. Networks and their associated components (switches, routers, firewalls and so on) tend to introduce three types of delay between stimuli and responses: serialisation delay, propagation delay, and queuing delay.
Serialisation delay is the time it takes to put a set of bits onto the physical medium (typically fibre optic cable). This is determined entirely by the number of bits and the data rate of the link. For example, to serialise a 1500-byte packet (12,000 bits) on a 1 Gbit/second link takes 12 us (microseconds, or millionths of a second). Bump the link speed up to 10 Gb/s and your serialisation delay drops by an order of magnitude, to 1.2 us.
But serialisation only covers putting the bits into the pipe. Before they can come out, they have to reach the other end, and that’s where propagation delay comes into play. The speed of propagation of light in a fibre is about two-thirds of that in a vacuum, so roughly 200,000 kilometres per second (which is still pretty fast, to be fair!). To put it another way, it takes light in a fibre about half a microsecond to travel 100 metres.
Pause and think about the relative sizes of serialisation and propagation delay for a moment. Over short distances (e.g. in a LAN environment) the former is much larger than the latter, even at 10 Gb/s link speeds. That’s one of the reasons why moving from 1 Gb/s to 10 Gb/s can have a big impact on network latency in LANs. This advantage diminishes significantly with distance, however – on a 100 km fibre, propagation delay is going to be on the order of 500 us, or half a millisecond. At 1 Gb/s link speeds serialisation plus propagation delay for a 1500-byte packet is about 512 us; moving up to 10 Gb/s reduces this to about 501.2 us – hardly a massive improvement.
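The arithmetic above is easy to reproduce. The short sketch below (illustrative only, using the same packet size, link speeds and fibre length as the figures quoted in the text) computes serialisation and propagation delay for a 1500-byte packet over 100 km:

```python
# Back-of-envelope latency calculator for the figures quoted above.
PROPAGATION_SPEED = 200_000_000  # metres/second: roughly 2/3 of c in fibre

def serialisation_delay_us(packet_bytes, link_bps):
    """Time to clock the packet's bits onto the wire, in microseconds."""
    return packet_bytes * 8 / link_bps * 1e6

def propagation_delay_us(distance_m):
    """Time for the signal to traverse the fibre, in microseconds."""
    return distance_m / PROPAGATION_SPEED * 1e6

for link_bps in (1e9, 10e9):  # 1 Gb/s and 10 Gb/s
    total = serialisation_delay_us(1500, link_bps) + propagation_delay_us(100_000)
    print(f"{link_bps / 1e9:.0f} Gb/s over 100 km: {total:.1f} us")
# Prints ~512.0 us and ~501.2 us, matching the figures in the text.
```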
There’s not a lot you can do about propagation latency – it’s a result of physical processes that you can’t change. The only real option is to move your processing closer to the source of the data – basically what’s happening with the move towards co-location. Even this has challenges though – if your trading strategy depends on data from multiple exchanges, where should you co-locate? An interesting recent study from MIT suggests the optimal approach may be a new location somewhere between the two!
The final contribution to network latency is queuing delay. Consider two data sources sending data to a single consumer which has a single network connection. If both send a packet of data at the same time, one of those packets has to be queued at the switch which connects them to the consumer. The length of time for which any packet is queued depends on two factors: the number of packets which are in the queue ahead of it, and the data rate of the output link – this is the other reason why increasing data rates helps reduce network latency, because it reduces queuing delays. Note that there’s a crucial difference between serialisation delay, propagation delay and queuing delay – the first two are deterministic, while the third is variable, depending on how busy the network is.
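To see why queuing delay is variable where the other two are not, consider a minimal sketch of a FIFO output queue, under the simplifying assumption that packets drain at exactly the output link rate:

```python
# Minimal FIFO queue model: the delay a packet sees depends only on how many
# bytes are already queued ahead of it and the data rate of the output link.
def queuing_delay_us(bytes_ahead, link_bps):
    return bytes_ahead * 8 / link_bps * 1e6

# Same packet, same 1 Gb/s link: very different delays depending on load.
for backlog in (0, 1500, 15_000):  # bytes already sitting in the queue
    print(backlog, "bytes ahead ->", queuing_delay_us(backlog, 1e9), "us")
# 0.0 us with an empty queue, 12 us behind one full packet, 120 us behind ten.
```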
Protocol Latency
Network links just provide dumb pipes for getting bits from A to B. In order to bring some order to these bits, various layers of protocols are used to deal with things like where the bits should go, how to get them in the right order, how to handle losses, and so on. In the vast majority of cases these protocols were designed with the goal of ensuring the smooth and reliable flow of data between endpoints, and not with minimising the latency of that flow, so they can and do introduce delays through a variety of mechanisms.
The first such mechanism is protocol data overhead. Each protocol layer adds a number of bytes to the packet to carry management information between the two endpoints. In a TCP/IP connection (the most common type of connection) this overhead adds 40 bytes to each packet. This is additional data that is subject to the serialisation delay discussed above. The relevance of this overhead is very much dependent on the size of data packets – for example, if data is being sent in 40-byte chunks, the TCP/IP overhead will double the serialisation delay for each chunk, whereas if it’s being sent in 1500-byte chunks, TCP/IP will only increase serialisation delay by around 3%.
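For a quick sense of scale, the arithmetic behind those percentages is shown below (a sketch that considers only the 40-byte TCP/IP header and ignores the Ethernet framing underneath it):

```python
# Header overhead as a fraction of the application payload on the wire.
TCP_IP_HEADER = 40  # bytes: IPv4 + TCP with no options

for payload in (40, 1500):
    overhead_pct = TCP_IP_HEADER / payload * 100
    print(f"{payload}-byte payload: header adds {overhead_pct:.0f}% extra serialisation time")
# 40-byte payload: header adds 100% (the packet doubles in size)
# 1500-byte payload: header adds ~3%
```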
It is possible to enable header compression on most modern network devices. This can have a significant impact on packet sizes, reducing the 40-byte TCP/IP header down to 2-4 bytes – for small data payloads this might halve the packet size. However, in latency terms this is unlikely to have much impact as the reduction in serialisation delay would be offset by the time taken to compress and decompress the header at each end of the link.
Clearly, then, if you have a lot of data to send it’s preferable to send it in one large packet rather than lots of small ones. However, it obviously doesn’t make sense, from a latency perspective, to delay sending one piece of data until you have more available, just so you can fill a larger packet. Unfortunately, that’s exactly what TCP does in some configurations – using a process called Nagle’s algorithm, a TCP sender will delay sending data until it has enough to fill the largest packet size, as long as it still has some data waiting to be acknowledged by the receiver. Thankfully, for those with latency-sensitive applications, this option can be turned off in most implementations.
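In the standard sockets API the switch in question is the TCP_NODELAY option, which disables Nagle’s algorithm on a per-connection basis. A minimal sketch is shown below; the host, port and payload are placeholders, not a real venue:

```python
import socket

# Open a TCP connection and disable Nagle's algorithm so that small writes
# are pushed out immediately rather than being held back and coalesced.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
sock.connect(("gateway.example.com", 9001))  # placeholder endpoint

sock.sendall(b"NEW_ORDER ...")  # illustrative payload; sent without waiting for more data
sock.close()
```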
Some of the data that contributes to protocol overhead is used to implement TCP’s flow and congestion control. In order to prevent networks becoming congested with data that can’t be delivered to a destination, each TCP connection has a data ‘window’ which is the number of bytes that the sender is allowed to transmit before it must wait for permission from the receiver to send more. This permission flows back from the receiver to the sender in the form of acknowledgements, or ACKs, which are typically sent when a packet is correctly received. In a well-functioning network, with constant data flow in both directions, this mechanism works extremely well.
There are circumstances, however, where it can introduce problems. Imagine a window size set to 15,000 bytes, and a data producer sending 1,500 byte packets. If the producer is able to send ten packets before the consumer receives the first packet (that is, if the ‘pipe’ can accommodate more than ten packets at a time), then it will have to stop after the tenth and wait until it gets permission to go again. This can add significant latency to some packets. In order to avoid this ‘window exhaustion’ it is necessary to configure the parameters of the TCP stack to align with the network connections – specifically, the window size has to be greater than the delay-bandwidth product, the number you get when you multiply the one-way propagation delay by the data rate.
As an example, a 100 Mb/s link from New York to Chicago with a one-way latency of 12 ms (milliseconds) requires a TCP window size of 1.2 Mbits, or 150 KB. If you had a connection like this, and were using a more standard window size of 64 KB, then your data transfer could stall due to window exhaustion every 5 ms (time taken to send 64 KB at 100Mb/s), with each stall introducing 12 ms of additional latency.
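The sizing rule is straightforward to apply. The sketch below reproduces the New York to Chicago arithmetic and then requests socket buffers of at least that size; whether and how the OS honours the request is platform-dependent, so treat it as illustrative rather than a guaranteed tuning recipe:

```python
import socket

# Delay-bandwidth product: the window needed to keep the pipe full.
link_bps = 100e6          # 100 Mb/s
one_way_delay_s = 0.012   # 12 ms, New York to Chicago
bdp_bytes = int(link_bps * one_way_delay_s / 8)
print(f"Required window: {bdp_bytes} bytes")  # 150,000 bytes, the ~150 KB quoted above

# Request send/receive buffers at least that large; the kernel may round or cap them.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
```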
The final way in which protocols can introduce latency is through packet loss. As mentioned earlier, there are occasions when packets need to be queued at switches or routers before they can be forwarded. Physically, this means holding a copy of the packet in memory somewhere. Since network devices have finite memory, there is a limit to the number of packets which can be queued. If a packet arrives and there is no more space available, it will be discarded. In this situation, TCP will eventually detect the missing packet at the receiver, and a re-transmission will occur; however, the delay between original transmission and re-transmission is likely to be on the order of at least three round trip times (RTTs), or six times the one-way propagation delay. As a result, packet loss can be one of the biggest contributors to network latency, albeit on a sporadic basis.
In a network which you own and control the likelihood of packet loss can be minimised by ensuring appropriate capacity is provisioned on all links and switches. If your network includes managed components, especially in a WAN, this is much more difficult to achieve, although your service provider will likely provide some SLAs on packet loss.
It’s worth noting that all of the preceding discussion is focussed on TCP, the Transmission Control Protocol. This protocol was designed to ensure guaranteed delivery of data between applications, rather than timely delivery. Many trading applications use UDP (the User Datagram Protocol) as an alternative to TCP. UDP does not guarantee delivery, so it doesn’t use any of the windowing or retransmission discussed above. As a result, UDP typically introduces less latency, but it is subject to unrecoverable packet loss – if data is discarded due to queuing, there is no way for it to be re-transmitted, or for the sender to be aware that this has happened.
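For comparison, this is what the fire-and-forget nature of UDP looks like at the socket level (a sketch; the destination address and payload are placeholders). There is no connection, no window and no ACK, so if the datagram is dropped in a queue somewhere, neither side will know:

```python
import socket

# UDP sender: one datagram per call, nothing buffered back, nothing retransmitted.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(b"market-data-update", ("feed-consumer.example.com", 5000))  # placeholder destination
sock.close()
```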
Operating System (OS) Latency
When you’re deploying a trading application, the code has to run on something. That something is typically a set of servers, and those servers have operating systems that sit between your code and the hardware. These OSes are typically designed to provide functionality which makes it easy to run multiple applications on almost any type of hardware; in other words, like TCP, they’re optimised for flexibility and resilience rather than speed. As a result, they can introduce latency through a number of mechanisms.
All modern operating systems are multi-tasking, meaning they can do multiple things ‘simultaneously’. Since servers generally have more things to do than CPUs to do them on, this generally means that any running code can be suspended by the OS to allow another piece of code to run. This pre-emptive scheduling can introduce variable delays into code. Note that this can be a problem even if your server is only running one application. The reason is that the application still depends on various inputs and outputs (to users via a keyboard/monitor, to databases, to other servers via the network), and the OS has to make sure the drivers that manage all that I/O get some CPU time. In addition, the OS occasionally has to do some housekeeping work, and in some cases this work is non-preemptable, which locks your application out of running until the OS has finished. All told, these types of OS operations can add tens of milliseconds of latency to transactions being processed by your application. And, to make matters worse, this latency can be highly intermittent, making it very difficult to address.
There are some things you can do to alleviate the problem of OS latency, but most of them are OS-specific – for example, in Windows processes can be set to Realtime priority, or in Linux the kernel can be compiled to allow pre-emption of OS tasks. Neither of these approaches will completely remove OS latency, but they can reduce it to the millisecond or sub-millisecond region, which may be acceptable for your application. To get beyond this really requires the use of a real-time operating system (RTOS) such as QNX or RTLinux – these are more typically found in embedded systems and are beyond the scope of this article.
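To make that concrete, two common Linux-side mitigations are pinning a latency-critical process to a dedicated CPU core and running it under a real-time scheduling policy. The sketch below uses only the Python standard library, but the calls are Linux-specific, SCHED_FIFO normally requires root or CAP_SYS_NICE, and the choice of core 3 and priority 50 are assumptions about your hardware and workload:

```python
import os

# Pin this process to CPU core 3 so the scheduler cannot migrate it
# (assumes core 3 exists and has been kept free of other work).
os.sched_setaffinity(0, {3})

# Ask for the real-time FIFO policy so ordinary processes cannot pre-empt us.
# Requires elevated privileges; priority 50 is an arbitrary mid-range choice.
try:
    os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(50))
except PermissionError:
    print("Need root or CAP_SYS_NICE for SCHED_FIFO; continuing with the default policy")
```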
Many trading applications are written in programming languages that are executed in a runtime environment that creates yet another layer of complexity between the code and OS. These runtime environments include, for example, Java Virtual Machines (JVMs) for code written in Java, and the Windows .NET Common Language Runtime (CLR) for code written in C#. These environments are designed to make code more stable and secure by, among other things, eliminating common programming problems. One of the most common functions of these runtime environments is automatic memory management, or garbage collection. This refers to a mechanism whereby the runtime environment monitors the memory being used by an application and periodically tidies it up, reclaiming any memory which the application no longer needs. While this process improves program stability (memory management being one of the most common categories of coding defects), it has a latency cost because the application must be temporarily suspended while garbage collection takes place.
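The effect is easy to observe even outside the JVM or CLR. Python’s cyclic garbage collector works quite differently from a JVM or CLR collector, but the sketch below gives a feel for the kind of pause a full collection pass can introduce once an application is holding a large number of live objects:

```python
import gc
import time

# Build a large live object graph so the collector has real work to do.
data = [{"id": i, "payload": [i] * 10} for i in range(300_000)]

gc.collect()                       # start from a clean slate
start = time.perf_counter()
gc.collect()                       # force a full collection pass over the live objects
pause_ms = (time.perf_counter() - start) * 1000
print(f"Full collection pause: {pause_ms:.1f} ms")  # the pause grows with the number of live objects
```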
Some work has been done in these environments to minimise the impact of garbage collection, and the Java community has gone as far as creating a separate Real-Time Specification for Java (RTSJ). However, all other things being equal, code that is running in a managed environment will tend to incur more latency problems than code that is directly under OS control. In this case you need to make a trade-off between the improved stability (and possibly faster development cycles) provided by Java/C# and the improved latency of something like C/C++.
Application Latency
Phew – all that latency already in the system and we haven’t even talked about your application yet! Thankfully, application latency is one piece of the equation that’s mostly within your own control, and it also tends to be introduced through a small number of mechanisms.
One common theme in IT systems is that things slow down by at least an order of magnitude when you have to access disks rather than memory – database access is a prime example of this. The reason is simply down to the mechanical nature of disks, as opposed to electronic memory. Designing applications to minimise disk access is therefore a common pattern in low-latency systems – in fact, most high-frequency or flow-based applications will have no databases in the main flow, deferring all data persistence to post-execution. This also tends to mean that applications are very memory hungry – data has to be stored somewhere while it’s being processed, and you don’t want it hitting the disk. Where database access is required and latency has to be minimised many developers are now turning to in-memory databases which, as the name suggests, store all of their data in memory rather than on disk. The increasing penetration of solid-state disks (SSDs – devices which appear to the computer to be a standard magnetic disk, but use non-volatile solid-state memory for storage) provides a possible compromise between in-memory DBs and standard DBs using magnetic disks.
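The gap is easy to demonstrate with SQLite, which can run either entirely in memory or against a file on disk. The sketch below is purely illustrative (absolute numbers depend heavily on hardware, filesystem and journaling settings), but committing every insert to force the write path usually makes the order-of-magnitude difference obvious:

```python
import os
import sqlite3
import tempfile
import time

def avg_insert_us(conn, n=1_000):
    """Average per-insert time in microseconds, committing each insert."""
    conn.execute("CREATE TABLE trades (id INTEGER, px REAL)")
    start = time.perf_counter()
    for i in range(n):
        conn.execute("INSERT INTO trades VALUES (?, ?)", (i, 100.0 + i))
        conn.commit()  # commit per insert to force each write all the way through
    return (time.perf_counter() - start) / n * 1e6

mem_us = avg_insert_us(sqlite3.connect(":memory:"))                      # in-memory database
disk_us = avg_insert_us(sqlite3.connect(os.path.join(tempfile.mkdtemp(), "trades.db")))  # file-backed database
print(f"in-memory: {mem_us:.0f} us/insert, on-disk: {disk_us:.0f} us/insert")
```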
Inter-process communication (IPC) is another area that can have a substantial impact on application performance. Typically, trading applications have multiple components (market data acquisition, pricing engines, risk, order routers, market gateways and many more) and data has to be passed between them. When the processes concerned are on the same server this can be a relatively efficient exchange; when (as is often the case) they are on different servers, then the communication can incur significant latency penalties, as it hits all the OS and protocol overheads discussed previously. Remote Direct Memory Access (RDMA) is a combined hardware/software approach that bypasses all of the OS-related latency penalties by allowing the sending process to write data directly into the memory space of the destination process.
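RDMA itself requires specialised network adapters and libraries, but the underlying idea of handing data to the consumer by writing into memory it can read directly, rather than pushing it through the network stack, can be illustrated on a single host with plain shared memory. The sketch below is not RDMA; it simply shows low-overhead local IPC using Python’s standard library, with the segment name and fixed-point price encoding chosen arbitrarily for the example:

```python
from multiprocessing import shared_memory

# Producer: create a named shared-memory segment and write a price update into it.
segment = shared_memory.SharedMemory(create=True, size=64, name="px_feed")
segment.buf[:8] = (1_012_500).to_bytes(8, "little")  # 101.25 stored as fixed-point (x 10,000)

# Consumer (normally a separate process): attach to the same segment and read it,
# with no sockets, no protocol headers and no kernel network stack in the path.
view = shared_memory.SharedMemory(name="px_feed")
price = int.from_bytes(view.buf[:8], "little") / 10_000
print(price)  # 101.25

view.close()
segment.close()
segment.unlink()  # remove the segment once both sides are finished with it
```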