Net driver analysis

Linux/Linux Kernel : 2014. 11. 19. 16:42

Source:

http://www.embeddedlinux.org.cn/linux_net/0596002556/understandlni-CHP-10-SECT-7.html

http://www.embeddedlinux.org.cn/linux_net/0596002556/understandlni-CHP-10-SECT-5.html#understandlni-CHP-10-SECT-5.2


10.7. Processing the NET_RX_SOFTIRQ: net_rx_action

net_rx_action is the bottom-half function used to process incoming frames. Its execution is triggered whenever a driver notifies the kernel about the presence of input frames. Figure 10-5 shows the flow of control through the function.

Frames can wait in two places for net_rx_action to process them:


A shared CPU-specific queue

Non-NAPI devices' interrupt handlers, which call netif_rx, place frames into the softnet_data->input_pkt_queue of the CPU on which the interrupt handlers run.


Device memory

The poll method used by NAPI drivers extracts frames directly from the device (or the device driver receive rings).

The section "Old Versus New Driver Interfaces" showed how the kernel is notified about the need to run net_rx_action in both cases.

Figure 10-5. net_rx_action function








The job of net_rx_action is pretty simple: to browse the poll_list list of devices that have something in their ingress queue and invoke for each one the associated poll virtual function until one of the following conditions is met:

  • There are no more devices in the list.

  • net_rx_action has run for too long and therefore it is supposed to release the CPU so that it does not become a CPU hog.

  • The number of frames already dequeued and processed has reached a given upper bound (budget). budget is initialized at the beginning of the function to netdev_max_backlog, which is defined in net/core/dev.c as 300.

As we will see in the next section, net_rx_action calls the driver's poll virtual function and depends partly on this function to obey these constraints.

The size of the queue, as we saw in the section "Managing Queues and Scheduling the Bottom Half," is restricted to the value of netdev_max_backlog. This value is considered the budget for net_rx_action. However, because net_rx_action runs with interrupts enabled, new frames could be added to a device's input queue while net_rx_action is running. Thus, the number of available frames could become greater than budget, and net_rx_action has to take action to make sure it does not run too long in such cases.

Let's now look in detail at what net_rx_action does:

static void net_rx_action(struct softirq_action *h)
{
    struct softnet_data *queue = &__get_cpu_var(softnet_data);
    unsigned long start_time = jiffies;
    int budget = netdev_max_backlog;

    local_irq_disable( );


If the current device has not yet used its entire quota, it is given a chance to dequeue buffers from its queue with the poll virtual function:

    while (!list_empty(&queue->poll_list)) {
        struct net_device *dev;

        if (budget <= 0 || jiffies - start_time > 1)
            goto softnet_break;

        local_irq_enable( );

        dev = list_entry(queue->poll_list.next, struct net_device, poll_list);


If dev->poll returns because the device quota was not large enough to dequeue all the buffers in the ingress queue (in which case, the return value is nonzero), the device is moved to the end of poll_list:

        if (dev->quota <= 0 || dev->poll(dev, &budget)) {
            local_irq_disable( );
            list_del(&dev->poll_list);
            list_add_tail(&dev->poll_list, &queue->poll_list);
            if (dev->quota < 0)
                dev->quota += dev->weight;
            else
                dev->quota = dev->weight;
        } else {


When instead poll manages to empty the device ingress queue, net_rx_action does not remove the device from poll_list: poll is supposed to take care of it with a call to netif_rx_complete (__netif_rx_complete can also be called if IRQs are disabled on the local CPU). This will be illustrated in the process_backlog function in the next section.

Furthermore, note that budget was passed by reference to the poll virtual function; this is because that function returns a new budget that reflects the frames it processed. The main loop in net_rx_action checks budget at each pass so that the overall limit is not exceeded. In other words, budget allows net_rx_action and the poll function to cooperate to stay within their limit.

            dev_put(dev);
            local_irq_disable( );
        }
    }
out:
    local_irq_enable( );
    return;


This last piece of code is executed when net_rx_action is forced to return while buffers are still left in the ingress queue. In this case, the NET_RX_SOFTIRQ softirq is scheduled again for execution so that net_rx_action will be invoked later and will take care of the remaining buffers:

softnet_break:
    __get_cpu_var(netdev_rx_stat).time_squeeze++;
    __raise_softirq_irqoff(NET_RX_SOFTIRQ);
    goto out;
}


Note that net_rx_action disables interrupts with local_irq_disable only while manipulating the poll_list list of devices to poll (i.e., when accessing its softnet_data structure instance). The netpoll_poll_lock and netpoll_poll_unlock calls, used by the NETPOLL feature, have been omitted. If you can access the kernel source code, see net_rx_action in net/core/dev.c for details.
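The budget-by-reference cooperation described above can be illustrated with a small standalone sketch. This is plain userspace C, not kernel code: fake_poll and its integer "queue" are invented for illustration, but the contract mirrors the one between net_rx_action and dev->poll.

```c
#include <assert.h>

/* Userspace sketch (invented names, not kernel symbols): like
 * dev->poll, fake_poll() receives the caller's leftover budget by
 * reference, does at most that much work, and decrements it, so the
 * caller can track how much of the overall limit remains. */
static int fake_poll(int *backlog, int *budget)
{
    int work = 0;

    while (*backlog > 0 && work < *budget) {
        (*backlog)--;              /* "process" one frame */
        work++;
    }
    *budget -= work;               /* report the work back to the caller */
    return *backlog > 0 ? -1 : 0;  /* nonzero: queue not emptied */
}
```

With a backlog of 10 frames and a budget of 300, fake_poll returns 0 (queue emptied) and leaves the budget at 290; with a backlog of 500 it returns nonzero and consumes the entire budget, the situation in which the real net_rx_action would reschedule the device and eventually jump to softnet_break.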

10.7.1. Backlog Processing: The process_backlog Poll Virtual Function

The poll virtual function of the net_device data structure, which is executed by net_rx_action to process the backlog queue of a device, is initialized by default to process_backlog in net_dev_init for those devices not using NAPI.

As of kernel 2.6.12, only a few device drivers use NAPI and initialize dev->poll with a pointer to a function of their own: the Broadcom Tigon3 Ethernet driver in drivers/net/tg3.c was the first one to adopt NAPI and is a good example to look at. In this section, we will analyze the default handler process_backlog defined in net/core/dev.c. Its implementation is very similar to that of a poll method of a device driver using NAPI (you can, for instance, compare process_backlog to tg3_poll).

However, since process_backlog can take care of a bunch of devices sharing the same ingress queue, there is one important difference to take into account. When process_backlog runs, hardware interrupts are enabled, so the function could be preempted. For this reason, accesses to the softnet_data structure are always protected by disabling interrupts on the local CPU with local_irq_disable, especially the calls to __skb_dequeue. This lock is not needed by a device driver using NAPI:[*] when its poll method is invoked, hardware interrupts are disabled for the device. Moreover, each device has its own queue.

[*] Because each CPU has its own instance of softnet_data, there is no need for extra locking to take care of SMP.

Let's see the main parts of process_backlog. Figure 10-6 shows its flowchart.

The function starts with a few initializations:

static int process_backlog(struct net_device *backlog_dev, int *budget)
{
    int work = 0;
    int quota = min(backlog_dev->quota, *budget);
    struct softnet_data *queue = &__get_cpu_var(softnet_data);
    unsigned long start_time = jiffies;


Then begins the main loop, which tries to dequeue all the buffers in the input queue and is interrupted only if one of the following conditions is met:

  • The queue becomes empty.

  • The device's quota has been used up.

  • The function has been running for too long.

The last two conditions are similar to the ones that constrain net_rx_action. Because process_backlog is called within a loop in net_rx_action, the latter can respect its constraints only if process_backlog cooperates. For this reason, net_rx_action passes its leftover budget to process_backlog, and the latter sets its quota to the minimum of that input parameter (budget) and its own quota.

budget is initialized by net_rx_action to 300 when it starts. The default value for dev->quota is 64 (and most devices stick with the default). Let's examine a case where several devices have full queues. The first four devices to run within this function receive a value of budget greater than their internal quota of 64, and can empty their queues. The next device may have to stop after draining only part of its queue. That is, the number of buffers dequeued by process_backlog depends both on the device configuration (dev->quota), and on the traffic load on the other devices (budget). This ensures some more fairness among the devices.
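This fairness scenario can be simulated with a few lines of userspace C (a sketch with invented names, not kernel code), using the values from the text: a budget of 300, a per-device quota of 64, and six devices with full queues of 64 buffers each.

```c
#include <assert.h>

/* Userspace sketch of one net_rx_action pass over `ndev` queues.
 * DEV_QUOTA stands in for dev->quota; run_round() returns the
 * leftover budget. Names and structure are illustrative. */
enum { DEV_QUOTA = 64 };

static int run_round(int budget, int *queue, int ndev)
{
    for (int i = 0; i < ndev && budget > 0; i++) {
        /* process_backlog works within min(dev->quota, *budget) */
        int quota = DEV_QUOTA < budget ? DEV_QUOTA : budget;
        int work  = queue[i] < quota ? queue[i] : quota;

        queue[i] -= work;
        budget   -= work;
    }
    return budget;
}
```

With six full queues of 64 buffers, the first four devices empty their queues (4 * 64 = 256 frames), the fifth drains only 44 of its 64 buffers before the budget runs out, and the sixth is not polled at all in this round, matching the description above.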

Figure 10-6. process_backlog function


    for (;;) {
        struct sk_buff *skb;
        struct net_device *dev;

        local_irq_disable( );
        skb = __skb_dequeue(&queue->input_pkt_queue);
        if (!skb)
            goto job_done;
        local_irq_enable( );

        dev = skb->dev;

        netif_receive_skb(skb);

        dev_put(dev);

        work++;
        if (work >= quota || jiffies - start_time > 1)
            break;
    }


netif_receive_skb is the function that processes the frame; it is described in the next section. It is used by all poll virtual functions, both NAPI and non-NAPI.

The device's quota is updated based on the number of buffers successfully dequeued. As explained earlier, the input parameter budget is also updated because it is needed by net_rx_action to keep track of how much work it can continue to do:

    backlog_dev->quota -= work;
    *budget -= work;
    return -1;


The main loop shown earlier jumps to the label job_done if the input queue is emptied. If the function reaches this point, the throttle state can be cleared (if it was set) and the device can be removed from poll_list. The _ _LINK_STATE_RX_SCHED flag is also cleared since the device does not have anything in the input queue and therefore it does not need to be scheduled for backlog processing.

job_done:
    backlog_dev->quota -= work;
    *budget -= work;

    list_del(&backlog_dev->poll_list);
    smp_mb__before_clear_bit( );
    netif_poll_enable(backlog_dev);

    if (queue->throttle)
        queue->throttle = 0;
    local_irq_enable( );
    return 0;
}


Actually, there is another difference between process_backlog and a NAPI driver's poll method. Let's return to drivers/net/tg3.c as an example:

    if (done) {
        spin_lock_irqsave(&tp->lock, flags);
        __netif_rx_complete(netdev);
        tg3_restart_ints(tp);
        spin_unlock_irqrestore(&tp->lock, flags);
    }


done here is the counterpart of job_done in process_backlog, with the same meaning that the queue is empty. At this point, in the NAPI driver, the __netif_rx_complete function (defined in the same file) removes the device from the poll_list list, a task that process_backlog does directly. Finally, the NAPI driver re-enables interrupts for the device. As we anticipated at the beginning of the section, process_backlog runs with interrupts enabled.

10.7.2. Ingress Frame Processing

As mentioned in the previous section, netif_receive_skb is the helper function used by the poll virtual function to process ingress frames. It is illustrated in Figure 10-7.

Multiple protocols are allowed by both L2 and L3. Each device driver is associated with a specific hardware type (e.g., Ethernet), so it is easy for it to interpret the L2 header and extract the information that tells it which L3 protocol is being used, if any (see Chapter 13). When net_rx_action is invoked, the L3 protocol identifier has already been extracted from the L2 header and stored into skb->protocol by the device driver.

The three main tasks of netif_receive_skb are:

  • Passing a copy of the frame to each protocol tap, if any are running

  • Passing a copy of the frame to the L3 protocol handler associated with skb->protocol[*]

    [*] See Chapter 13 for more details on protocol handlers.

  • Taking care of those features that need to be handled at this layer, notably bridging (which is described in Part IV)

If no protocol handler is associated with skb->protocol and none of the features handled in netif_receive_skb (such as bridging) consumes the frame, it is dropped because the kernel doesn't know how to process it.
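The handler lookup and the drop-if-unhandled behavior can be modeled with a simplified userspace sketch. This is not the kernel's actual ptype_base hash table or handler signature; the table, the stub handler, and deliver() are invented for illustration (only the ETH_P_IP value, 0x0800, is the real Ethernet protocol id for IPv4).

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of the dispatch netif_receive_skb performs: look up
 * the handler registered for the L3 protocol id that the driver stored
 * in skb->protocol, and report a drop when none matches. */
#define ETH_P_IP 0x0800   /* real Ethernet protocol id for IPv4 */

struct handler {
    unsigned short protocol;
    int (*func)(void);    /* stand-in for the real handler signature */
};

static int ip_rcv_stub(void) { return 1; }  /* pretend "consumed" */

static struct handler handlers[] = {
    { ETH_P_IP, ip_rcv_stub },
};

/* Returns 1 if a handler consumed the frame, 0 if it must be dropped. */
static int deliver(unsigned short protocol)
{
    for (size_t i = 0; i < sizeof(handlers) / sizeof(handlers[0]); i++)
        if (handlers[i].protocol == protocol)
            return handlers[i].func();
    return 0;  /* no handler registered: the kernel drops the frame */
}
```

A frame carrying an unregistered protocol id falls through the table and is dropped, which is exactly the fate described above for frames the kernel doesn't know how to process.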

Before delivering an input frame to these protocol handlers, netif_receive_skb must handle a few features that can change the destiny of the frame.

Figure 10-7. The netif_receive_skb function


Bonding allows a group of interfaces to be treated as a single interface. If the interface from which the frame was received belongs to such a group, the reference to the receiving interface in the sk_buff data structure must be changed to the master device of the group before netif_receive_skb delivers the packet to the L3 handler. This is the purpose of skb_bond.

         skb_bond(skb);


The delivery of the frame to the sniffers and protocol handlers is covered in detail in Chapter 13.

Once all of the protocol sniffers have received their copy of the packet, and before the real protocol handler is given its copy, Diverter, ingress Traffic Control, and bridging features must be handled (see the next section).

When neither the bridging code nor the ingress Traffic Control code consumes the frame, the latter is passed to the L3 protocol handlers (usually there is only one handler per protocol, but multiple ones can be registered). In older kernel versions, this was the only processing needed. The more the kernel network stack was enhanced and the more features that were added (in this layer and in others), the more complex the path of a packet through the network stack became.

At this point, the reception part is complete and it will be up to the L3 protocol handlers to decide what to do with the packets:

  • Deliver them to a recipient (application) running in the receiving workstation.

  • Drop them (for instance, during a failed sanity check).

  • Forward them.

The last choice is common for routers, but not for single-interface workstations. Parts V and VI cover L3 behavior in detail.

The kernel determines from the destination L3 address whether the packet is addressed to its local system. I will postpone a discussion of this process until Part VII; let's take it for granted for the moment that somehow the packet will be delivered to the above layers (i.e., TCP, UDP, ICMP, etc.) if it is addressed to the local system, and to ip_forward otherwise (see Figure 9-2 in Chapter 9).

This finishes our long discussion of how frame reception works. The next chapter describes how frames are transmitted. This second path includes both frames generated locally and received frames that need to be forwarded.

10.7.2.1. Handling special features

netif_receive_skb checks whether any Netpoll client would like to consume the frame.

Traffic Control has always been used to implement QoS on the egress path. However, with recent releases of the kernel, you can configure filters and actions on ingress traffic, too. Based on such a configuration, ing_filter may decide that the input buffer is to be dropped or that it will be processed further somewhere else (i.e., the frame is consumed).

Diverter allows the kernel to change the L2 destination address of frames originally addressed to other hosts so that the frames can be diverted to the local host. There are many possible uses for this feature, as discussed at http://diverter.sourceforge.net. The kernel can be configured to determine the criteria used by Diverter to decide whether to divert a frame. Common criteria used for Diverter include:

  • All IP packets (regardless of L4 protocol)

  • All TCP packets

  • TCP packets with specific port numbers

  • All UDP packets

  • UDP packets with specific port numbers

The call to handle_diverter decides whether to change the destination MAC address. In addition to the change to the destination MAC address, skb->pkt_type must be changed to PACKET_HOST.

Yet another L2 feature could influence the destiny of the frame: Bridging. Bridging, the L2 counterpart of L3 routing, is addressed in Part IV. Each net_device data structure has a pointer to a data structure of type net_bridge_port that is used to store the extra information needed to represent a bridge port. Its value is NULL when the interface has not enabled bridging. When a port is configured as a bridge port, the kernel looks only at L2 headers. The only L3 information the kernel uses in this situation is information pertaining to firewalling.

Since netif_receive_skb represents the boundary between device drivers and the L3 protocol handlers, it is right in this function that the Bridging feature must be handled. When the kernel has support for bridging, handle_bridge is initialized to a function that checks whether the frame is to be handed to the bridging code. When the frame is handed to the bridging code and the latter consumes it, handle_bridge returns 1. In all other cases, handle_bridge returns 0 and netif_receive_skb will continue processing the frame skb.

if (handle_bridge(skb, &pt_prev, &ret))
    goto out;


Posted by Real_G